In keeping with our custom of pre-populating the demo with a working example, when the user opens the vector arithmetic portion of the display, it should be pre-loaded with the king - man + woman = queen example. But we should keep the grayed prompt words beneath the slots in case the user wants to try their own example and needs reminding of how it works.
I was looking into the expandable displays, which should be possible using the relatively new `<details>`/`<summary>` HTML tags. It is possible to add animations using pure CSS too. Also, I am looking at using placeholder text instead of example text below the slots, which should still be intuitive.
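For reference, a minimal sketch of the `<details>`/`<summary>` approach, built via JS so the examples stay in one language (the container id and panel markup are placeholders, not the demo's actual structure):

```js
// Sketch: wrap the vector-arithmetic panel in a native <details> element
// so it starts collapsed; no custom accordion JS is needed.
const panel = document.createElement('details');
panel.innerHTML = `
  <summary>Vector arithmetic</summary>
  <input placeholder="king"> - <input placeholder="man"> +
  <input placeholder="woman"> = <output></output>`;
document.getElementById('demo').appendChild(panel);  // 'demo' is a placeholder id
```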
I was having some issues with HTML forms and getting my vector arithmetic to work correctly, but it works now. The CSS will be handled soon. Btw, the word that matches is often the original word even after subtracting and adding two words. Since it is uninteresting for a word to be most similar to itself, I exclude it.
Firefox keeps the previously entered words of an input box across a refresh. Not sure whether this is a bad thing, but we should probably clear the resulting word on load, or else ask users to hard-refresh with Ctrl-F5.
> Since it is uninteresting for a word to be most similar to itself, I exclude it.
Do not do this. Follow the spec I wrote. Display a point for the actual output of the vector arithmetic operation, and show us the closest matching point that corresponds to a known word.
There is still a lot of display stuff to be done, e.g., the six slots need to be updated as described in the spec. And the words in the equation need to be highlighted in color in the 3D display. Why is this taking so long?
Sorry for the delay. I will be getting to those features. In #1 I described how the page freezes for maybe 10 seconds on every load while the vectors are unpacked. My JS book and the MDN docs say I can move that computation to another thread using a Web Worker. On a slower computer the freeze lasts even longer; it looks unpleasant, even though it doesn't affect the actual load time.
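A minimal sketch of the Web Worker hand-off I have in mind; the file names, `wordvecs.json`, and `initScatterplot` are placeholders, and the real unpacking step would replace the `JSON.parse` call:

```js
// unpack-worker.js -- runs off the main thread
onmessage = (e) => {
  const vectors = JSON.parse(e.data);   // stand-in for the real unpacking work
  postMessage(vectors);
};

// main.js -- page stays responsive while the worker unpacks
const worker = new Worker('unpack-worker.js');
worker.onmessage = (e) => {
  initScatterplot(e.data);              // assumed existing setup function
};
fetch('wordvecs.json')
  .then((r) => r.text())
  .then((text) => worker.postMessage(text));
```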
The closest matching word to king after the arithmetic is still king. The next closest is queen.
Then we're doing something wrong. Maybe renormalizing vectors when we shouldn't be. Let's display the intermediate results of the computation and see what they look like.
In my ipython notebook I did some calculations with all unnormalized vectors, and king really is the closest word to king - man + woman by cosine similarity (3COSADD). The same result shows up with normalized vectors (a JS sketch of the equivalent check follows the list below):
```
king     0.8539592374184025
queen    0.6562503967891941
kings    0.595450621264334
lynn     0.5691271748575817
princess 0.5533240160607099
monarch  0.5509679223701061
woman    0.5370790603968242
burger   0.5342758965247524
majesty  0.5295283809180157
queens   0.529066482424086
```
(funnily enough, "burger" made its way in)
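In the demo's JS, the equivalent 3COSADD check might look roughly like this. This is a sketch: `vectors` is assumed to be the word-to-array map the demo already loads.

```js
// Cosine similarity between two equal-length numeric arrays.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / Math.sqrt(na * nb);
}

// b - a + c, e.g. king - man + woman.
function arithmetic(vectors, b, a, c) {
  return vectors[b].map((v, i) => v - vectors[a][i] + vectors[c][i]);
}

const target = arithmetic(vectors, 'king', 'man', 'woman');
const ranked = Object.keys(vectors)
  .map((w) => [w, cosine(vectors[w], target)])
  .sort((p, q) => q[1] - p[1]);
console.log(ranked.slice(0, 10));   // 'king' itself comes out on top
```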
Since a point can have multiple roles, I put a coloring priority system in place. There are a few complications in adding the arithmetic words to the scatterplot, including that the arithmetic result has no precomputed nearest words (should be OK). I still think the most flexible approach is to treat the arithmetic result as a pseudo-word with its own entry in the word-to-vector mapping.
The selected point is always shown in red; after that, the pseudo-word result of the vector arithmetic is pink, the nearest word is green, the words involved in the arithmetic are blue, and everything else is black.
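A minimal sketch of how such a priority scheme might be implemented; the state field names are placeholders, not the demo's actual identifiers:

```js
// First matching rule wins, mirroring the priority order above.
const COLOR_RULES = [
  { color: 'red',   test: (w, s) => w === s.selected },
  { color: 'pink',  test: (w, s) => w === s.arithmeticResult },
  { color: 'green', test: (w, s) => w === s.nearestWord },
  { color: 'blue',  test: (w, s) => s.arithmeticInputs.includes(w) },
];

function pointColor(word, state) {
  const rule = COLOR_RULES.find((r) => r.test(word, state));
  return rule ? rule.color : 'black';
}
```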
I tried the latest version of the code. The display looks nice.
In the 3D plot, king-man+woman is much closer to queen and woman than it is to king. But I understand that in the full 300-dimensional feature space this is not the case. Interesting.
Looking at the feature vectors, king is less similar to man than I expected, and woman is also less similar to man than I expected. As a result, (woman - man) has a lot more nonzero elements than I expected. Semantically these three words differ in only a single feature, but the embeddings reflect the fact that these words occur in very different contexts. I think the problem is worse for this 300-element embedding than for the original 100-element dataset we started with.
Can we try switching back to the original dataset and see if vector arithmetic works better there?
I added an experiment to the ipython notebook with the old vector model (wordvecs77000.json).
king is still the closest word in the arithmetic:
```
king    0.846306552010454
queen   0.7335261779290196
prince  0.70112989695682
emperor 0.6842812566496257
empress 0.676760579698921
throne  0.6742607999288218
monarch 0.669020535149873
heir    0.6616695849405633
aragon  0.6599179621530367
pharaoh 0.6531471520045711
```
In the literature on 3COSADD analogies, it is standard to exclude the words used in the arithmetic:

Levy, Goldberg, and Dagan. 2015. Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics.

citing Omer Levy and Yoav Goldberg. 2014. Linguistic Regularities in Sparse and Explicit Word Representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning (CoNLL), pages 171–180, Baltimore, Maryland.
Very interesting. Okay, I guess we should exclude the arithmetic words when looking for the answer, though I'm still puzzled about why this is necessary. It gives me the feeling that these embedding vectors only weakly encode the pure meaning of a word; they more strongly represent co-occurrence statistics, which differ a lot from pure meaning when words are commonly used idiomatically, as king and queen are.

I tried finger - hand + foot = ? and got back foot instead of toe, but in the 3D plot, toe was closer to the arithmetic result than foot was. So the 3D plot is misleading. And of course, "finger", "hand", and "foot" all have multiple meanings and are both nouns and verbs, so focusing on just the body-part meaning is unreasonable. Do the publications you're referencing say how many of these analogies the vector arithmetic approach solved correctly?

By the way: when I hover over the node for the arithmetic result I expect to see the list of 10 closest words, but I don't. Why not?
The word-analogy arithmetic originally comes from Mikolov et al. 2013, "Linguistic Regularities in Continuous Space Word Representations". From the abstract: "We find that these representations [neural-network language models] are surprisingly good at capturing syntactic and semantic regularities in language, and that each relationship is characterized by a relation-specific vector offset", and the paper reports nearly 40% accuracy on semantic analogy questions. The original paper uses normalized vectors and I don't think it explicitly excludes the words used in the arithmetic, but later papers do.
I should note that the model I currently use is the newer (2016) fastText, which is based on a simpler bag-of-n-grams model (from the original fastText "Bag of Tricks" paper); the authors report similar accuracy on word tasks while being significantly faster. With subword information, they report 70% accuracy on the Mikolov analogy dataset: https://arxiv.org/pdf/1607.04606.pdf
I did not include the 10 closest words because they were not precomputed. Are these supposed to be the ones computed on the fly?
Yes. Since you have to find the closest matching word to complete the arithmetic computation, it should be trivial to compute the 10 closest and make those available to the hover display.
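A sketch of how the top 10 could fall out of that same pass, reusing the assumed `cosine` helper, `vectors` map, and `target` from the earlier sketch; the exclusion set drops the input words per the 3COSADD convention:

```js
// Rank all words once, keep the k best, excluding the arithmetic inputs.
function nearestWords(vectors, target, exclude = [], k = 10) {
  return Object.keys(vectors)
    .filter((w) => !exclude.includes(w))
    .map((w) => [w, cosine(vectors[w], target)])
    .sort((a, b) => b[1] - a[1])
    .slice(0, k);
}

// First entry answers the equation; the full list feeds the hover display.
const top10 = nearestWords(vectors, target, ['king', 'man', 'woman']);
```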
40% accuracy on word analogies is pretty good, but in demo terms, it means 6 out of 10 things people try will end up not working. Maybe we have to live with that. Maybe we can publish a list of analogies that we know will work. I think that will be very useful.
Another idea is to make the problem easier by offering four candidate answers and asking the system to choose among them, e.g., "person : feet :: horse : [mane / hooves / saddle / reins]".
This is the dataset used in SemEval 2012 Task 2, which the Mikolov 2013 paper on vectors for analogies used: https://www.cs.york.ac.uk/semeval-2012/task2.html
Here are analogies I tested that work when excluding the original words. The system works well for simple semantic relations (for a:b as c:d, I put in a - b + d and expect to get c):
Successes: king:man :: queen:woman, dog:puppy :: cat:kitten, father:son :: mother:daughter, hand:finger :: foot:toe (your example), pen:ink :: pencil:graphite, above:below :: right:wrong (left could also have worked), positive:negative :: left:returned (I was thinking of left:right), is:am :: sits:sit, leg:foot :: arm:hand, car:volkswagen :: beer:coors
Failures (bracketed word shows what came back instead): flower:daisy :: [flowers]:oak, hand:arm :: [legs]:leg, tree:oak :: [lucy]:daisy, tree:oak :: [orchids]:orchid, flock:birds :: [sheep]:cattle
This is a pretty good result. Thanks.
husband - man + woman = wife
This works, but the husband-man+woman vector appears pretty distant from the wife vector in the 3D plot. The residual is large. Words like aunt and mother appear closer than wife in the 3D plot. I'd like to understand why this is happening.
Maybe our gender and age basis vectors are distorted because we're using some words like "man" and "queen" that have multiple meanings or idiosyncratic usage. Perhaps they could be made more pure by a more careful selection of contrasting pairs, e.g., aunt/uncle is probably pretty safe, as is niece/nephew. Can we try refining our basis vectors to see if this improves things?
"man - woman + wife = husband" is another analogy that the system gets right but the display puts husband pretty far from the computed vector.
I can see how man and queen are kind of generic words with multiple meanings. I'm just not sure how projecting onto dimensions would change distances. The original similarity is measured by cosine similarity in 300 dimensions, and the three dimensions chosen for the projection don't necessarily preserve apparent distances or angles.
Most of the given words appear on opposite sides of the gender axis, but curiously "husband" appears near zero on that axis. If husband were closer to "man - woman + wife", we would get the parallelogram of vector addition we expect to see.
The residual is obtained by subtracting the age projection and the gender projection from the original word vector. If we pollute the age or gender basis vectors with irrelevant features, those features will change the magnitude of the projection along those basis vectors and thus move the point's position along those axes. In addition, the extraneous features will get subtracted out and not contribute to the residual.
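A sketch of that projection/residual computation as I understand it, assuming unit-length `gender` and `age` basis vectors (all names here are placeholders):

```js
// Dot product of two equal-length numeric arrays.
function dot(a, b) {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

// Project v onto the (unit) gender and age axes, then remove both
// components; whatever is left over is the residual.
function decompose(v, gender, age) {
  const g = dot(v, gender);   // coordinate along the gender axis
  const a = dot(v, age);      // coordinate along the age axis
  const residual = v.map((x, i) => x - g * gender[i] - a * age[i]);
  return { g, a, residual };
}
```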
I tried replacing man/woman with grandfather/grandmother. This moves husband towards the male direction of gender but it is still close to the center. Duke/duchess also have good gender separation but this still doesn't improve the placement of husband.
There may not be much we can do about husband; it's a verb as well as a noun. And there are clichés like "jealous husband" and "hen-pecked husband".
Don't worry too much about tweaking the basis vectors at this point. I can do that later. Let's just get the code finished before you run out of time.
Maybe the issue is how the vectors are normalized? Currently the result v_b - v_a is not normalized, but y = v_b - v_a + v_c is.
Without normalization for y:
We want to add a vector arithmetic capability, but to keep the demo from being too intimidating for novices, it should initially be hidden. Please read about accordion displays to learn how to do this. Start here: https://www.w3schools.com/howto/howto_js_accordion.asp
When the user expands the accordion display they should see an equation with three text boxes, and with grayed example text below, like this:
When the user has entered valid words in the first three boxes, we do the calculation, find the closest matching word, and put that in the fourth box.
In the 3D display we highlight the three input words in blue, add the calculated point as a red circle, and show the closest matching point in bright green. Maybe we'll go a little further and show the 9 next closest points in dark green.
Another thought is we might want to draw an arrow from man to woman and another arrow from king to the result vector. It will be interesting to see how close to parallel these arrows end up.
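If we draw those arrows, checking how parallel they are is cheap. A sketch reusing the assumed `cosine` helper and `vectors` map from above; `result` stands for the arithmetic output vector:

```js
// How parallel is (woman - man) to (result - king)?
const offset1 = vectors['woman'].map((v, i) => v - vectors['man'][i]);
const offset2 = result.map((v, i) => v - vectors['king'][i]);
console.log(cosine(offset1, offset2));   // 1.0 would mean exactly parallel
```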