For words that weren't seen during training, the Word2Vec model knows nothing. seeded_vector()
just gives a random, low-magnitude vector, of the sort that would be appropriate at the start of training.
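For illustration, a rough sketch of what that seeded random-vector fallback amounts to (the helper name seeded_guess_vector and the 300-dimension example are made up here; the exact seeded_vector() signature varies across gensim versions):

```python
import zlib
import numpy as np

def seeded_guess_vector(word, vector_size):
    # Deterministically seed a RNG from the word (CRC32 of its bytes), so the
    # same OOV word always maps to the same (meaningless but stable) vector.
    rng = np.random.RandomState(zlib.crc32(word.encode("utf8")))
    # Low-magnitude values, comparable to Word2Vec's untrained initial weights.
    return (rng.rand(vector_size) - 0.5) / vector_size

oov_vec = seeded_guess_vector("unseenword", 300)
```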
Theoretically, if you have one or more example surrounding-context(s) for the OOV word, you could conceivably 'infer' a vector for the word, similar to the infer_vector()
process used in Doc2Vec, but there's no existing code for that. (It might not be too hard to add, following the example of the Doc2Vec code.)
Without example surrounding contexts, the current model will have nothing to even guess at a vector for an OOV word.
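(For reference, this is roughly the Doc2Vec infer_vector() process that analogy refers to; a Word2Vec equivalent would have to be written by hand. The toy corpus and parameters below are just placeholders.)

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=["the", "cat", "sat", "on", "the", "mat"], tags=[0]),
    TaggedDocument(words=["the", "dog", "sat", "on", "the", "rug"], tags=[1]),
]
d2v = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

# Fit a vector for an unseen bag of words against the frozen, trained model --
# e.g. the observed surrounding context of an OOV word.
inferred = d2v.infer_vector(["a", "cat", "sat", "on", "a", "rug"])
```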
The 'FastText' variant of Word2Vec published by Facebook can train word-vectors that are a combination of both a vector for the full token and vectors for n-gram substrings of that token. In that way, for languages where such substrings are morphemes hinting at meaning, vectors can be synthesized for OOV words, and they're better-than-random at some tasks, especially if the OOV words are just variants/misspellings of in-vocabulary words. There's some provisional support for using FastText-inside-Gensim in the pre-release develop branch.
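For anyone landing here later, a minimal sketch of that FastText approach using the (since-merged) gensim FastText class; parameter names such as vector_size vs. size differ between gensim versions, and the toy corpus is just for illustration:

```python
from gensim.models import FastText

sentences = [
    ["the", "quick", "brown", "fox", "jumps"],
    ["the", "lazy", "dog", "sleeps"],
]

# Character n-grams (min_n..max_n) are trained alongside full-word vectors.
ft = FastText(sentences, vector_size=50, window=3, min_count=1, min_n=3, max_n=6)

# An OOV (here, misspelled) token still gets a vector, synthesized from the
# n-grams it shares with in-vocabulary words.
vec = ft.wv["quik"]
```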
(It occurs to me that theoretically, you might be able to deduce/decompose some marginally-useful n-gram vectors from a set of full-token-vectors, after initial training, that might be slightly better than nothing. But there's no code for doing that and I wouldn't expect it to work as well as the FastText approach.)
In some training modes, word-vectors can tend to be a bit like the average of their surrounding words' vectors. So if nothing else is available, and you have an OOV word with example contexts, some combination like an average of those context words' vectors might again be better than nothing as a guess for the OOV word vector. But that's kind of speculative.
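A speculative sketch of that context-averaging guess (the helper name and example context are made up):

```python
import numpy as np

def guess_oov_vector(model, context_words):
    # Average the vectors of whatever context words the model does know.
    known = [w for w in context_words if w in model.wv]
    if not known:
        return None  # nothing to base a guess on
    return np.mean([model.wv[w] for w in known], axis=0)

# e.g., guess a vector for an unseen word from one observed context window:
# guess = guess_oov_vector(model, ["patient", "was", "prescribed", "daily"])
```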
@gojomo Thanks for your detailed explanation. It extends my idea about OOV words. :)
Hey there, I am currently writing my master's thesis about word embeddings in information retrieval. In my setting, the performance (in terms of mean average precision) improves when I replace OOV words with some specific vector. For GoogleNews I chose 'UNK', since their tensorflow word2vec examples also use that token for out-of-vocabulary words during training. For GloVe models, I arbitrarily picked the dot '.', which at least will not push my semantics in the wrong direction.
Another approach would be up-training the existing embeddings, while freezing the original weights. I will probably try it at some point. Gensim also has a method for it: intersect_word2vec_format(...) and then train as usual with min_count=1 :+1:
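In case it helps others, a hedged sketch of that recipe against the gensim 3.x API current at the time (the method's location and parameter names may differ in later releases; the corpus and file names are placeholders):

```python
from gensim.models import Word2Vec

new_sentences = [["some", "domain", "specific", "text"], ["more", "sentences"]]

model = Word2Vec(size=300, min_count=1)   # min_count=1 so rare/OOV words get slots
model.build_vocab(new_sentences)

# Copy pretrained vectors for the overlapping vocabulary. lockf=0.0 freezes the
# imported vectors, so subsequent training only updates the newly added words.
model.intersect_word2vec_format("GoogleNews-vectors-negative300.bin",
                                binary=True, lockf=0.0)

model.train(new_sentences, total_examples=model.corpus_count, epochs=5)
```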
Hi @lgalke, that sounds like very interesting research. We were wondering how effective the up-training is; there is no research about it. Please email student-projects@rare-technologies.com and we will arrange a Skype call.
Generally, we construct word vectors from a huge number of words, but we still face OOV words when applying the trained model. Are there any other ways supported in gensim besides seeded_vector()? For example, getting the top-n most probable words by providing the context words.
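On that last point: gensim's Word2Vec does expose predict_output_word() for models trained with negative sampling, which returns the top-n most probable (in-vocabulary) words for a given context, so it suggests known substitutes rather than a new vector. A toy sketch:

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]
model = Word2Vec(sentences, min_count=1)  # default negative-sampling training

# Most probable center words for the given surrounding context.
print(model.predict_output_word(["the", "sat", "on"], topn=3))
```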