idio / wiki2vec

Generating Vectors for DBpedia Entities via Word2Vec and Wikipedia Dumps. Questions? https://gitter.im/idio-opensource/Lobby

Concept doesn't exist in trained model - fallback to a nearest neighbour #21

Closed vondiplo closed 8 years ago

vondiplo commented 8 years ago

Let's say I'm looking for a specific concept which doesn't exist, per se, in the model file. How would I be able to find the nearest representative vector based on the currently existing vectors?

@dav009 (I hope it's alright I'm tagging you here, thought of grabbing your attention if you're still available).

Thanks a lot!

dav009 commented 8 years ago

Well, I guess you could try combining the vectors of the words composing the name of the entity, i.e. vladimir + putin. You could also try going to Wikidata/DBpedia and playing with the vectors of the entities around it, i.e. look up vladimir_putin on Freebase/Wikidata/DBpedia, get its types (politician, Russian, ...) and related entities, and play with their vectors.
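The first suggestion above can be sketched as averaging the vectors of the tokens in the entity's surface form. The toy 3-dimensional vectors and the `fallback_vector` helper below are hypothetical stand-ins for a real model's lookup table, just to illustrate the idea:

```python
import numpy as np

# Toy lookup table; in practice these would come from the trained model.
vectors = {
    "vladimir": np.array([0.2, 0.1, 0.4]),
    "putin":    np.array([0.4, 0.3, 0.0]),
}

def fallback_vector(entity, vectors):
    """Approximate a missing entity vector by averaging the vectors
    of the tokens that compose its name."""
    tokens = entity.lower().split("_")
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        raise KeyError(f"no component of {entity!r} is in the vocabulary")
    return np.mean(found, axis=0)

# "Vladimir_Putin" itself is not in the table, but its parts are.
approx = fallback_vector("Vladimir_Putin", vectors)
```

Averaging is a crude composition, but it often lands near the right neighbourhood of the space, which is usually enough for nearest-neighbour queries.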

Ideally, all entities should get their own vector; if one doesn't appear, it's usually due to:

vondiplo commented 8 years ago

I see, I guess this fix isn't reflected in the pre-compiled models then (I tried querying for the vector visual_cortex with no success)? Also, what kind of support is there for plurals and different tenses (of verbs)? Would they usually appear as their own vector or resolve back to some base form (I know that on Wikipedia you sometimes have a redirect from one article to another)?

phdowling commented 8 years ago

Regarding the vocab size limitation: pretty much, some vectors will almost always be missing since some words don't occur often enough in Wikipedia. You can set min_count to 1 (or use a custom vocab trimming rule), but it's unlikely that word2vec will generate a very good vector for a word it only sees once in the training data.
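The effect of min_count is just a frequency cutoff applied when the vocabulary is built: words below the threshold never get a vector at all. A minimal sketch of that filtering step (the corpus here is invented for illustration; gensim's own `Word2Vec(min_count=...)` does this internally):

```python
from collections import Counter

# Tiny invented corpus: "cortex" appears only once.
corpus = [
    ["visual", "cortex", "is", "part", "of", "the", "brain"],
    ["the", "brain", "processes", "visual", "input"],
]

def build_vocab(sentences, min_count=2):
    """Keep only tokens that occur at least min_count times,
    mimicking word2vec's vocabulary pruning."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {w for w, c in counts.items() if c >= min_count}

vocab = build_vocab(corpus, min_count=2)
# "cortex" is pruned: with min_count=2 it will never receive a vector.
```

Lowering min_count keeps more words, but vectors for words seen only once or twice are mostly noise, so the default cutoff is usually a reasonable trade-off.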

To address @vondiplo's second question: as such, the saved models are "dumb"; nothing about them changes just by querying for vectors, and no words are automatically combined or disambiguated. If you want a verb in a specific tense in the vocabulary, you'll basically have to make sure it occurs in exactly that form a sufficient number of times in the training data. A way to get around this somewhat is to train the model on stemmed tokens, and to stem all query tokens as well. The vectors may be less "precise", but you'll probably have far fewer vocabulary misses, and often that may be preferable.
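The stemming trick works because both sides of the lookup pass through the same normalisation, so inflected forms collapse to one vocabulary entry. A real setup would use a proper stemmer (e.g. NLTK's PorterStemmer); the crude suffix-stripper below is only an illustration of the idea:

```python
def crude_stem(token):
    """Toy stemmer: strip a few common English suffixes.
    Stands in for a real stemmer like NLTK's PorterStemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

# Apply the same stemmer to training tokens and, later, to every query.
training_tokens = ["walking", "walks", "walked", "walk"]
stemmed = [crude_stem(t) for t in training_tokens]
# All four inflections collapse to the single vocabulary entry "walk".
```

The cost is that distinctions between tenses are lost, which is exactly the "less precise" trade-off mentioned above.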

For the first question: what @dav009 says, basically. Vectors for words and entities that co-occur with the entity you are looking for, along with the raw tokens of its surface form, are probably your best bet.

I think Gensim will soon add (has added?) support for online word2vec, which would allow growing the vocabulary at a later stage. This might be interesting for this particular use case, but I'm not sure what the status is there.
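The core of "online" word2vec is that the vocabulary can grow after initial training: new words get fresh vectors while existing vectors are kept and can be refined. Gensim exposes this via `build_vocab(..., update=True)` followed by another `train(...)` call; the pure-Python sketch below only mimics the vocabulary-growth step with toy data, so the vector values and the `grow_vocab` helper are invented for illustration:

```python
import random

def grow_vocab(vectors, new_tokens, dim=3, seed=0):
    """Add randomly initialised vectors for unseen tokens,
    leaving existing vectors untouched (as online training would)."""
    rng = random.Random(seed)
    for tok in new_tokens:
        if tok not in vectors:
            vectors[tok] = [rng.uniform(-0.5, 0.5) for _ in range(dim)]
    return vectors

# "visual" already has a vector; "cortex" is new and gets one added.
model = {"visual": [0.1, 0.2, 0.3]}
grow_vocab(model, ["visual", "cortex"])
```

In a real incremental run the new vectors would then be trained on the new text, not left at their random initialisation.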

vondiplo commented 8 years ago

@dav009 and @phdowling, thank you very much for your responses. A feature like the one you described in Gensim would be superb.