Hi Ben,
Glad you answered your own question - can this issue be closed now? Also, open-ended questions like this are better suited to our Google Group mailing list.
OK -- will post there in the future. Thanks.
As you've observed, the word and doc vectors are stored separately and searched for lists-of-most-similar vectors separately. There's a forum message with some examples that overlap with what you've discovered, as well as showing how to merge results.
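For reference, a minimal sketch of that kind of separate-then-merged query, assuming a trained gensim (>= 4.0) Doc2Vec model named model and a hypothetical query word 'father' (parameter spellings differ in older releases):

```python
query_vec = model.wv['father']  # raw word vector for the query term

# Each space is stored and searched separately...
word_hits = model.wv.most_similar([query_vec], topn=10)  # nearest words
doc_hits = model.dv.most_similar([query_vec], topn=10)   # nearest doc-tags

# ...then the two hit lists can be merged into one ranking by cosine similarity.
merged = sorted(
    [('word', key, sim) for key, sim in word_hits] +
    [('doc', key, sim) for key, sim in doc_hits],
    key=lambda hit: hit[2],
    reverse=True,
)
for space, key, sim in merged[:10]:
    print(f'{space:>4}  {sim:.3f}  {key}')
```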
But also note that tiny documents (of just one or a few words) are kind of challenging for Doc2Vec, and at the very least may need more training/inference iterations to achieve interpretable results. (Consider: with a DBOW model, inferring a vector for a 100-word document with the default 10 passes means 1000 training-nudges from the doc-vector's random initial state: one for each word for each pass. Inferring a vector for a 1-word document means just 10 training-nudges.) Also, many have observed better inference behavior with a much larger passes value (and sometimes a smaller initial alpha).
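A minimal sketch of that advice, assuming a trained gensim (>= 4.0) Doc2Vec model named model, where the per-call passes parameter is spelled epochs (older releases called it steps); the exact values are illustrative:

```python
# Infer a vector for a tiny (one-word) document with many more passes
# than the default, and a smaller starting learning rate.
vec = model.infer_vector(
    ['father'],  # a hypothetical one-word document
    epochs=200,  # each pass gives a 1-word doc only one training nudge
    alpha=0.01,  # smaller initial alpha than the common 0.025 default
)
print(model.dv.most_similar([vec], topn=5))
```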
Hi All,
I've trained a Doc2Vec model and I'm trying to query the documents by keyword. The word embeddings look pretty good:
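A sketch of the kind of word-space query involved, with 'father' as a hypothetical stand-in for the actual query word (gensim >= 4.0 spellings):

```python
# Nearest words to the query word, in the word-vector space.
print(model.wv.most_similar('father', topn=5))
```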
but a strange thing happens when I use a word to find the most similar docvecs:
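Presumably a query of roughly this shape, where the raw word vector is used to search the doc-vector space directly (same hypothetical query word, gensim >= 4.0 spellings):

```python
# Use the word's raw vector to search the *document* space.
word_vec = model.wv['father']
print(model.dv.most_similar([word_vec], topn=5))
```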
They're switched! This also happens for (father, husband). Does anyone have any idea why this would happen? Perhaps the document vectors are slightly displaced relative to the word vectors? Any ideas on how to address this? The model was trained w/
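something along these lines; this is a hypothetical configuration, not the actual settings, with all values illustrative (gensim >= 4.0 spellings; dbow_words=1 is one way a DBOW model also gets trained word vectors, which a pure DBOW run would leave untrained):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# corpus: a hypothetical list of raw text strings
docs = [TaggedDocument(words=text.split(), tags=[i])
        for i, text in enumerate(corpus)]

model = Doc2Vec(
    documents=docs,
    dm=0,            # DBOW mode
    dbow_words=1,    # also train word vectors alongside the doc vectors
    vector_size=100,
    min_count=5,
    epochs=20,
)
```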
Another example, which is maybe illustrative:
An interesting observation is that the other tokens in these most-similar sentences are rare, due to odd punctuation or words from other languages. It seems like the query is returning hits containing words that frequently appear in the context of the query word, rather than hits containing the query word itself or words semantically similar to it.
Thanks Ben
EDIT: More symptoms:
So words that are close to a doc2vec embedding of w are those that often co-occur with that w, but words that are close to a word2vec embedding of w are words that often appear instead of w.
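A sketch of those two contrasting queries (gensim >= 4.0, hypothetical word 'father'):

```python
w = 'father'

# word2vec-style neighbors: words that appear *in place of* w.
print(model.wv.most_similar(w, topn=5))

# doc2vec-style neighbors: infer a doc-vector for the one-word document [w],
# then find the words nearest that doc-vector. These tend to be words that
# co-occur with w rather than substitutes for it.
doc_vec = model.infer_vector([w], epochs=100)
print(model.wv.most_similar([doc_vec], topn=5))
```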
I guess the question still remains of how to use words to query docs (and vice versa).

EDIT: I've answered my own question:
Apparently the word and doc spaces have distinct semantic structure, so when you're querying you need to make sure you're in the right space. I'll leave this up in case it's useful to anyone.
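For instance, to use a word to query docs while staying in the doc space, one option is to treat the keyword as a tiny document, infer a doc-vector for it, and search the doc vectors with that (a sketch, gensim >= 4.0 spellings):

```python
# Infer a doc-vector for the one-word "document" and search the doc space
# with it, rather than searching the doc space with a raw word vector.
query_vec = model.infer_vector(['father'], epochs=100)
print(model.dv.most_similar([query_vec], topn=10))
```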