piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim

Comparing document and word embeddings #848

Closed: bkj closed this issue 8 years ago

bkj commented 8 years ago

Hi All,

I've trained a Doc2Vec model and I'm trying to query the documents by keyword.

The word embeddings look pretty good:

>>> pprint(m.most_similar('mother'))
[(u'mum', 0.7254319190979004),
 (u'mommy', 0.6660343408584595),
 (u'mummy', 0.6226981282234192),
 (u'mom', 0.5665115714073181),
 (u'momma', 0.5245990753173828),
 (u'mumma', 0.5145266056060791),
 (u'mam', 0.4907728433609009),
 (u'parent', 0.4873749613761902),
 (u'mammy', 0.47532182931900024),
 (u'auntie', 0.45814812183380127)]

>>> pprint(m.most_similar('wife'))
[(u'wifey', 0.6948119401931763),
 (u'fianc\xe9e', 0.58426833152771),
 (u'grandmother', 0.5661832690238953),
 (u'grandma', 0.5495551824569702),
 (u'chucker', 0.5271111726760864),
 (u'housewife', 0.5199964642524719),
 (u'daddy', 0.5133747458457947),
 (u'spouse', 0.5079671740531921),
 (u'fianc\xe9', 0.507759690284729),
 (u'fiance', 0.5044819116592407)]

but a strange thing happens when I use a word to find the most similar docvecs:

mtcs = m.docvecs.most_similar([m['mother']])
pprint(list(df.loc[[x[0] for x in mtcs]].body))

[u'wife of @, mom of incan , ijo , and dina .',
 u'wife , godmother , friend , #talent & #organizational consultant , #career #coach , facilitator of #learning',
 u"evan's wife",
 u'amateur lawyer , cook , and wife .',
 u"kudo's wife",
 u"-25 years old | -officially ammir's wife",
 u"a soldier's wife",
 u'wife of @',
 u'wife , mother , laborer',
 u"adhe's wife"]

mtcs = m.docvecs.most_similar([m['wife']])
pprint(list(df.loc[[x[0] for x in mtcs]].body))
[u'photographer , artist , mother , sister , daughter .',
 u'mother',
 u'a mom of two .',
 u'mother',
 u'heiltsuk mother ...',
 u'mother keder',
 u'a mother',
 u'mother , ex-sailor , mother , ex-writer , mother , ex-teacher , mother and wife and proud .',
 u'a bitch , a lover , a child and a mother',
 u'mother , mama , mummy , mom \u2661']

They're switched! This also happens for (father, husband). Does anyone have any idea why this would happen? Perhaps the document vectors are slightly displaced relative to the word vectors? Any ideas on how to address this?

The model was trained with:

gensim_params = {
    "dm"        : 1,
    "hs"        : 1,
    "negative"  : 0,
    "dm_mean"   : 0,
    "size"      : 256,
    "window"    : 10,
    "min_count" : 20,
    "workers"   : 10,
    "iter"      : 20
}
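
For context, a minimal sketch of how a parameter dict like this would typically be passed to the Doc2Vec constructor (older, pre-4.0 gensim keyword names, matching the parameters above; the corpus-building step is hypothetical and only shows the expected input shape):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# hypothetical corpus: one TaggedDocument per row, tagged with its DataFrame index
documents = [TaggedDocument(words=body.split(), tags=[idx])
             for idx, body in zip(df.index, df.body)]

# `size` and `iter` are the older keyword names
# (renamed to `vector_size` and `epochs` in gensim 4.0)
m = Doc2Vec(documents, **gensim_params)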

Another example, which is maybe illustrative:

mtcs = m.docvecs.most_similar([m['scientist']])
pprint(list(df.loc[[x[0] for x in mtcs]].body))
[u'computer scientist--lau',
 u'computer engineer/scientist',
 u'elektrotehnicar computer',
 u'computer',
 u'computer whiz~disciplined~neat~shy~gadget geek~stubborn',
 u'computer and printer technology specialist since 1985 . avid gamer , digital artist and published author .',
 u'liton computer',
 u'computer interlactual',
 u'biomedical __black_heart_suit__ engineer | dragonboat athlete',
 u'computer engineer/footballer']

mtcs = m.docvecs.most_similar([m['computer']])
pprint(list(df.loc[[x[0] for x in mtcs]].body))
[u'consultant engineer',
 u'science , math , programming & electronics educator since 1990',
 u'it consultant engineer',
 u'15/smnhsian/selfiequeen',
 u'\u2022bs math-computer science \u2022bulsu \u2022sana makaya',
 u'consultant engineer',
 u'science',
 u'science',
 u'student at kuniv - college of science',
 u'science',
 u"~~marine science '12~~universitas brawijaya~~1/7 belatung~~kpopers"]

An interesting observation: the other tokens in these most-similar documents tend to be rare, due to odd punctuation or words from other languages. It seems the query returns hits containing words that frequently appear in the context of the query word, rather than hits containing the query word itself or words with a similar semantic meaning.

Thanks, Ben

EDIT: More symptoms:

>>> m.most_similar([m.infer_vector(['sports'])])
[(u'rehabilitator', 0.6027324199676514), (u'betting', 0.5178006887435913), (u'bettor', 0.49929359555244446), (u'fanatic', 0.496326744556427), (u'extreme', 0.493328720331192), (u'finatic', 0.4258161783218384), (u'opta', 0.4249700903892517), (u'notch', 0.4159475862979889), (u'nut', 0.4037990868091583), (u'hgv', 0.4021718502044678)]
>>> m.most_similar([m['sports']])
[(u'sports', 1.0000001192092896), (u'sport', 0.6506721377372742), (u'motorsports', 0.5084659457206726), (u'cricket', 0.49583980441093445), (u'e-sport', 0.4924854040145874), (u'football', 0.49159306287765503), (u'e-sports', 0.48690176010131836), (u'pro-wrestling', 0.48110586404800415), (u'ice-hockey', 0.47393375635147095), (u'motorsport', 0.469654381275177)]

So the words that are close to a doc2vec embedding of w are those that often co-occur with w, while the words that are close to a word2vec embedding of w are those that often appear in place of w. I guess the question still remains of how to use words to query docs (and vice versa).

EDIT: I've answered my own question:

m.docvecs.most_similar([m.infer_vector(['mother'])])

Apparently the word and doc spaces have distinct semantic structure, so when querying you need to make sure your query vector lives in the same space as the vectors you're searching. I'll leave this up in case it's useful to anyone.
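
In case it helps, a small sketch of both query directions against a trained model m (the doc tag below is a placeholder, not a tag from my data):

# words -> docs: infer a doc vector from the query word(s), then search the doc space
m.docvecs.most_similar([m.infer_vector(['mother'])])

# docs -> words: take an existing doc vector and search the word space with it
doc_vec = m.docvecs['some_doc_tag']   # placeholder tag
m.most_similar([doc_vec])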

tmylk commented 8 years ago

Hi Ben,

Glad you answered your own question -- can this issue be closed now?

Also, such open questions are better suited for our Google Group mailing list.

bkj commented 8 years ago

OK -- will post there in the future. Thanks.

gojomo commented 8 years ago

As you've observed, the word vectors and doc vectors are stored separately, and each set is searched for its most-similar neighbors separately. There's a forum message with some examples that overlap with what you've discovered, which also shows how to merge results from the two sets.
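
That forum message isn't reproduced here, but one straightforward way to merge the two result lists might look roughly like this (a sketch against the older API, not the exact code from that post; note that similarity scores from the two spaces aren't strictly comparable):

query_vec = m['mother']
word_hits = m.most_similar([query_vec], topn=10)          # (word, cosine similarity) pairs
doc_hits  = m.docvecs.most_similar([query_vec], topn=10)  # (doc tag, cosine similarity) pairs

# naive merge: pool both lists and re-sort by similarity score
merged = sorted(word_hits + doc_hits, key=lambda pair: pair[1], reverse=True)[:10]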

But also note that tiny documents (of just one or a few words) are kind of challenging for Doc2Vec, and at the very least may need more training/inference iterations to achieve interpretable results. (Consider: with a DBOW model, inferring a vector for a 100-word document with the default 10 passes means 1000 training-nudges from the doc-vector's random initial state: one for each word on each pass. Inferring a vector for a 1-word document means just 10 training-nudges.) Also, many have observed better inference behavior with a much larger passes value (and sometimes a smaller initial alpha).
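
A hedged illustration of that last suggestion (the numbers are arbitrary; the keyword was `steps` in gensim of this vintage and became `epochs` in 4.0+):

# more inference passes and a smaller starting alpha than the usual inference
# defaults, for a one-word "document"
vec = m.infer_vector(['mother'], steps=100, alpha=0.025)
m.docvecs.most_similar([vec])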