jhlau / doc2vec

Python scripts for training/testing paragraph vectors
Apache License 2.0

About getting document vector #26

Closed chhyun closed 5 years ago

chhyun commented 5 years ago

Hello, I'm very new to doc2vec and have some questions about document vectors. I'm trying to get a vector for a phrase such as 'cat like mammal'. So far, I've loaded a pre-trained doc2vec model and tried the code below:

    import gensim.models as g

    model = "path/pre-trained doc2vec model.bin"
    m = g.Doc2Vec.load(model)

    oneword = 'cat'
    phrase = 'cat like mammal'

    oneword_vec = m[oneword]  # returns a vector for the single word
    phrase_vec = m[phrase]    # does not return a vector for the whole phrase

With this code I could get a vector for the single word 'cat', but not for 'cat like mammal'. I assume that's because word2vec only provides vectors for single words like 'cat', right? (Please correct me if I'm wrong.) So I searched, found infer_vector(), and tried the code below:

    phrase = phrase.lower().split(' ')
    phrase_vec = m.infer_vector(phrase)

This does give me a vector, but I get different values every time I call phrase_vec = m.infer_vector(phrase) again. I think this is because infer_vector() has a 'steps' parameter.

When I set steps=0, I always get the same vector:

    phrase_vec = m.infer_vector(phrase, steps=0)
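One way to put a number on how much two repeated inferences differ is cosine similarity. The snippet below is only a sketch: `cosine_similarity`, `run_1` and `run_2` are hypothetical names, and the two vectors are made-up placeholder values, not output from any real model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Placeholder values standing in for two infer_vector() results
# for the same phrase (hypothetical numbers, not real model output).
run_1 = [0.10, -0.20, 0.30]
run_2 = [0.12, -0.18, 0.29]

similarity = cosine_similarity(run_1, run_2)
print(similarity)
```

A value close to 1.0 would mean the two runs point in nearly the same direction even though the individual numbers differ.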

However, I also read that a document vector can be obtained by averaging the vectors of the words in the document. For example, if the document consists of the three words 'cat like mammal', you add the vectors of 'cat', 'like', and 'mammal' and then take the average, and that is the document vector. (Please correct me if I'm wrong.)
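The averaging idea described above can be sketched in plain Python as follows. This is only an illustration: `average_vectors` and `toy_vectors` are hypothetical names, and the three-dimensional vectors are invented toy values rather than anything taken from the pre-trained model.

```python
def average_vectors(words, word_vectors):
    """Element-wise mean of the vectors of the words that have a vector."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        raise ValueError("none of the words has a vector")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy word vectors (made-up values, for illustration only).
toy_vectors = {
    "cat": [1.0, 0.0, 2.0],
    "like": [0.0, 3.0, 1.0],
    "mammal": [2.0, 3.0, 0.0],
}

phrase = "cat like mammal".lower().split(" ")
phrase_vec = average_vectors(phrase, toy_vectors)
print(phrase_vec)
```

In this sketch, words without a vector are skipped; that is one possible design choice, not something the approach requires.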

So... here are some questions.

  1. Is using infer_vector() with 0 steps the right way to get the vector of a phrase?
  2. If averaging the word vectors is a valid way to get a document vector, is there no need to use infer_vector()?
  3. What is model.docvecs for?