piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

AttributeError: 'Doc2Vec' object has no attribute 'syn0_lockf' #501

Closed dkonopnicki closed 8 years ago

dkonopnicki commented 9 years ago

Hi, Not sure if it is allowed to do that:

model = gensim.models.Doc2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
array = model.infer_vector(doc_words=words)

but I get this error:

Traceback (most recent call last):
  File "E:/Sites/PythonProjects/SmallTwitter/createWord2Vec.py", line 40, in
    array = model.infer_vector(doc_words=words)
  File "C:\Anaconda3\lib\site-packages\gensim\models\doc2vec.py", line 696, in infer_vector
    doctag_vectors=doctag_vectors, doctag_locks=doctag_locks)
  File "C:\Anaconda3\lib\site-packages\gensim\models\doc2vec.py", line 133, in train_document_dm
    word_locks = model.syn0_lockf
AttributeError: 'Doc2Vec' object has no attribute 'syn0_lockf'

Can you help?

gojomo commented 9 years ago

What you're attempting isn't supported, for a couple reasons.

The GoogleNews word vectors, in the raw format of the original word2vec.c code, are not a full Doc2Vec ("Paragraph Vectors") model – they're just a list of word vectors. That's not enough for Doc2Vec training (and inference is a form of constrained training).

Overall it doesn't make much sense to load these word-vector lists into a Doc2Vec instance... but it can initially succeed, and still work for simple purposes, like looking-up word vectors or performing some word-similarity calculations/rankings. Still, those supported actions can (and should) be done by loading the word2vec.c-format vectors into a Word2Vec instance.
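
For those supported uses, a minimal sketch (assuming a gensim version from this thread's era, where load_word2vec_format is a classmethod on Word2Vec; later releases moved that loader to KeyedVectors):

import gensim

# Load the raw word2vec.c-format vectors into a plain Word2Vec model:
# good for word lookups and similarity queries, but not for further training.
wv_model = gensim.models.Word2Vec.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True)
print(wv_model.most_similar(positive=['woman', 'king'], negative=['man'], topn=3))
print(wv_model.similarity('restaurant', 'cafe'))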

We could add a clearer error when a Doc2Vec model doesn't have enough state to do infer_vector().

guy4261 commented 9 years ago

Hi, Just came across the same error.

May I ask why this isn't sufficient?

From what I got out of Mikolov and Quoc Le's paper, each paragraph vector is learned from the windows of words you can derive from it, regardless of whether any other paragraph vectors are available.

Thus I expected infer_vector to use only the already-learned words to infer the paragraph vector (treating a randomly-initialized paragraph vector as if it were an additional word used in predicting the next word in each window).

So can you please elaborate on why it won't work (algorithm-wise)?

EDIT: in the code, it seems that:

import numpy
# Work around the "syn0_lockf is missing" error.
# If I got it right, numpy.ones() unlocks the word vectors, numpy.zeros() locks them.
model.syn0_lockf = numpy.zeros(len(model.syn0), dtype=numpy.float32)
# Turn off hierarchical softmax, which removes the "syn1 is missing" error.
model.hs = 0

This allows infer_vector to run (and return somewhat plausible-looking results) and keeps the examples working (i.e. king - man + woman == queen).

from scipy.spatial.distance import cosine
# Note: scipy's cosine() is a distance (1 - cosine similarity), so values
# near 1.0 mean roughly orthogonal vectors.
res1 = model.infer_vector("man in japanese restaurant".split())
res2 = model.infer_vector("woman in japanese restaurant".split())
res3 = model.infer_vector("the cat sat on the window sill".split())
print(cosine(res1, res2))  # -> 0.989338875748
print(cosine(res2, res3))  # -> 1.04789436609
print(cosine(res1, res3))  # -> 1.05761141703

gojomo commented 9 years ago

The issue is the missing hidden-layer (syn1 if using hierarchical-softmax; syn1neg if using negative-sampling). That's what's needed to turn context word/doc vectors into target-word predictions (in the forward-propagation) and then adjust the hidden-layer and context vectors (via back-propagation) whenever the predictions aren't perfect.

Since the word2vec.c format only has the context vectors – the inputs to the prediction neural net, not its inside layer – there's not enough there to continue training, or to do inference (which is essentially a highly-constrained form of training).

I think all you're seeing with your hand-patched model is the randomly-initialized vectors that have received no inference-training. (If hs=0 and negative=0 then both training modes are off, train/infer calls will be no-ops, and calls to infer_vector() will just return the meaningless random seed vector that would, under normal operation, be improved to a usable vector via working inference. Try passing in just an empty list – [] – and I think you'll see similar vectors/distances as your examples.)
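
A quick way to check that suspicion, reusing the hand-patched model from the earlier comment (a sketch only; exact behaviour may vary across gensim versions):

from scipy.spatial.distance import cosine

# If inference is effectively a no-op, an empty document yields just the random
# seed vector, so its distance to any other inferred vector should sit in the
# same ~1.0 range as the distances reported above.
empty_vec = model.infer_vector([])
res1 = model.infer_vector("man in japanese restaurant".split())
print(cosine(empty_vec, res1))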

OK, now to something speculative that's more in the space of "possible research directions":

At one point testing the code, I had a bug that zeroed the hidden layer after each iteration over the training data. And, my test setup (the demo IMDB notebook) evaluated the quality of the doc-vectors in the downstream sentiment-prediction task after each iteration.

Quite to my surprise, even with the bug that zeroed the entire hidden layer, the doc vectors still tended to get incrementally better after each training pass. (Though not as fast as with a persistent hidden layer.) That is, even with all the NN's word-predictive powers (the hidden layer) destroyed after each pass over the corpus, the mere fact that the input context vectors started out slightly 'better' (sensibly reflecting something about the underlying text) on the next pass quickly allowed the hidden layer to become meaningful again, and the context vectors to continue improving.

Among other things, this suggests:

guy4261 commented 9 years ago

Thanks for this detailed reply! It would take me some time to grok it though.

I'd say that my dream is to preload trained word vectors (that is - the GoogleNews binaries) and use them to either train a doc2vec model or just infer vectors for new paragraphs. But I need to do some more reading - both of your reply and the papers - to understand if that's possible or just a result of my misunderstandings...

Thanks again!

g.

guy4261 commented 9 years ago

Two more quick questions (since you've got such great answers):

1) With no syn1, does that mean I can't use my word vectors to predict the next word? Only as vectors that can be added/subtracted to get king-man+woman==queen stuff (and cosmul operations too, of course)?

2) I know that hierarchical softmax and negative sampling are tricks to speed up training (from here: https://drive.google.com/file/d/0B7XkCwpI5KDYRWRnd1RzWXQ2TWc/edit ). My question is: it seems that gensim uses HS by default. Why? Is it faster / more accurate? If it's a try-and-see-for-myself case, that would also be a good answer :)

gojomo commented 9 years ago

(1) With no syn1 (HS) or syn1neg (negative-sampling), there's no neural-net. So, no way to attempt target-word predictions or to gradually improve the predictions over training cycles.

Note, though, that projects generally don't try to use the actual word-predictions from the net or even care much about how good, on an absolute scale, the NN gets at predicting words. Is its top prediction right 99% of the time? Or 1% of the time? We don't care – and I haven't even seen a codebase that offers direct access to those predictions or stats. (It'd be possible but expensive.) It's just that the attempt to make the NN better – no matter how good or bad it gets – causes the interesting word/doc vectors to be created.

(2) Most of the gensim defaults are arbitrary/historical-accidents – they weren't consciously chosen as best settings. So definitely try variants. Most of the published work I've seen on big datasets seems to prefer negative-sampling.
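
For example, to make those choices explicit rather than relying on the defaults (sentences here stands for any iterable of token lists; parameter names follow the gensim constructors of this era, where the vector-size argument is still called size):

from gensim.models import Word2Vec

# Hierarchical softmax only:
model_hs = Word2Vec(sentences, size=100, window=5, min_count=5, hs=1, negative=0)
# Negative sampling only (often preferred on larger corpora):
model_neg = Word2Vec(sentences, size=100, window=5, min_count=5, hs=0, negative=10)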

piskvorky commented 9 years ago

You can get prediction scores (probabilities) using the score_ functions of a trained word2vec object. For example, model.score(sentences) will predict the (log) probability for a sequence of sentences. See tutorial here.
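
A small usage sketch of that call (it requires a model trained with hierarchical softmax, hs=1; model is assumed to be such a trained Word2Vec instance):

# score() returns one log-probability per sentence (each sentence a list of tokens).
test_sentences = [
    "the cat sat on the mat".split(),
    "colorless green ideas sleep furiously".split(),
]
log_probs = model.score(test_sentences)
for sent, lp in zip(test_sentences, log_probs):
    print(lp, ' '.join(sent))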

Gensim defaults followed the defaults of the original C word2vec at the time it was ported. I think the C defaults have changed since, so I'm not sure they still match. Hierarchical softmax gets better scores than negative sampling on the same input, so it's the default. Note, though, that HS is also much slower than NS on the same input.

gojomo commented 9 years ago

Note the probabilities from the score_ functions are for whole sentences – they don't provide a way to get the most-likely word, or most-likely-N-words, given a context. (To get that, you'd essentially have to run every word in the vocabulary through – the "possible but expensive" option.) Also, the score_ functions only work for hierarchical softmax and, as far as I know, they haven't been tested with Doc2Vec models (though they might work, or be fairly easy to get working).
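
If someone did want a most-likely-word style query anyway, one brute-force route along those "possible but expensive" lines would be to score the same context with every candidate word substituted in. A hypothetical helper, assuming an hs=1 model with the score() method above:

def rank_candidates(model, context_words, candidates):
    # Score "context + candidate" as a whole sentence for each candidate word,
    # then rank candidates by that log-probability. Expensive: one scoring pass
    # per candidate, and each score also reflects the context words themselves.
    sentences = [context_words + [cand] for cand in candidates]
    scores = model.score(sentences, total_sentences=len(sentences))
    return sorted(zip(candidates, scores), key=lambda pair: -pair[1])

# e.g. rank_candidates(model, "the cat sat on the".split(), ["mat", "moon", "idea"])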

Some of the word2vec.c defaults that have changed since the original port are:

We could certainly match those, but it might break/degrade user code that depends on the old choices.

Since there isn't yet strong evidence for an "obviously best for most purposes" set of parameters, I wouldn't mind forcing users to supply explicit choices. (The current setup is nice for quick starts and compact code examples, but seems to have led people to be overconfident in the wisdom of the defaults. If forced to choose, most would still just start out by copying the choices in any prominent examples or mentioned in the doc-comments, but they might be more aware of how tentative those implied values are.)

In my Doc2Vec testing, hierarchical softmax only seemed to outperform negative-sampling on small-data, short-training tests. It gets off to a faster start, but with patience negative-sampling reaches better evaluation scores.

guy4261 commented 9 years ago

Thanks for the lengthy replies!!!!!

I've read them and have been led to another question: Is it possible to train a Word2Vec model using gensim (not the original C tool!) then load it into a gensim Doc2Vec model and use it for inferring document vectors?

(That is similar to what is described in Mikolov and Quoc Le's paper as gradient descending on D while holding W, U, b fixed.)

In gensim, if I trained a Word2Vec model and saved it, trying to load it using Doc2Vec.load() returns a Word2Vec object (logical, since the Doc2Vec class extends the Word2Vec class). However, the loaded Word2Vec model also doesn't contain syn1, even though it was trained using gensim!!!

If I still have the word2vec model loaded in memory, with its syn1 initialized properly, can I perhaps copy it into Doc2Vec? I am using gensim 0.12.2.

Thanks again!!!

gojomo commented 9 years ago

The Doc2Vec infer_vector() method is intended to be exactly the Mikolov-Le gradient-descent-while-holding-the-model-fixed.

In my tests, if there was a syn1 before the save(), it's back after the load(). Note that syn1 won't exist at all if using hs=0 mode. (Negative-sampling instead uses syn1neg.)
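
For reference, a minimal end-to-end sketch of training a tiny Doc2Vec model, round-tripping it through save()/load(), and inferring a vector (parameter names as in gensim of this era, e.g. size and iter; later releases renamed several of them):

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

docs = [TaggedDocument(words="the cat sat on the mat".split(), tags=["doc0"]),
        TaggedDocument(words="the dog chased the cat".split(), tags=["doc1"])]
# hs=1 so the model has a syn1 hidden layer.
model = Doc2Vec(docs, size=50, min_count=1, iter=20, hs=1, negative=0)

model.save("toy_d2v.model")
loaded = Doc2Vec.load("toy_d2v.model")
print(hasattr(loaded, "syn1"))   # the hidden layer survives the save/load round-trip
vec = loaded.infer_vector("a cat on a mat".split())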

You could try to copy/move the state from a Word2Vec instance into a Doc2Vec instance, and plausibly, with analogous Doc2Vec settings, the resulting model would be able to do a form of doc-vector inference. (I haven't tried it, so other code/state patching might be necessary.)

However, compared to a true Doc2Vec model (where full-document vectors also influenced the weights), such a model might have less expressiveness, having become more specialized in word-to-word predictiveness. It'd be worth testing both variants for your intended purpose.
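
A rough, untested sketch of what that copy/move might look like, assuming 0.12-era attribute names (syn0, syn1, syn0_lockf, vocab, index2word), a source Word2Vec model w2v trained with hs=1, and Doc2Vec settings chosen to mirror it; as noted above, further patching may well be needed:

from gensim.models import Doc2Vec
import numpy

# Settings should mirror those used to train w2v (vector size, window, hs mode, ...).
d2v = Doc2Vec(dm=1, size=300, window=5, hs=1, negative=0)

# Copy the trained word-vector and hidden-layer state across.
d2v.vocab = w2v.vocab
d2v.index2word = w2v.index2word
d2v.syn0 = w2v.syn0
d2v.syn1 = w2v.syn1   # hierarchical-softmax hidden layer
d2v.syn0_lockf = numpy.ones(len(w2v.syn0), dtype=numpy.float32)

# infer_vector() holds the word vectors and hidden layer fixed and
# gradient-descends only the new document vector.
vec = d2v.infer_vector("man in japanese restaurant".split())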

tmylk commented 8 years ago

@guy4261 Does this sound good to you? Can this issue be closed?

tmylk commented 8 years ago

Closing as abandoned