xinxu75 opened this issue 4 years ago (status: Open)
Thanks for the clear & complete issue report.
Note that there's no official support for loading word2vec vectors into a `Doc2Vec` model – and trying to cobble it together would require a lot of calling things in non-standard ways, and patching up the resulting models. For example, the error you're getting arises because the word2vec vectors you've loaded (via `d2v.wv = Word2VecKeyedVectors.load_word2vec_format('previously_saved_w2v_binfile', binary=True)`) don't have all the values that would have been initialized by a true, start-of-training vocabulary-discovery phase. (`save_word2vec_format()`, saving only vectors, doesn't even save that info when it's present.)
But further, simply loading vectors from elsewhere doesn't lock them against further updates. If you got past this error, your training would continue to update the vectors. There is another (experimental) feature that might help: a `vectors_lockf` value which can be used to down-scale, or suppress entirely, the updates to certain vectors. You could try saving the entire `Doc2Vec` model, then re-loading it (via `.save()` and `.load()`) – but after re-loading, set every index of `d2v.wv.vectors_lockf` to `0.0`. In one way, this should do what you've literally asked for: the word-vectors themselves shouldn't change further.
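To make the mechanism concrete, here's a plain-numpy sketch of how a per-vector lock factor gates SGD updates (the `vectors` and `vectors_lockf` arrays here are illustrative stand-ins, not actual gensim internals – but the principle is the same: each row's update is multiplied by its lock factor, so a `0.0` freezes that row):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.standard_normal((4, 3)).astype(np.float32)
frozen_before = vectors[2].copy()
free_before = vectors[0].copy()

# per-vector lock factors: 1.0 = train normally, 0.0 = fully frozen
vectors_lockf = np.array([1.0, 1.0, 0.0, 1.0], dtype=np.float32)

# one simulated SGD step: each row's update is scaled by its lock
# factor, so the row with lockf == 0.0 never moves
alpha = 0.025
gradient = rng.standard_normal((4, 3)).astype(np.float32)
vectors += alpha * vectors_lockf[:, np.newaxis] * gradient
```

Setting every entry of the lock array to `0.0`, as suggested above, freezes all word-vectors at once while other model weights keep training.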
But it may not have the overall effect you desire, as your new training data will still be adjusting the other internal weights of the model. And, as there's no support for expanding either the word-vocabulary or doc-tag vocabulary of a `Doc2Vec` model, training adjustments to prior doc-tags will be the only possible benefit of the new training.
(Separately, any follow-up `train()` calls should use a `total_examples` count that matches the current batch of data being fed. And it's murky to determine how the learning-rate `alpha` should be balanced between older and newer training, especially if the new data only covers some of the original tokens/tags, as new training may drive internal weights related to new examples arbitrarily far from compatible alignment with internal weights only adjusted in earlier training sessions.)
Why are you interested in continuing training, and specifically continuing training with some aspects of the model frozen against updates? Are you sure that either retraining the whole model, or using just `infer_vector()` on new texts (between occasional larger full retrainings), wouldn't be sufficient?
@gojomo, thanks for the detailed explanation and suggestion. I'll try setting `d2v.wv.vectors_lockf = 0` and see how it goes.
In the meantime, it appears an older Gensim version (3.2.0) supports this more conveniently/elegantly. I experimented with `Doc2Vec.intersect_word2vec_format` with `lockf=0`, and it seems to work – after calling this function, `d2v` continues updating while `wv` is no longer updating, which is the desired effect.
For your question, the reason I need this functionality is to experiment with different ways to train word-vectors and doc-vectors – I want to disentangle the effects on `wv` and `dv`, i.e. fix/freeze a given set of `wv` and see how well different ways to train doc-vectors perform. Within the Gensim environment, I only found one way to do so, in 2 steps: first generate `wv` & save; second load/freeze `wv` to continue training `dv` (or maybe reset `dv` to restart). Is there any other/better way to flexibly "turn on/off" updating `wv`/`dv` for this purpose? Much appreciated if you have any suggestions.
You could certainly review the source of `intersect_word2vec_format()` & manually perform a similar overwriting of an existing model's vector values, with a matching use of `vectors_lockf[n] = 0.0`, to do anything the earlier version did in current versions.
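As a rough sketch of that overwrite-and-lock pattern, using plain numpy stand-ins for the model's arrays (`model_vectors`, `vectors_lockf`, and the `external` mapping are all hypothetical names for illustration – the real attribute paths vary by gensim version, as discussed below):

```python
import numpy as np

# stand-ins for a model's word-vector matrix and its lock-factor array
model_vectors = np.ones((5, 4), dtype=np.float32)
vectors_lockf = np.ones(5, dtype=np.float32)

# externally-trained vectors to intersect, keyed by the model's
# word indexes (in the real case you'd match up words by string)
external = {1: np.full(4, 2.0, dtype=np.float32),
            3: np.full(4, 3.0, dtype=np.float32)}

for idx, vec in external.items():
    model_vectors[idx] = vec    # overwrite with the imported vector
    vectors_lockf[idx] = 0.0    # freeze that row against later training
```

Words not present in the external file keep their lock factor of `1.0` and continue training normally, which is essentially what `intersect_word2vec_format(..., lockf=0.0)` did.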
But still: since the other internal weights of the model are being updated by further training – including word→word training (via `dbow_words=1`) between words that are individually "locked" – I'd again warn the effects might not be quite what you intend, or easy to characterize. And you still won't be expanding the model to include any new words or doc-tags (since there's no support for expanding the set of recognized doc-tags, and even the vocabulary-expansion support inherited from `Word2Vec` has a lingering crashing bug, #1019).
@gojomo, I followed your suggestion to load the Doc2Vec model and set its `wv.vectors_lockf = 0`, but it looks like there is no such attribute (`AttributeError: 'Word2VecKeyedVectors' object has no attribute 'vectors_lockf'`)? This happens in 3.8.1. Instead there appears to be `d2v.syn0_lockf` – just wondering if this is the numpy array to be set to `0.0`? Thanks for your help.
I think it may actually be any `_lockf` arrays in the `d2v_model.trainables` object – one will be for the word-vectors, another will be for the doc-vectors. (And anything at `d2v_model.syn0_lockf` might just be a compatibility pass-through, or other redundancy.)
I know that Gensim word2vec supports online learning, and doc2vec uses word2vec internally. So I was just wondering: suppose I train a Doc2Vec model on an initial set of docs (set_1), store their doc vectors, and then later run a 2nd online training period with new data.
Now, wouldn't the pre-computed (stored) doc vectors for set_1 be outdated? (Because after the 2nd training period, the model might have added new word vectors or modified existing ones.) If yes, then how can doc2vec be used on big data – will it be required to recompute the doc vectors after each online training period?
@arshad171 You are always free to call `train()` with more data on the various gensim `Word2Vec`/`Doc2Vec`/`FastText` models. And it will continue to update any words/internal-weights/etc. in accordance with the extra training-examples and the current learning-rate `alpha`.
But it won't necessarily make sense, because any new `train()` data, versus the old data and its `alpha` backpropagation-adjustments, will disturb the original assumptions in arbitrarily complex ways, depending on all sorts of size/coverage differences between prior and new data.

The most grounded course when new data arrives is to create a new combined corpus with all data, and train again from scratch, discarding all prior vectors. Potentially, you could speed the convergence of this new model by re-using some of the state from the older model, at the cost of some murky balance issues. (You might also be able to enforce some level of compatibility of vector spaces, if you locked some/most of the model against updates.)
Some gensim models (`Word2Vec`, `FastText`) have support for expanding the known vocabulary between training sessions, via the `build_vocab(..., update=True)` option. This is at best considered experimental. It still has all the balance issues mentioned above. The handling of leftover state in HS mode may not make any sense at all (and may be worse than throwing away all prior state). It's never been debugged to work in `Doc2Vec`, with reports of memory-fault crashes.
You can use `Doc2Vec` on arbitrarily-big training sets, but the size of the model in addressable memory will be a function of the number of unique words and document-ids targeted for learning during training. If you need to expand the known vocabulary, I'd recommend retraining on the full dataset. If you just need to create doc-vectors for new texts, you can use inference via the `infer_vector()` method for any number of new texts. (The model won't grow at all, and any new words will be ignored, but the vectors calculated will remain coordinate-compatible with the model, and any other inferred vectors.)
@gojomo Thanks for the quick response. You clarified a lot of things :)
Problem description
I tried to continue training from a previously saved Doc2Vec model, and I only want to update docvec weights, not wordvec weights (i.e. freeze the wv weights during subsequent training). After some searching, I did it in the following way (using `.load_word2vec_format`, because the latest Gensim disabled `intersect_word2vec_format` in Doc2Vec). However, there is an error during `train()` – just wondering if this is a bug or whether I did something wrong (the code is as simple as 3 lines, easily reproduced with any corpus/TaggedDocument)? Much appreciated for your help.
Steps/code/corpus to reproduce
code:
code for previously saving the doc2vec/word2vec files:
The full error tracebacks are:
Versions
Windows-10-10.0.14393-SP0 Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)] NumPy 1.17.2 SciPy 1.3.1 gensim 3.8.1 FAST_VERSION 0