piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim

Errors when continuing training Doc2Vec model with previously-saved wv #2690

Open xinxu75 opened 4 years ago

xinxu75 commented 4 years ago

Problem description

What are you trying to achieve? What is the expected result? What are you seeing instead?

I am trying to continue training a previously saved Doc2Vec model, updating only the docvec weights and not the wordvec weights (i.e. freezing the wv weights during the subsequent training). After some searching, I did it the following way (using .load_word2vec_format, since the latest Gensim disabled intersect_word2vec_format in Doc2Vec). However, train() raises an error. Is this a bug, or did I do something wrong? The code is only three lines and is easily reproduced with any corpus of TaggedDocuments. Your help is much appreciated.

Steps/code/corpus to reproduce

Include full tracebacks, logs and datasets if necessary. Please keep the examples minimal ("minimal reproducible example").

code:

from gensim.models import Doc2Vec
from gensim.models.keyedvectors import Word2VecKeyedVectors

d2v = Doc2Vec.load('previously_saved_d2v_model_file')
# the file was previously saved using Doc2Vec.save_word2vec_format; according to
# the Gensim docs this should freeze wv by not updating it
d2v.wv = Word2VecKeyedVectors.load_word2vec_format('previously_saved_w2v_binfile', binary=True)
d2v.train(documents, total_examples=d2v.corpus_count, epochs=d2v.epochs, start_alpha=0.01, end_alpha=0.001)

code for previously saving the doc2vec/word2vec files:

prev_d2v = Doc2Vec(vector_size=512,
              epochs=9,
              min_count=1,
              max_vocab_size=None,
              window=1024,
              hs=0,
              negative=90,
              workers=2,
              dm=0, # DBOW
              dbow_words=1 # also train skip-gram word-vectors
             )
prev_d2v.build_vocab(docs)
prev_d2v.train(docs,
          total_examples=prev_d2v.corpus_count,
          epochs=prev_d2v.epochs,
          start_alpha=0.01,
          end_alpha=0.001
         )
prev_d2v.save_word2vec_format('previously_saved_w2v_binfile', binary=True)
prev_d2v.save('previously_saved_d2v_model_file')

The full error tracebacks are:

-------------------------------------
Exception in thread Thread-8:
Traceback (most recent call last):
  File "c:\users\documents\miniconda_python\lib\threading.py", line 916, in _bootstrap_inner
    self.run()
  File "c:\users\documents\miniconda_python\lib\threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "c:\users\documents\miniconda_python\lib\site-packages\gensim\models\base_any2vec.py", line 211, in _worker_loop
    tally, raw_tally = self._do_train_job(data_iterable, job_parameters, thread_private_mem)
  File "c:\users\documents\miniconda_python\lib\site-packages\gensim\models\doc2vec.py", line 721, in _do_train_job
    doctag_vectors=doctag_vectors, doctag_locks=doctag_locks
  File "gensim/models/doc2vec_inner.pyx", line 344, in gensim.models.doc2vec_inner.train_document_dbow
AttributeError: 'Vocab' object has no attribute 'sample_int'
Exception in thread Thread-7:
Traceback (most recent call last):
  File "c:\users\documents\miniconda_python\lib\threading.py", line 916, in _bootstrap_inner
    self.run()
  File "c:\users\documents\miniconda_python\lib\threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "c:\users\documents\miniconda_python\lib\site-packages\gensim\models\base_any2vec.py", line 211, in _worker_loop
    tally, raw_tally = self._do_train_job(data_iterable, job_parameters, thread_private_mem)
  File "c:\users\documents\miniconda_python\lib\site-packages\gensim\models\doc2vec.py", line 721, in _do_train_job
    doctag_vectors=doctag_vectors, doctag_locks=doctag_locks
  File "gensim/models/doc2vec_inner.pyx", line 344, in gensim.models.doc2vec_inner.train_document_dbow
AttributeError: 'Vocab' object has no attribute 'sample_int'

Versions

Please provide the output of:

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)

Windows-10-10.0.14393-SP0
Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
NumPy 1.17.2
SciPy 1.3.1
gensim 3.8.1
FAST_VERSION 0

gojomo commented 4 years ago

Thanks for the clear & complete issue report.

Note that there's no official support for loading word2vec vectors into a Doc2Vec model – trying to cobble it together requires a lot of calling things in non-standard ways, plus patching up the resulting model. For example, the error you're getting occurs because the word2vec vectors you've loaded (via d2v.wv = Word2VecKeyedVectors.load_word2vec_format('previously_saved_w2v_binfile', binary=True)) don't have all the values that would have been initialized by a true, start-of-training vocabulary-discovery phase. (And save_word2vec_format(), which saves only the vectors, doesn't even retain that info when it's present.)

But further, simply loading vectors from elsewhere doesn't lock them against further updates. If you got past this error, your training would continue to update those vectors. There is another (experimental) feature that might help: a vectors_lockf value which can be used to down-scale, or suppress entirely, the updates of particular vectors. You could try saving the entire Doc2Vec model, then re-loading it (via .save() and .load()) – but after re-loading, set every index of d2v.wv.vectors_lockf to 0.0. In one sense, this should do what you've literally asked for: the word-vectors themselves shouldn't change further.
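A minimal sketch of that approach (hedged: the exact home of the lock-factor array differs between gensim versions – in 3.x it may live under model.trainables rather than model.wv, as comes up later in this thread; 'documents' is the batch of TaggedDocuments from the original code):

from gensim.models import Doc2Vec

d2v = Doc2Vec.load('previously_saved_d2v_model_file')

# find the per-word lock-factor array; its attribute path varies by gensim version
lockf = getattr(d2v.wv, 'vectors_lockf', None)
if lockf is None:
    lockf = d2v.trainables.vectors_lockf  # assumed gensim 3.x location

# 0.0 means "do not update" the corresponding word-vector during training
lockf[:] = 0.0

d2v.train(documents, total_examples=len(documents), epochs=d2v.epochs)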

But it may not have the overall effect you desire, as your new training data will still be adjusting the other internal weights of the model. And, since there's no support for expanding either the word-vocabulary or the doc-tag vocabulary of a Doc2Vec model, the only possible benefit of the new training will be adjustments to the vectors of pre-existing doc-tags.

(Separately: any followup train() calls should use a total_examples count that matches the batch of data actually being fed in. And it's murky how the learning-rate alpha should be balanced between older and newer training, especially if the new data covers only some of the original tokens/tags, as new training may drive the internal weights related to new examples arbitrarily far from compatible alignment with weights adjusted only in earlier training sessions.)
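As a hedged illustration of the total_examples point (new_docs here is a hypothetical list of new TaggedDocuments, not something from the original code):

# total_examples should describe the batch actually being passed in,
# not the corpus_count remembered from the original training run
d2v.train(new_docs,
          total_examples=len(new_docs),
          epochs=d2v.epochs,
          start_alpha=0.002,  # choosing good incremental-training alphas is murky
          end_alpha=0.001)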

Why are you interested in continuing training, and specifically continuing training with some aspects of the model frozen against updates? Are you sure that either retraining the whole model, or just using infer_vector() on new texts (between occasional larger full retrainings), wouldn't be sufficient?

xinxu75 commented 4 years ago

@gojomo, thanks for the detailed explanation and suggestion. I'll try setting d2v.wv.vectors_lockf=0 and see how it goes.

In the meantime, it appears an older Gensim version (3.2.0) supports this more conveniently/elegantly. I experimented with Doc2Vec.intersect_word2vec_format with lockf=0 and it seems to work: after calling this function, the docvecs continue updating while wv no longer updates, which is the desired effect.

As for your question, the reason I need this functionality is to experiment with different ways of training wordvecs and docvecs – I want to disentangle the effects on wv and dv, i.e. fix/freeze a given set of wv and see how well different ways of training the docvecs perform. Within the Gensim environment, I have only found one way to do so, in two steps: first generate the wv and save them; second, load/freeze the wv and continue training the dv (or maybe reset the dv and restart). Is there any other/better way to flexibly turn updating of wv/dv on and off for this purpose? Any suggestions would be much appreciated.

gojomo commented 4 years ago

You could certainly review the source of intersect_word2vec_format() & manually perform a similar overwriting of an existing model's vector values, and matching use of vectors_lockf[n]=0.0, to do anything the earlier version did in current versions.
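A rough sketch of what that might look like against a 3.8-era model (attribute names such as wv.vocab, wv.vectors and trainables.vectors_lockf are assumptions about that version's layout, not a documented API):

from gensim.models import Doc2Vec
from gensim.models.keyedvectors import Word2VecKeyedVectors

d2v = Doc2Vec.load('previously_saved_d2v_model_file')
external = Word2VecKeyedVectors.load_word2vec_format('previously_saved_w2v_binfile', binary=True)

for word, vocab in d2v.wv.vocab.items():
    if word in external.vocab:
        # overwrite the in-model word-vector with the externally loaded one...
        d2v.wv.vectors[vocab.index] = external[word]
        # ...and lock it so further training leaves it unchanged (0.0 = frozen)
        d2v.trainables.vectors_lockf[vocab.index] = 0.0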

But still: since the other internal weights of the model are being updated by further training – including word->word training (via dbow_words=1) between words that are individually "locked" – I'd again warn the effects might not be quite what you intend, or easy to characterize. And you still won't be expanding the model to include any new words or doc-tags (since there's no support for expanding the set of recognized doc-tags, and even the vocabulary-expansion support inherited from Word2Vec has a lingering crashing bug #1019).

xinxu75 commented 4 years ago

@gojomo, I followed your suggestion to load the Doc2Vec model and set its wv.vectors_lockf=0, but it looks like there is no vectors_lockf attribute on d2v.wv (AttributeError: 'Word2VecKeyedVectors' object has no attribute 'vectors_lockf'). This happens in 3.8.1.

Instead, there appears to be d2v.syn0_lockf. Is this the np array that should be set to 0.0? Thanks for your help.

gojomo commented 4 years ago

I think it may actually be any _lockf arrays in the d2v_model.trainables object - one will be for the word-vectors, another will be for the doc-vectors. (And, anything at d2v_model.syn0_lockf might just be a compatibility pass-through, or other redundancy.)
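Something along these lines could confirm which arrays exist and zero only the word-vector one (a sketch; names like vectors_lockf and vectors_docs_lockf are assumptions about the 3.8 layout):

from gensim.models import Doc2Vec

d2v = Doc2Vec.load('previously_saved_d2v_model_file')

# list every lock-factor array hanging off the trainables object
for name in dir(d2v.trainables):
    if name.endswith('_lockf'):
        print(name, getattr(d2v.trainables, name).shape)

# zero the word-vector locks only, leaving the doc-vector locks at 1.0
d2v.trainables.vectors_lockf[:] = 0.0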

arshad171 commented 4 years ago

I know that Gensim word2vec supports online learning, and doc2vec uses word2vec internally. So I was wondering: suppose a doc2vec model is first trained on one set of docs (set_1), its doc vectors are stored, and the model is then trained online on further data.

Now, wouldn't the pre-computed (stored) doc vectors for the first set of docs, set_1, be outdated? [Because after the 2nd training period, the model might have added new word vectors or modified existing ones.]

If yes, then how can doc2vec be used on big data? Would the doc vectors need to be recomputed after each online training period?

gojomo commented 4 years ago

@arshad171 You are always free to call train() with more data on the various gensim Word2Vec/Doc2Vec/FastText models. And, it will continue to update any words/internal-weights/etc in accordance with the extra training-examples, and current learning-rate alpha.

But it won't necessarily make sense, for the vocabulary-coverage and learning-rate-balance reasons discussed earlier in this thread.

The most grounded course when new data arrives is to create a new combined corpus with all data, and train again from scratch, discarding all prior vectors. Potentially, you could speed the convergence of this new model by re-using some of the state from the older model, at the cost of some murky balance issues. (You might also be able to enforce some level of compatibility of vector spaces, if you locked some/most of the model against updates.)

Some gensim models (Word2Vec, FastText) have support for expanding the known vocabulary between training sessions, via the build_vocab(..., update=True) option; see the sketch just below. This is at best considered experimental. It still has all the balance issues mentioned above. The handling of leftover state in HS mode may not make any sense at all (and may be worse than throwing away all prior state). It's never been debugged to work in Doc2Vec, with reports of memory-fault crashes.
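For the plain Word2Vec case, that experimental pattern looks roughly like this (a sketch only; sentences_batch_1 / sentences_batch_2 are hypothetical token-list corpora, and all the balance caveats above apply):

from gensim.models import Word2Vec

model = Word2Vec(sentences_batch_1, min_count=5)       # initial vocab scan + training

# experimental: fold new words into the existing vocabulary
model.build_vocab(sentences_batch_2, update=True)
model.train(sentences_batch_2,
            total_examples=model.corpus_count,         # corpus_count now reflects batch 2
            epochs=model.epochs)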

You can use Doc2Vec on arbitrarily-big training sets, but the size of the model in addressable memory will be a function of the number of unique words and document-ids targeted for learning during training. If you need to expand the known vocabulary, I'd recommend retraining on the full dataset. If you just need to create doc-vectors for new texts, you can use inference via the infer_vector() method for any number of new texts. (The model won't grow at all, and any new words will be ignored, but the vectors calculated will remain coordinate-compatible with the model, and any other inferred-vectors.)
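For instance (a sketch; the text being preprocessed is just an example):

from gensim.models import Doc2Vec
from gensim.utils import simple_preprocess

d2v = Doc2Vec.load('previously_saved_d2v_model_file')

# infer a vector for an unseen text; the model itself is not modified
new_tokens = simple_preprocess("an example of a brand new document")
vec = d2v.infer_vector(new_tokens)

# the result stays coordinate-compatible with the trained doc-vectors
print(d2v.docvecs.most_similar([vec], topn=5))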

arshad171 commented 4 years ago

@gojomo Thanks for the quick response. You clarified a lot of things :)