ChristianAngel opened this issue 4 years ago
For now, this usage pattern hasn't been designed for, and isn't supported. Simply adding new vectors to a `w2v_model.wv` object doesn't update all aspects of the model necessary for training. In particular, it may not have the word-frequency info needed for future negative-sampling/subsampling, and it doesn't expand the output layer of the model for the new words. (That could be relatively straightforward for negative sampling, where each predicted word has one output node, but potentially a big mess destroying all existing internal weights for hierarchical softmax, as potentially 'all' words' pre-existing Huffman codings would change.)
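For instance, a minimal sketch (assuming gensim 3.8.x internals such as `model.trainables.syn1neg`) of how `wv.add()` leaves the input and output layers out of sync:

```python
import numpy as np
from gensim.models import Word2Vec

# Tiny model with negative sampling (the default); size=10 for brevity.
model = Word2Vec([["hello", "world"], ["hello", "gensim"]], min_count=1, size=10)

# Under negative sampling, the output layer has one row per vocab word:
assert model.trainables.syn1neg.shape[0] == len(model.wv.vocab)

# Adding a vector only grows the input-side KeyedVectors...
model.wv.add("newword", np.random.rand(10))

# ...so the output layer no longer matches, and the new Vocab entry lacks
# training-only attributes (sample_int; code/point under hierarchical softmax):
print(len(model.wv.vocab), model.trainables.syn1neg.shape[0])  # 4 vs 3
```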
It could plausibly be made to work as a new feature, more sensibly in the negative-sampling case. But many tricky balance/effectiveness issues with incremental model expansion & retraining make this an experimental/advanced option; for most users the safe course would be retraining on a combined corpus.
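A sketch of that safe course, alongside gensim's supported incremental path (`build_vocab(..., update=True)`), which expands the vocabulary from new *text* rather than grafting in foreign vectors:

```python
from gensim.models import Word2Vec

corpus_a = [["hello", "world"], ["hello", "gensim"]]
corpus_b = [["brand", "new", "words"]]

# Safe course: retrain from scratch on the combined corpus.
combined = Word2Vec(corpus_a + corpus_b, min_count=1, size=10)

# Supported incremental path: expanding the vocab from new text updates
# word frequencies and the output layer (unlike wv.add()), though the
# balance/effectiveness caveats above still apply.
model = Word2Vec(corpus_a, min_count=1, size=10)
model.build_vocab(corpus_b, update=True)
model.train(corpus_b, total_examples=len(corpus_b), epochs=model.epochs)
```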
Potentially, we could include better warnings when this unsupported-but-tempting approach is tried. A `KeyedVectors` that's 'inside' a `Word2Vec` (or other) model might have some sort of 'lock' bit set, or have the model registered as its 'owner', which could veto, error, or warn on any other attempts to mutate the KV while the model depends on it (see the hypothetical sketch below). Or, the containing `Word2Vec` model might simply keep some redundant measures of the KV it's relying upon, and error/warn when it notices other mutations that might make it no longer matched to the outer model.
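A purely hypothetical sketch of the 'owner veto' idea (none of these names exist in gensim):

```python
class GuardedKeyedVectors:
    """KeyedVectors wrapper whose mutations can be vetoed by an owning model."""

    def __init__(self):
        self._owner = None  # a containing model would set this during training

    def _check_mutation(self, op):
        # Veto mutation while a model depends on these vectors.
        if self._owner is not None:
            raise RuntimeError(
                "vectors are owned by %r; %s() would desync the model's "
                "internal weights" % (self._owner, op))

    def add(self, entities, weights, replace=False):
        self._check_mutation("add")
        # ... delegate to the normal add logic ...
```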
But, armoring against all potential programmer-driven functionality-breaking mutations would be an endless task. Perhaps just add a doc note to the models that have a `.wv` subcomponent: "You should treat this as read-only, and otherwise essentially private to the containing model, for the duration of any of the model's intended training."
Problem description
Using the `add()` function to add new word vectors to a model from a different model, and then having the first model retrain on its own dataset, causes an exception. Given two separate Word2Vec models trained on different data, we are trying to retrain one model after adding word vectors that are present in the other model but not originally present in the first.
Expected result: Retraining succeeds.
Actual result: Retraining fails with `AttributeError: 'Vocab' object has no attribute 'code'`
Steps/code/corpus to reproduce
Minimal reproducible example:
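(The original snippet did not come through; the following is a minimal sketch consistent with the description above. The corpora and sizes are assumptions, and `hs=1, sample=0` is chosen so the failure matches the `code` attribute named in the reported error; other settings may fail on a different missing attribute such as `sample_int`.)

```python
from gensim.models import Word2Vec

corpus_a = [["human", "interface", "computer"]] * 10
corpus_b = [["graph", "minors", "survey"]] * 10

# Two independently trained models on different data:
model_a = Word2Vec(corpus_a, min_count=1, size=10, hs=1, negative=0, sample=0)
model_b = Word2Vec(corpus_b, min_count=1, size=10, hs=1, negative=0, sample=0)

# Graft words that exist only in model_b into model_a's KeyedVectors:
new_words = [w for w in model_b.wv.vocab if w not in model_a.wv.vocab]
model_a.wv.add(new_words, [model_b.wv[w] for w in new_words])

# Retraining fails as soon as a batch contains one of the grafted words,
# whose Vocab entries lack the hs-only 'code'/'point' attributes:
model_a.train(corpus_a + corpus_b,
              total_examples=len(corpus_a + corpus_b),
              epochs=model_a.epochs)
# AttributeError: 'Vocab' object has no attribute 'code'
```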
Output (abridged): `AttributeError: 'Vocab' object has no attribute 'code'`
Versions
Linux-3.10.0-862.2.3.el7.x86_64-x86_64-with-centos-7.5.1804-Core
Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) [GCC 7.3.0]
NumPy 1.16.4
SciPy 1.3.0
gensim 3.8.1
FAST_VERSION 1