piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Word2Vec Retraining Fails after Adding New Word Vectors from another Model #2803

Open ChristianAngel opened 4 years ago

ChristianAngel commented 4 years ago

Problem description

Using the add() function to copy word vectors from a different model into a model, then retraining the receiving model on its own dataset, raises an exception. Given two separate Word2Vec models trained on different data, we are trying to retrain one model after adding the word vectors that are present in the other model but not originally present in the first model.

Expected result: Retraining succeeds.

Actual result: Retraining fails with: AttributeError: 'Vocab' object has no attribute 'code'

Steps/code/corpus to reproduce

Minimal reproducible example:

import gensim.downloader
from gensim.models import Word2Vec

print("Loading text8 as list. This should succeed.")
dataset = list(gensim.downloader.load("text8"))

print("Splitting text8. This should succeed.")
dataset1 = dataset[:int(len(dataset)/2)]
dataset2 = dataset[int(len(dataset)/2):]

print("Training model1. This should succeed.")
model1 = Word2Vec(dataset1, size=300, workers=1, negative=0, hs=1, sample=0)

print("Training model2. This should succeed.")
model2 = Word2Vec(dataset2, size=300, workers=1, negative=0, hs=1, sample=0)

print("Initiating first retraining. This should succeed.")
model1.train(dataset1, total_examples=len(dataset1), epochs=model1.epochs)

# Based on the documentation, this is the idiom for adding the word vectors that are present in model2 but not in model1                     
print("Adding vocab from model2. This should succeed.")
model1.wv.add(list(model2.wv.vocab.keys()), model2.wv.syn0, replace=False)

print("Initiating second retraining. This fails.")
model1.train(dataset1, total_examples=len(dataset1), epochs=model1.epochs)

Output:

Loading text8 as list. This should succeed.
Splitting text8. This should succeed.
Training model1. This should succeed.
Training model2. This should succeed.
Initiating first retraining. This should succeed.
Adding vocab from model2. This should succeed.
Initiating second retraining. This fails.
Exception in thread Thread-31:
Traceback (most recent call last):
  File "/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "~/.local/lib/python3.6/site-packages/gensim/models/base_any2vec.py", line 211, in _worker_loop
    tally, raw_tally = self._do_train_job(data_iterable, job_parameters, thread_private_mem)
  File "~/.local/lib/python3.6/site-packages/gensim/models/word2vec.py", line 821, in _do_train_job
    tally += train_batch_cbow(self, sentences, alpha, work, neu1, self.compute_loss)
  File "gensim/models/word2vec_inner.pyx", line 638, in gensim.models.word2vec_inner.train_batch_cbow
AttributeError: 'Vocab' object has no attribute 'code'

Versions

Linux-3.10.0-862.2.3.el7.x86_64-x86_64-with-centos-7.5.1804-Core
Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) [GCC 7.3.0]
NumPy 1.16.4
SciPy 1.3.0
gensim 3.8.1
FAST_VERSION 1

gojomo commented 4 years ago

For now, this usage pattern hasn't been designed-for, and isn't supported.

Simply adding new vectors to a w2v_model.wv object doesn't update all the aspects of the model necessary for training. In particular, it may not have the word-frequency info needed for future negative-sampling/subsampling, and it doesn't expand the output layer of the model for the new words. (That could be relatively straightforward for negative-sampling, where each predicted word has one output node, but potentially a big mess destroying all existing internal weights for hierarchical-softmax, since potentially 'all' words' pre-existing Huffman codings are changed.)
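To make the mismatch concrete, here is a small diagnostic sketch against the reproducer above. It assumes gensim 3.8.x internals (wv.vocab, trainables.syn1); those attribute names are version-specific and the checks are purely illustrative:

# Inspect model1 from the reproducer after model1.wv.add(...).
# The vocabulary grew, but the hidden->output weights for hierarchical
# softmax (trainables.syn1) were not expanded to match.
print("vocab size:      ", len(model1.wv.vocab))
print("syn1 output rows:", model1.trainables.syn1.shape[0])

# The entries copied in via wv.add() also lack the Huffman-tree attributes
# ('code'/'point') that hs training reads, which is the AttributeError above.
missing = [w for w, v in model1.wv.vocab.items() if not hasattr(v, 'code')]
print("words without a Huffman code:", len(missing))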

It could plausibly be made to work as a new feature – more sensibly in the negative-sampling case. But many tricky balance/effectiveness issues with incremental model expansion & retraining make this an experimental/advanced option, and for most users the safe course would be retraining on a combined corpus.
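For reference, the safer alternatives look roughly like the sketch below (parameters mirror the reproducer; whether build_vocab(..., update=True) behaves well for a particular setup, especially with hierarchical softmax, is a separate question):

from gensim.models import Word2Vec

# Option 1: retrain from scratch on the combined corpus (safest).
combined = dataset1 + dataset2
combined_model = Word2Vec(combined, size=300, workers=1, negative=0, hs=1, sample=0)

# Option 2: grow the existing model's vocabulary through the supported API,
# then continue training; this updates the internal structures that a raw
# wv.add() skips.
model1.build_vocab(dataset2, update=True)
model1.train(dataset2, total_examples=len(dataset2), epochs=model1.epochs)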

gojomo commented 4 years ago

Potentially, we could include better warnings when this unsupported-but-tempting approach is tried. A KeyedVectors that's 'inside' a Word2Vec (or other) model might have some sort of 'lock' bit set, or have the model registered as its 'owner', which could then veto/error/warn on any other attempts to mutate the KV while the model depends on it. Or, the containing Word2Vec model might simply keep some redundant measures of the KV it's relying upon, and error/warn when it notices other mutations that might make it no longer matched to the outer model.
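Purely to illustrate the 'owner' idea (this is not existing gensim API; all names and behavior below are hypothetical), such a guard might look roughly like:

import warnings

class GuardedKeyedVectors:
    """Hypothetical sketch of the 'owner veto' idea: a containing model
    registers itself as owner, and mutating calls warn while an owner is set.
    Not part of gensim; names are illustrative only."""

    def __init__(self, kv):
        self._kv = kv
        self._owner = None

    def set_owner(self, model):
        self._owner = model

    def add(self, *args, **kwargs):
        if self._owner is not None:
            warnings.warn(
                "these KeyedVectors are owned by a %s; mutating them directly "
                "may desynchronize them from the model's training structures"
                % type(self._owner).__name__)
        return self._kv.add(*args, **kwargs)

    def __getattr__(self, name):
        # Delegate all other attribute access to the wrapped KeyedVectors.
        return getattr(self._kv, name)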

But, armoring against all potential programmer-driven functionality-breaking mutations would be an endless task. Perhaps just add a doc note to the models that have a .wv subcomponent: "You should treat this as read-only, and otherwise essentially private to the containing model, for the duration of any of the model's intended training."