Updating W2V - Githubissues

piskvorky / gensim

Topic Modelling for Humans

GNU Lesser General Public License v2.1

15.55k stars, 4.37k forks

Problem description

I am trying to update the pre-trained word2vec Google News model using a corpus that I have. This code used to work when I was using gensim v3.x. After reading the migration notes and applying the changes, I still get an error from intersect_word2vec_format. I would appreciate any help on this issue.

from gensim.models import KeyedVectors, Word2Vec
from nltk.tokenize import RegexpTokenizer

pretrained_path = "./GoogleNews-vectors-negative300.bin"
tokenizer = RegexpTokenizer(r'\w+')
sentences_tokenized = [['Hey', 'these', 'are', 'some', 'new', 'words'], ['Lets', 'expand', 'the', 'vocab']]
model_2 = Word2Vec(vector_size=300, min_count=1)
model_2.build_vocab(sentences_tokenized)
total_examples = model_2.corpus_count
model = KeyedVectors.load_word2vec_format(pretrained_path, binary=True)
model_2.build_vocab([list(model.key_to_index.keys())], update=True)
model_2.intersect_word2vec_format(pretrained_path, binary=True, lockf=1.0) # lockf=0.0 means no change
model_2.train(sentences_tokenized, total_examples=total_examples, epochs=model_2.epochs)

gensim==4.0.1

The output:

 model_2.build_vocab([list(model.key_to_index.keys())], update=True)
  File "/home/miniconda3/lib/python3.8/site-packages/gensim/models/word2vec.py", line 486, in build_vocab
    self.prepare_weights(update=update)
  File "/home/miniconda3/lib/python3.8/site-packages/gensim/models/word2vec.py", line 844, in prepare_weights
    self.update_weights()
  File "/home/miniconda3/lib/python3.8/site-packages/gensim/models/word2vec.py", line 865, in update_weights
    raise RuntimeError(
RuntimeError: You cannot do an online vocabulary-update of a model which has no prior vocabulary. First build the vocabulary of your model with a corpus before doing an online update.

I note that your report mentions an error in intersect_word2vec_format() (which has a known issue, with a possible temporary workaround, as per #3094). But the error stack you've shown is for a different, earlier error. Is that stack generated by exactly the code you've provided?

While we want everything that used to work pre-4.0 to still work, I should also note that your recipe here may not be sensible. .intersect_word2vec_format() is an experimental option from a while back. The update=True option is another not-very-well-supported option, without any good examples of its use and with lots of caveats about its results. Mixing the two as you've done here adds even more complications. In particular, only the words in your local corpus (sentences_tokenized) have meaningful word frequencies for your upcoming training. All the words imported via the update appear just once, making their participation in later steps, such as the frequency-weighted negative sampling during training, totally unlike normal word2vec. (Further, though it may just be an artifact of your tiny example, min_count=1 is almost always a bad idea in word2vec models.)
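To make the frequency point concrete, here is a minimal standalone sketch in plain Python (it does not call gensim). It assumes the usual word2vec negative-sampling scheme, in which a word is drawn with probability proportional to its count raised to the 0.75 power. The toy corpus, the imported_word_N names, and the negative_sampling_probs helper are all hypothetical and exist only for illustration.

```python
from collections import Counter


def negative_sampling_probs(counts, power=0.75):
    """Return each word's chance of being drawn as a negative sample.

    Assumption: probability is proportional to count ** power,
    the scheme commonly used by word2vec implementations.
    """
    weights = {word: count ** power for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


# Hypothetical local corpus: three words, each seen 50 times.
local_corpus = [["expand", "the", "vocab"]] * 50
counts = Counter(word for sentence in local_corpus for word in sentence)

# Simulate the update step from the report: the imported keys arrive as a
# single "sentence", so every imported word is counted exactly once.
imported_keys = [f"imported_word_{i}" for i in range(1000)]  # hypothetical
counts.update(imported_keys)

probs = negative_sampling_probs(counts)
imported_mass = sum(probs[word] for word in imported_keys)

print(f"one local word:              {probs['vocab']:.4f}")
print(f"one imported word:           {probs['imported_word_0']:.4f}")
print(f"all imported words together: {imported_mass:.4f}")
```

In this sketch every imported word ends up with the same sampling probability, whatever its real-world frequency would be. That flat distribution is the distortion described above.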

Altogether, that means even if we apply fixes or workarounds to get this code to run without blatant execution errors, I'd not put much stock in its outputs without lots of further analysis and evaluation of the tradeoffs created by this improvised, non-standard hybrid process.

Updating W2V #3195