Open negar-mokhberian opened 3 years ago
I note that your report mentions an error in intersect_word2vec_format() (which has a known issue, with a possible temporary workaround, as per #3094). But your displayed error stack shows a different, earlier error. Is that stack generated from exactly the code you've provided?
While we want everything that used to work pre-4.0 to still work, I should also note that your recipe here may not be sensible. .intersect_word2vec_format() is one experimental option from a while back. The update=True option is another not-very-well-supported option, without any good examples of its use and with lots of caveats to its results. Mixing the two as you've done here adds even more complications: in particular, only the words in your local corpus (sentences_tokenized) have meaningful word frequencies for your upcoming training. All the words imported via the update just appear once, making their participation in later steps, like say the frequency-weighted negative sampling during training, totally unlike normal word2vec. (Further, though it may just be an artifact of your tiny example, min_count=1 is almost always a bad idea in word2vec models.)
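To make the frequency concern concrete, here is a minimal sketch of how frequency-weighted negative sampling treats words whose count is a placeholder 1. It is plain Python, not gensim's internal code, and it assumes the conventional 0.75 exponent on raw counts (the default of gensim's ns_exponent parameter). The word names and counts are hypothetical, chosen only for illustration.

```python
from collections import Counter


def negative_sampling_probs(counts, exponent=0.75):
    """Return each word's probability of being drawn as a negative sample.

    Weights are raw counts raised to `exponent`, then normalized to sum to 1.
    """
    weights = {word: count ** exponent for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


# Hypothetical vocabulary: two words counted in the local corpus,
# two words imported with a placeholder count of 1.
counts = Counter({
    "local_common": 1000,
    "local_rare": 10,
    "imported_a": 1,
    "imported_b": 1,
})

probs = negative_sampling_probs(counts)
for word, prob in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {prob:.4f}")
```

Under these assumptions, every imported word gets the same small sampling weight no matter how common it really is in the language, which is one way the hybrid process departs from normal word2vec training.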
Altogether, that means even if we apply fixes or workarounds to get this code to run without blatant execution errors, I'd not put much stock in its outputs without lots of further analysis and evaluation of the tradeoffs created by the improvised, non-standard hybrid process.
Problem description
I am trying to update the pre-trained word2vec Google News model based on a corpus that I have. This code used to work when I was using gensim v3.x. After reading the migration notes and applying the changes, I still get an error from intersect_word2vec_format. I would appreciate any help on this issue.
gensim==4.0.1
the output: