loretoparisi opened 7 years ago
`Word2Vec.load()` only loads models saved from gensim. (It uses Python pickling.) I believe the `.vec` file is in the format used by the original Google `word2vec.c` (and now FastText) for its top-level vectors, so `KeyedVectors.load_word2vec_format()` may work, perhaps with a `binary=False` parameter.
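For reference, that text format is just a `vocab_size vector_dim` header line followed by one word and its vector per line; a minimal sketch to sanity-check a file, assuming a local `wiki.en.vec`:

```python
# Peek at the plain-text .vec format: the first line is the header
# "vocab_size vector_dim"; every later line is "word v1 v2 ... vN".
with open('wiki.en.vec', encoding='utf-8') as f:
    vocab_size, dim = f.readline().split()
    print('vocab size:', vocab_size, '| dimensions:', dim)
    print('first word in file:', f.readline().split()[0])
```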
The method `gensim.models.wrappers.fasttext.FastText.load_fasttext_format()`, which also brings in the n-grams used for OOV word-vector synthesis, may be of interest too... but I'm not sure it's yet doing the right thing in the released gensim, as compared to PR-in-progress #1341.
@jayantj @prakhar2b wdyt?
@gojomo yes, `KeyedVectors.load_word2vec_format()` will definitely work here, and `binary=False` is the default parameter anyway.

As for OOV word synthesis, what do you mean by "not sure if it's yet doing the right thing in the released gensim"? I think for OOV we need n-gram information, which is provided in the `.bin` file.
As of now, `gensim.models.wrappers.fasttext.FastText.load_fasttext_format()` is used to load the complete model for this purpose, using both the `.vec` and `.bin` files. With PR #1341, we will need only the `.bin` file; all other functionality will remain the same, I believe.

cc @jayantj @menshikh-iv
Yes, with the `.bin` AND the `.vec` file, you can load the complete model using -
```python
from gensim.models.wrappers.fasttext import FastText

model = FastText.load_fasttext_format('/path/to/model')  # without the .bin/.vec extension
```
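Once the full model is loaded, a vector can be synthesized even for an out-of-vocab word from its character n-grams. A hedged sketch (gensim 2.x-era wrapper API; the query word is an arbitrary example):

```python
# 'hellooo' is presumably not in the vocabulary; the full model still
# returns a vector, built from the word's character n-grams.
oov_vector = model['hellooo']
print(oov_vector.shape)
```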
With the `.vec` file, you can load only the word vectors (and not the out-of-vocab word information) using -
```python
from gensim.models.keyedvectors import KeyedVectors

model = KeyedVectors.load_word2vec_format('/path/to/model.vec')  # with the .vec extension
```
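With only the `.vec` file loaded, in-vocab lookups and similarity queries work, but OOV lookups fail, since the n-gram information lives in the `.bin` file. A quick illustration (query words are arbitrary examples):

```python
print(model['word'][:5])           # fine: 'word' is in the vocabulary
print(model.most_similar('word'))  # similarity queries also work
model['definitelynotavocabword']   # raises KeyError: no n-gram fallback
```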
@jayantj Thanks, let me try first with `load_fasttext_format` and the `FastText` wrapper.
@prakhar2b My "not sure" comment was regarding some discussion I saw on another issue or PR in progress, perhaps the one that's also discussing whether discarding untrained n-grams is a necessary optimization – I had the impression our calculation might be diverging from the original FB fastText on some (perhaps just OOV) words. (And even if that's defensible, because the untrained n-grams are still just random vectors, it might not be the 'right thing' overall, because it may violate the user expectation that, whether loaded into the original FT code or the gensim FT code, OOV words get the same vectors from the same loaded model.)
We definitely want to follow whatever the original FT does -- the path of least surprise for anyone migrating / trying both.
I get this error while loading `wiki.en.vec` from the FastText pre-trained word vectors. See here for this model. Loaded with:

I'm using: