piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

cPickle.UnpicklingError: unpickling stack underflow #1447

Open loretoparisi opened 7 years ago

loretoparisi commented 7 years ago

I get this error while loading wiki.en.vec from the FastText pre-trained Word2Vec model. See here for this model.

2017-06-23 16:41:40,834 : INFO : loading Word2Vec object from /Volumes/Dataset/word2vec/wiki.en/wiki.en.vec
Traceback (most recent call last):
  File "loadlyricsmodel.py", line 45, in <module>
    model = Word2Vec.load( model_filepath )
  File "/Users/loretoparisi/Documents/Projects/word2vec/.env/lib/python2.7/site-packages/gensim/models/word2vec.py", line 1382, in load
    model = super(Word2Vec, cls).load(*args, **kwargs)
  File "/Users/loretoparisi/Documents/Projects/word2vec/.env/lib/python2.7/site-packages/gensim/utils.py", line 271, in load
    obj = unpickle(fname)
  File "/Users/loretoparisi/Documents/Projects/word2vec/.env/lib/python2.7/site-packages/gensim/utils.py", line 935, in unpickle
    return _pickle.loads(f.read())
cPickle.UnpicklingError: unpickling stack underflow

loaded with

model = Word2Vec.load( model_filepath )

I'm using

gensim-2.2.0-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl

gojomo commented 7 years ago

Word2Vec.load() only loads models saved from gensim. (It uses Python pickling.)
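To see why this particular error appears (a sketch, not from the thread, using made-up toy bytes rather than the real wiki.en.vec header): a `.vec` file begins with a plain-text header line, and when pickle interprets those bytes, the leading digit happens to be the `DUP` opcode, which pops from an empty stack:

```python
# Hedged demonstration: bytes from a plain-text .vec-style header are not
# a pickle stream. Interpreting the digit '2' as the pickle DUP opcode on
# an empty stack triggers the "unpickling stack underflow" error (the
# thread shows Python 2's cPickle; Python 3's pickle behaves the same way).
import pickle

try:
    pickle.loads(b"2 3\nhello 0.1 0.2 0.3\n")
except pickle.UnpicklingError as e:
    print(e)  # unpickling stack underflow
```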

I believe that .vec file is in the format used by the original Google word2vec.c (and now FastText) for its top-level vectors, so KeyedVectors.load_word2vec_format() may work, perhaps with a binary=False parameter.

The method gensim.models.wrappers.fasttext.FastText.load_fasttext_format(), which brings in ngrams for OOV word vector synthesis, may be of interest too... but I'm not sure it's yet doing the right thing in the released gensim, as compared to the PR in progress, #1341.

menshikh-iv commented 7 years ago

@jayantj @prakhar2b wdyt?

prakhar2b commented 7 years ago

@gojomo yes, KeyedVectors.load_word2vec_format() will definitely work here; also, binary=False is the default parameter.

As for OOV word synthesis, what do you mean by "not sure if it's yet doing the right thing in the released gensim"? I think for OOV we need the n-gram information, which is provided in the .bin file.

As of now, gensim.models.wrappers.fasttext.FastText.load_fasttext_format() is used to load the complete model for this purpose, using both the .vec and .bin files. With PR #1341, we will need only the .bin file; all other functionality will remain the same, I believe.

cc @jayantj @menshikh-iv

jayantj commented 7 years ago

Yes, with the .bin AND the .vec file, you can load the complete model using -

from gensim.models.wrappers.fasttext import FastText
model = FastText.load_fasttext_format('/path/to/model')  # without the .bin/.vec extension

With the .vec file, you can load only the word vectors (and not the out-of-vocab word information) using -

from gensim.models.keyedvectors import KeyedVectors
model = KeyedVectors.load_word2vec_format('/path/to/model.vec')  # with the .vec extension

loretoparisi commented 7 years ago

@jayantj Thanks, let me try first with load_fasttext_format and the FastText wrapper.

gojomo commented 7 years ago

@prakhar2b My "not sure" comment was regarding some discussion I saw on another issue or PR in progress, perhaps the one that's also debating whether discarding untrained ngrams is a necessary optimization. I had the impression our calculation might diverge from the original FB fastText on some (perhaps just OOV) words. And even if that's defensible, because the untrained ngrams are still just random vectors, it might not be the 'right thing' overall: it may violate the user expectation that the same loaded model yields the same OOV vectors whether it is loaded into the original FT code or into gensim's FT code.

piskvorky commented 7 years ago

We definitely want to follow whatever the original FT does -- the path of least surprise for anyone migrating / trying both.