AssertionError: expected to reach EOF when loading full FastText model

nshaud commented 5 years ago

Problem description

I am trying to fine-tune a pretrained FastText model using gensim. I use the weights from the official Facebook implementation. Partial loading works fine, but loading the full model raises an AssertionError.

Steps/code/corpus to reproduce

import gensim
model = gensim.models.FastText.load_fasttext_format('cc.en.300.bin', full_model=True)

results in

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-16-1896fcc1d1cb> in <module>
----> 1 model = gensim.models.FastText.load_fasttext_format('cc.en.300.bin', full_model=True)

~/.anaconda3/envs/qs3.6/lib/python3.6/site-packages/gensim/models/fasttext.py in load_fasttext_format(cls, model_file, encoding, full_model)
   1012 
   1013         """
-> 1014         return _load_fasttext_format(model_file, encoding=encoding, full_model=full_model)
   1015 
   1016     def load_binary_data(self, encoding='utf8'):

~/.anaconda3/envs/qs3.6/lib/python3.6/site-packages/gensim/models/fasttext.py in _load_fasttext_format(model_file, encoding, full_model)
   1246         model_file += '.bin'
   1247     with smart_open(model_file, 'rb') as fin:
-> 1248         m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model)
   1249 
   1250     model = FastText(

~/.anaconda3/envs/qs3.6/lib/python3.6/site-packages/gensim/models/_fasttext_bin.py in load(fin, encoding, full_model)
    264     else:
    265         hidden_output = _load_matrix(fin, new_format=new_format)
--> 266         assert fin.read() == b'', 'expected to reach EOF'
    267 
    268     model.update(vectors_ngrams=vectors_ngrams, hidden_output=hidden_output)

AssertionError: expected to reach EOF

Versions

Please provide the output of:

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)

Linux-4.4.0-139-generic-x86_64-with-debian-stretch-sid
Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) 
[GCC 7.3.0]
NumPy 1.16.2
SciPy 1.2.1
gensim 3.7.1
FAST_VERSION 1

nshaud commented 5 years ago

It still fails with the fasttext.load_facebook_model method. However, it works with the French embeddings:

import gensim
model = gensim.models.fasttext.load_facebook_model('/data/cc.fr.300.bin') 
model.wv['test']
# array([ 0.03151339, -0.04408491, ... 0.0188015 ,  0.032352  ], dtype=float32)

It also works with the Wikipedia English embeddings (wiki.en.bin). Does this mean that there is something wrong with the format of cc.en.300.bin?

mpenkov commented 5 years ago

Thank you for reporting this. Could you provide full URLs to the models you are using, so I can try to reproduce this?

nshaud commented 5 years ago

Here are all the models I mentioned:

cc.en.300.bin = FastText English CommonCrawl
cc.fr.300.bin = same in French
wiki.en.bin = Wiki Word Vectors from FastText

mpenkov commented 5 years ago

I think gensim 3.7.2 already fixed this problem. Could you please double check?

(372.env) mpenkov@hetrad2:~/data/2435$ pip freeze | grep gensim
gensim==3.7.2
(372.env) mpenkov@hetrad2:~/data/2435$ cat bug.py
import gensim.models.fasttext
vector = gensim.models.fasttext.load_facebook_vectors('../cc.en.300.bin') 
print(vector)
model = gensim.models.fasttext.load_facebook_model('../cc.en.300.bin') 
print(model)
(372.env) mpenkov@hetrad2:~/data/2435$ python bug.py 
<gensim.models.keyedvectors.FastTextKeyedVectors object at 0x7f815e2005c0>
FastText(vocab=2000000, size=300, alpha=0.025)
(372.env) mpenkov@hetrad2:~/data/2435$

nshaud commented 5 years ago

I tried again with gensim 3.7.2 after re-downloading the model file from Facebook's FastText page, and it now works. The MD5 checksums of the old and new files differ, so I guess a corrupted model file was the problem.
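
For anyone hitting the same error, here is a minimal sketch of how two copies of a model file can be compared by MD5 before blaming the loader. The helper names file_md5 and same_file_contents are my own and are not part of gensim. The file is read in chunks because these .bin files are several gigabytes.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fin:
        # iter() with a sentinel stops once read() returns b"" at EOF.
        for chunk in iter(lambda: fin.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file_contents(path_a, path_b):
    """Return True if both files have the same MD5 digest."""
    return file_md5(path_a) == file_md5(path_b)
```

Calling same_file_contents('old/cc.en.300.bin', 'new/cc.en.300.bin') (both paths are placeholders) would return False for a truncated or corrupted download. A mismatch only shows that the two copies differ, not which one is intact, so re-downloading and loading again is still the actual check.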
