piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Overflow error after unicode errors when loading a 'large' model built with gensim #2950

Open svenski opened 4 years ago

svenski commented 4 years ago

Problem description

What are you trying to achieve? I am loading a fasttext model built with gensim, using gensim.models.fasttext.load_facebook_model so I can use the model.

What is the expected result? The model loads correctly.

What are you seeing instead? Overflow error, preceded by unicode parsing errors.

Steps/code/corpus to reproduce

I get an overflow error when I try to load a fasttext model which I built with gensim. I first tried with version 3.8.3, then rebuilt and reloaded with the head of the 4.0.0-dev code as of yesterday. I cannot provide a reproducible example because I cannot share the corpus.

Here is the stack trace:

  In [21]: ft = load_facebook_model('data/interim/ft_model.bin')
  2020-09-16 15:59:59,526 : MainThread : INFO : loading 582693 words for fastText model from data/interim/ft_model.bin
  2020-09-16 15:59:59,626 : MainThread : ERROR : failed to decode invalid unicode bytes b'\x8a\x08'; replacing invalid characters, using '\\x8a\x08'
  2020-09-16 15:59:59,684 : MainThread : ERROR : failed to decode invalid unicode bytes b'\xb0\x03'; replacing invalid characters, using '\\xb0\x03'
  2020-09-16 15:59:59,775 : MainThread : ERROR : failed to decode invalid unicode bytes b'\xb5\x01'; replacing invalid characters, using '\\xb5\x01'
  2020-09-16 15:59:59,801 : MainThread : ERROR : failed to decode invalid unicode bytes b'\x99\xe9\xa2\x9d'; replacing invalid characters, using '\\x99额'
  ---------------------------------------------------------------------------
  OverflowError                             Traceback (most recent call last)
  <ipython-input-21-3b4a7ad71a41> in <module>
  ----> 1 ft = load_facebook_model('data/interim/ft_model.bin')

  /m/virtualenvs/<snip>/lib/python3.6/site-packages/gensim/models/fasttext.py in load_facebook_model(path, encoding)
     1140
     1141     """
  -> 1142     return _load_fasttext_format(path, encoding=encoding, full_model=True)
     1143
     1144

  /m/virtualenvs/<snip>/lib/python3.6/site-packages/gensim/models/fasttext.py in _load_fasttext_format(model_file, encoding, full_model)
     1220     """
     1221     with gensim.utils.open(model_file, 'rb') as fin:
  -> 1222         m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model)
     1223
     1224     model = FastText(

  /m/virtualenvs/<snip>/python3.6/site-packages/gensim/models/_fasttext_bin.py in load(fin, encoding, full_model)
      342     model.update(raw_vocab=raw_vocab, vocab_size=vocab_size, nwords=nwords, ntokens=ntokens)
      343
  --> 344     vectors_ngrams = _load_matrix(fin, new_format=new_format)
      345
      346     if not full_model:

  /m/virtualenvs/<snip>/lib/python3.6/site-packages/gensim/models/_fasttext_bin.py in _load_matrix(fin, new_format)
      276         matrix = _fromfile(fin, _FLOAT_DTYPE, count)
      277     else:
  --> 278         matrix = np.fromfile(fin, _FLOAT_DTYPE, count)
      279
      280     assert matrix.shape == (count,), 'expected (%r,),  got %r' % (count, matrix.shape)

  OverflowError: Python int too large to convert to C ssize_t

The counts of the erroneous words are also off the scale:

  In [41]: raw_vocab['\\x8a\x08']
  Out[41]: 7088947288457871360

  In [42]: raw_vocab['\\xb0\x03']
  Out[42]: 3774297962713186304

  In [43]: raw_vocab['\\xb5\x01']
  Out[43]: 7092324988178399232

I saw that there were many changes from int to long long in both 3.8.3 and 4.0.0-dev, so my hypothesis was that the problem would be resolved by updating, but I got the same error.

I don't know if this is sufficient information to go on in order to pin it down; please let me know if I can help with more information.

Versions

Please provide the output of:

>>> import platform; print(platform.platform())
Linux-2.6.32-754.3.5.el6.x86_64-x86_64-with-centos-6.10-Final
>>> import sys; print("Python", sys.version)
Python 3.6.10 (default, Jul  8 2020, 16:15:16) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-23)]
>>> import struct; print("Bits", 8 * struct.calcsize("P"))
Bits 64
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.19.2
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.5.2
>>> import gensim; print("gensim", gensim.__version__)
gensim 3.8.3
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 1
gojomo commented 4 years ago

You say that the model was created with Gensim; how was it initially saved? If saved via ft_model.save(), you should load with FastText.load().
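
To illustrate the two matched pairs (a toy sketch; the corpus and paths are placeholders):

  from gensim.models import FastText
  from gensim.models.fasttext import load_facebook_model, save_facebook_model

  # A throwaway model, just to have something to save.
  # (`size` is spelled `vector_size` in 4.0.0-dev.)
  ft_model = FastText([['hello', 'world'], ['hello', 'gensim']], size=8, min_count=1)

  # Pair 1: gensim's own pickle-based format.
  ft_model.save('ft_model.gensim')
  model = FastText.load('ft_model.gensim')

  # Pair 2: the Facebook-tools-native binary format.
  save_facebook_model(ft_model, 'ft_model.bin')
  model = load_facebook_model('ft_model.bin')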

If you saved it via Gensim's save_facebook_model(), it should be in the Facebook-tools-native format readable by .load_facebook_model() - but such full save/load capability is a brand-new feature, which might still have issues. In such a case, it'd be useful to know if:

svenski commented 4 years ago
gojomo commented 4 years ago
  • I have not seen a case where I have the unicode error but not the overflow error, or vice versa. But I've only tried on a partial corpus of 4.6 GB. I'll try to trigger the unicode error using similar-sized chunks. If I understood you correctly, in order to test for the unicode error, your hypothesis is that I should only need to do a FastText(...).build_vocab(LineSentence(input_file)) followed by a save_facebook_model and then load_facebook_model -- is that right? It would save a lot of time not doing a training loop!

Yes - as training ultimately only adjusts the numbers in the already-allocated arrays, it shouldn't be implicated in any save/load errors triggered by strings/string-encodings/model-extent/etc. (If correctness/usefulness of results was implicated, or it was some error like a crash only triggered during training, that'd be different.)
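
Concretely, something like this (a sketch; input_file stands in for your real corpus):

  from gensim.models import FastText
  from gensim.models.word2vec import LineSentence
  from gensim.models.fasttext import load_facebook_model, save_facebook_model

  model = FastText(size=100, min_count=1)          # size= is vector_size= in 4.0.0-dev
  model.build_vocab(LineSentence('input_file'))    # allocates vocab & ngram buckets, no training
  save_facebook_model(model, 'roundtrip.bin')
  reloaded = load_facebook_model('roundtrip.bin')  # does this alone reproduce the errors?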

If size is the key factor, then a large synthetic corpus generated by a tiny amount of code (similar to this one in a previous issue) may be sufficient to trigger it. (Not necessarily a corpus_file-based corpus, unless that's already what you're using.)
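
For instance (a hypothetical generator; tune the vocabulary size and line count until the saved model is comparably large):

  import random

  def synthetic_corpus(path, n_lines=2000000, vocab=600000, words_per_line=20):
      # Fake whitespace-tokenized text; ~600k distinct tokens roughly mirrors
      # the 582693 words reported in the log above.
      rng = random.Random(0)
      with open(path, 'w') as out:
          for _ in range(n_lines):
              out.write(' '.join('w%d' % rng.randrange(vocab) for _ in range(words_per_line)))
              out.write('\n')

  synthetic_corpus('synthetic.txt')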

Meanwhile, if the unicode errors have anything to do with the actual text content, they might be triggerable with a toy-sized corpus of a few tokens using the same text. Similarly, with the unicode errors, it'd be interesting to take any suspect corpus and try: (1) train/save in original FB fasttext but load in gensim; and (2) train/save in gensim & then load/test in FB fasttext.
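
For direction (1), that's just the FB CLI plus load_facebook_model (a sketch; paths and the training mode are placeholders):

  # (1) Train/save with Facebook's fastText command-line tool...
  #     $ fasttext skipgram -input suspect_corpus.txt -output fb_model
  # ...then try loading the result in gensim:
  from gensim.models.fasttext import load_facebook_model
  model = load_facebook_model('fb_model.bin')

  # (2) The reverse: save_facebook_model(...) in gensim, then probe the
  #     resulting .bin with the FB tools, e.g. `fasttext nn gensim_model.bin`.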