piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Overflow error after unicode errors when loading a 'large' model built with gensim #2950

Open svenski opened 4 years ago

svenski commented 4 years ago

Problem description

What are you trying to achieve? I am loading a fasttext model built with gensim, using gensim.models.fasttext.load_facebook_model so I can use the model.

What is the expected result? The model loads correctly.

What are you seeing instead? Overflow error, preceded by unicode parsing errors.

Steps/code/corpus to reproduce

I get an overflow error when I try to load a fasttext model which I built with gensim. I first tried with version 3.8.3, then rebuilt and reloaded with the head of the 4.0.0-dev code as of yesterday. I cannot provide a reproducible example because I cannot share the corpus.

Here is the stack trace:

  In [21]: ft = load_facebook_model('data/interim/ft_model.bin')
  2020-09-16 15:59:59,526 : MainThread : INFO : loading 582693 words for fastText model from data/interim/ft_model.bin
  2020-09-16 15:59:59,626 : MainThread : ERROR : failed to decode invalid unicode bytes b'\x8a\x08'; replacing invalid characters, using '\\x8a\x08'
  2020-09-16 15:59:59,684 : MainThread : ERROR : failed to decode invalid unicode bytes b'\xb0\x03'; replacing invalid characters, using '\\xb0\x03'
  2020-09-16 15:59:59,775 : MainThread : ERROR : failed to decode invalid unicode bytes b'\xb5\x01'; replacing invalid characters, using '\\xb5\x01'
  2020-09-16 15:59:59,801 : MainThread : ERROR : failed to decode invalid unicode bytes b'\x99\xe9\xa2\x9d'; replacing invalid characters, using '\\x99额'
  ---------------------------------------------------------------------------
  OverflowError                             Traceback (most recent call last)
  <ipython-input-21-3b4a7ad71a41> in <module>
  ----> 1 ft = load_facebook_model('data/interim/ft_model.bin')

  /m/virtualenvs/<snip>/lib/python3.6/site-packages/gensim/models/fasttext.py in load_facebook_model(path, encoding)
     1140
     1141     """
  -> 1142     return _load_fasttext_format(path, encoding=encoding, full_model=True)
     1143
     1144

  /m/virtualenvs/<snip>/lib/python3.6/site-packages/gensim/models/fasttext.py in _load_fasttext_format(model_file, encoding, full_model)
     1220     """
     1221     with gensim.utils.open(model_file, 'rb') as fin:
  -> 1222         m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model)
     1223
     1224     model = FastText(

  /m/virtualenvs/<snip>/python3.6/site-packages/gensim/models/_fasttext_bin.py in load(fin, encoding, full_model)
      342     model.update(raw_vocab=raw_vocab, vocab_size=vocab_size, nwords=nwords, ntokens=ntokens)
      343
  --> 344     vectors_ngrams = _load_matrix(fin, new_format=new_format)
      345
      346     if not full_model:

  /m/virtualenvs/<snip>/lib/python3.6/site-packages/gensim/models/_fasttext_bin.py in _load_matrix(fin, new_format)
      276         matrix = _fromfile(fin, _FLOAT_DTYPE, count)
      277     else:
  --> 278         matrix = np.fromfile(fin, _FLOAT_DTYPE, count)
      279
      280     assert matrix.shape == (count,), 'expected (%r,),  got %r' % (count, matrix.shape)

  OverflowError: Python int too large to convert to C ssize_t

The counts of the erroneous words are also off the scale:

  In [41]: raw_vocab['\\x8a\x08']
  Out[41]: 7088947288457871360

  In [42]: raw_vocab['\\xb0\x03']
  Out[42]: 3774297962713186304

  In [43]: raw_vocab['\\xb5\x01']
  Out[43]: 7092324988178399232

I saw that there were many changes from int to long long in both 3.8.3 and 4.0.0-dev, so my hypothesis was that the problem would be resolved by updating, but I got the same error.

I don't know if this is sufficient information to go on in order to pin it down; please let me know if I can help with more information.

Versions

Please provide the output of:

>>> import platform; print(platform.platform())
Linux-2.6.32-754.3.5.el6.x86_64-x86_64-with-centos-6.10-Final
>>> import sys; print("Python", sys.version)
Python 3.6.10 (default, Jul  8 2020, 16:15:16) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-23)]
>>> import struct; print("Bits", 8 * struct.calcsize("P"))
Bits 64
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.19.2
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.5.2
>>> import gensim; print("gensim", gensim.__version__)
gensim 3.8.3
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 1
gojomo commented 4 years ago

You say that the model was created with Gensim; how was it initially saved? If saved via ft_model.save(), you should load with FastText.load().
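
To illustrate the two matched pairs (a toy sketch; the corpus and paths are placeholders):

  from gensim.models import FastText
  from gensim.models.fasttext import load_facebook_model, save_facebook_model

  # A throwaway model, just to have something to save.
  # (`size` is spelled `vector_size` in 4.0.0-dev.)
  ft_model = FastText([['hello', 'world'], ['hello', 'gensim']], size=8, min_count=1)

  # Pair 1: gensim's own pickle-based format.
  ft_model.save('ft_model.gensim')
  model = FastText.load('ft_model.gensim')

  # Pair 2: the Facebook-tools-native binary format.
  save_facebook_model(ft_model, 'ft_model.bin')
  model = load_facebook_model('ft_model.bin')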

If you saved it via Gensim's save_facebook_model(), it should be in the Facebook-tools-native format readable by .load_facebook_model() - but such full save/load capability is a brand-new feature, which might still have issues. In such a case, it'd be useful to know if:

svenski commented 4 years ago
gojomo commented 4 years ago
  • I have not seen a case where I have the unicode error but not the overflow error, or vice versa. But I've only tried on a partial corpus of 4.6 GB. I'll try to trigger the unicode error using similar-sized chunks. If I understood you correctly, in order to test for the unicode error, your hypothesis is that I should only need to do a FastText(...).build_vocab(LineSentence(input_file)) followed by a save_facebook_model and then load_facebook_model -- is that right? It would save a lot of time not doing a training loop!

Yes - as training ultimately only adjusts the numbers in the already-allocated arrays, it shouldn't be implicated in any save/load errors triggered by strings/string-encodings/model-extent/etc. (If correctness/usefulness of results was implicated, or it was some error like a crash only triggered during training, that'd be different.)
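
Concretely, something like this (a sketch; input_file stands in for your real corpus):

  from gensim.models import FastText
  from gensim.models.word2vec import LineSentence
  from gensim.models.fasttext import load_facebook_model, save_facebook_model

  model = FastText(size=100, min_count=1)          # size= is vector_size= in 4.0.0-dev
  model.build_vocab(LineSentence('input_file'))    # allocates vocab & ngram buckets, no training
  save_facebook_model(model, 'roundtrip.bin')
  reloaded = load_facebook_model('roundtrip.bin')  # does this alone reproduce the errors?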

If size is the key factor, then a large synthetic corpus generated by a tiny amount of code (similar to this one in a previous issue) may be sufficient to trigger it. (Not necessarily a corpus_file-based corpus, unless that's already what you're using.)
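
For instance (a hypothetical generator; tune the vocabulary size and line count until the saved model is comparably large):

  import random

  def synthetic_corpus(path, n_lines=2000000, vocab=600000, words_per_line=20):
      # Fake whitespace-tokenized text; ~600k distinct tokens roughly mirrors
      # the 582693 words reported in the log above.
      rng = random.Random(0)
      with open(path, 'w') as out:
          for _ in range(n_lines):
              out.write(' '.join('w%d' % rng.randrange(vocab) for _ in range(words_per_line)))
              out.write('\n')

  synthetic_corpus('synthetic.txt')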

Meanwhile, if the unicode errors have anything to do with the actual text content, they might be triggerable with a toy-sized corpus of a few tokens using the same text. Similarly, with the unicode errors, it'd be interesting to take any suspect corpus and try: (1) train/save in original FB fasttext but load in gensim; and (2) train/save in gensim & then load/test in FB fasttext.
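
For direction (1), that's just the FB CLI plus load_facebook_model (a sketch; paths and the training mode are placeholders):

  # (1) Train/save with Facebook's fastText command-line tool...
  #     $ fasttext skipgram -input suspect_corpus.txt -output fb_model
  # ...then try loading the result in gensim:
  from gensim.models.fasttext import load_facebook_model
  model = load_facebook_model('fb_model.bin')

  # (2) The reverse: save_facebook_model(...) in gensim, then probe the
  #     resulting .bin with the FB tools, e.g. `fasttext nn gensim_model.bin`.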