Open svenski opened 4 years ago
You say that the model was created with Gensim; how was it initially saved? If saved via `ft_model.save()`, you should load with `FastText.load()`.
If you saved it via Gensim's `save_facebook_model()`, it should be in the Facebook-tools-native format readable by `.load_facebook_model()` - but such full save/load capability is a brand-new feature, which might still have issues. In such a case, it'd be useful to know if the error can be triggered by `gensim.models.fasttext.save_facebook_model` and a reload alone, without any training. (After the `build_vocab()` step the model already has all the strings/allocations needed for a save/reload.) And for the overflow, is there a specific (dimensionality x unique-word-count) threshold that always triggers the error, no matter the training data?
- I have not seen a case where I have the unicode error but not the overflow error, or vice versa. But I've only tried on a partial corpus of 4.6 GB. I'll try to trigger the unicode error using similar-sized chunks. If I understood you correctly, in order to test for the unicode error, your hypothesis is that I should only need to do a `FastText(...).build_vocab(LineSentence(input_file))` followed by a `save_facebook_model` and then a `load_facebook_model` -- is that right? It would save a lot of time not doing a training loop!
Yes - as training ultimately only adjusts the numbers in the already-allocated arrays, it shouldn't be implicated in any save/load errors triggered by strings/string-encodings/model-extent/etc. (If correctness/usefulness of results was implicated, or it was some error like a crash only triggered during training, that'd be different.)
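Under that reasoning, the no-training check might look roughly like this (a sketch only: file names and parameters are placeholders, and the gensim part is guarded since gensim is third-party and its keyword changed between 3.x, `size=`, and 4.x, `vector_size=`):

```python
# Sketch of the no-training repro: build_vocab + save_facebook_model + load_facebook_model.
# The toy corpus is written with the stdlib first, so the script degrades
# gracefully when gensim is not installed.
corpus_path = "toy_corpus.txt"
with open(corpus_path, "w", encoding="utf-8") as f:
    for _ in range(100):
        f.write("the quick brown fox jumps over the lazy dog\n")

try:
    from gensim.models import FastText
    from gensim.models.fasttext import load_facebook_model, save_facebook_model
    from gensim.models.word2vec import LineSentence

    model = FastText(vector_size=100, min_count=1)   # gensim 4.x; use size= on 3.x
    model.build_vocab(LineSentence(corpus_path))     # allocates all strings/arrays
    save_facebook_model(model, "untrained.bin")      # no train() call in between
    reloaded = load_facebook_model("untrained.bin")  # does the overflow appear here?
    print("reload ok, vocab size:", len(reloaded.wv))
except ImportError:
    print("gensim not installed; toy corpus written to", corpus_path)
```

If this minimal round-trip already overflows, training is ruled out as a factor.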
If size is the key factor, then a large synthetic corpus generated by a tiny amount of code (similar to the one in a previous issue) may be sufficient to trigger it. (Not necessarily a `corpus_file`-based corpus, unless that's already what you're using.)
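A tiny generator along those lines could look like the following (a sketch; the token scheme and counts are arbitrary placeholders, not the code from the previous issue):

```python
# Sketch: build a synthetic corpus with a large unique-word count, to probe
# whether a (dimensionality x unique-word-count) size threshold alone triggers
# the overflow, independent of any real text.
def synthetic_corpus(path, n_unique=1000, words_per_line=10):
    """Write n_unique distinct tokens ("w0", "w1", ...) onto fixed-width lines."""
    with open(path, "w", encoding="utf-8") as f:
        line = []
        for i in range(n_unique):
            line.append("w%d" % i)
            if len(line) == words_per_line:
                f.write(" ".join(line) + "\n")
                line = []
        if line:  # flush a final partial line
            f.write(" ".join(line) + "\n")

# Small demo run; scale n_unique up (e.g. into the millions) to stress save/load.
synthetic_corpus("synthetic.txt", n_unique=1000, words_per_line=10)
```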
Meanwhile, if the unicode errors have anything to do with the actual text content, they might be triggerable with a toy-sized corpus of a few tokens using the same text. Similarly, with the unicode errors, it'd be interesting to take any suspect corpus and try: (1) train/save in original FB `fasttext` but load in gensim; and (2) train/save in gensim & then load/test in FB `fasttext`.
Problem description
What are you trying to achieve? I am loading a `fasttext` model built with `gensim`, using `gensim.models.fasttext.load_facebook_model`, so I can use the model.
What is the expected result? The model loads correctly.
What are you seeing instead? Overflow error, preceded by unicode parsing errors.
Steps/code/corpus to reproduce
I get an overflow error when I try to load a `fasttext` model which I built with `gensim`. I tried with version 3.8.3, and then rebuilt and loaded with the head of the 4.0.0-dev code as of yesterday. It's not reproducible because I cannot share the corpus. Here is the stack trace:

The `count` variable is calculated as `count = num_vectors * dim`. Both of these are astronomical at 10^23; `dim` should be 100, so there must be some unpacking problem here already. The unpacking of the model params before the vocab looks ok. The model loads with the original `fasttext` module, so I have a workaround. The counts of the erroneous words are also off the scale:
I saw that there were many changes from `int` to `long long`, both in 3.8.3 and in 4.0.0-dev, so my hypothesis was that the problem would be resolved by updating, but I got the same error. I don't know if this is sufficient information to pin it down; please let me know if I can help with more information.
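That int-vs-long-long hypothesis matches a classic header-unpacking failure mode: reading 8 bytes that actually hold two 32-bit fields as a single 64-bit integer (or the reverse) yields astronomical counts. A stdlib-only illustration of the failure class (not gensim's actual parsing code; the field names are hypothetical):

```python
import struct

# Two plausible 32-bit header fields: a word count and a dimensionality.
num_words, dim = 2_000_000, 100
buf = struct.pack("<2i", num_words, dim)  # written as two little-endian 32-bit ints

# Correct read: two ints come back unchanged.
print(struct.unpack("<2i", buf))          # (2000000, 100)

# Mismatched read: one 64-bit long long swallows both fields, producing a huge
# bogus value -- the same *class* of error as the astronomical counts reported
# above, though not necessarily the exact mechanism in gensim.
(bogus,) = struct.unpack("<q", buf)
print(bogus)                              # 2000000 + (100 << 32)
```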
Versions
Please provide the output of:
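The version-report snippet that normally follows here is missing from the capture; a stdlib-guarded sketch of the same information (numpy/scipy/gensim are third-party and may be absent, so they are probed in a loop rather than imported directly) would be:

```python
# Report platform, Python, and interpreter word size, then probe for the
# third-party packages whose versions a gensim bug report usually needs.
import platform
import struct
import sys

print(platform.platform())
print("Python", sys.version)
bits = 8 * struct.calcsize("P")  # pointer size -> 32- or 64-bit interpreter
print("Bits", bits)

for name in ("numpy", "scipy", "gensim"):
    try:
        module = __import__(name)
        print(name, module.__version__)
    except ImportError:
        print(name, "not installed")
```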