Closed by menshikh-iv 5 years ago
Thanks for the report @menshikh-iv !
Please don't assign people to tickets, we may have different priorities. If you feel something is urgent, feel free to open a PR with a fix for review.
@piskvorky
Please don't assign people to tickets, we may have different priorities.
Ok, no problem. BTW, what are the open-source priorities right now? Any plan/roadmap for this year (maybe I missed something)?
Clean up + tightening + docs + web. Nothing major, very little capacity. We're still discussing priorities and concrete objectives (also for grants).
Was this seen in develop or latest-release? Because I think @mpenkov's latest fixes (including in #2370) may have finally corrected this behavior to match FB FT.
With develop, I get this:
(devel.env) mpenkov@hetrad2:~/data/2415$ python bug.py
INFO:gensim.models._fasttext_bin:loading 2000000 words for fastText model from cc.en.300.bin
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.word2vec:Updating model with new vocabulary
INFO:gensim.models.word2vec:New added 2000000 unique words (50% of original 4000000) and increased the count of 2000000 pre-existing words (50% of original 4000000)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 2000000 items
INFO:gensim.models.word2vec:sample=1e-05 downsamples 6996 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 390315457935 word corpus (70.7% of prior 552001338161)
INFO:gensim.models.fasttext:loaded (4000000, 300) weight matrix for fastText model from cc.en.300.bin
/home/mpenkov/git/gensim/gensim/models/keyedvectors.py:2103: RuntimeWarning: invalid value encountered in true_divide
return word_vec / len(ngram_hashes)
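To illustrate where that warning comes from, here is a minimal sketch of FastText-style character n-gram extraction (not gensim's actual code; `char_ngrams` is a hypothetical helper, and the padding/`min_n` behavior is assumed from the discussion below):

```python
import numpy as np

def char_ngrams(word, min_n=4, max_n=6):
    # FastText-style: pad the token with '<' and '>' before extracting n-grams
    padded = "<%s>" % word
    return [padded[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(padded) - n + 1)]

print(char_ngrams("word"))  # ['<wor', 'word', 'ord>', '<word', 'word>', '<word>']
print(char_ngrams(" "))     # [] -- '< >' is only 3 chars, so no 4-grams exist

# Averaging the (empty set of) n-gram vectors then divides by zero:
ngrams = char_ngrams(" ")
vecs = np.zeros((len(ngrams), 300), dtype=np.float32)
avg = vecs.sum(axis=0) / len(ngrams)  # 0/0 -> NaNs plus a RuntimeWarning
print(np.isnan(avg).all())            # True
```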
So the bug @menshikh-iv described is fixed, but the fix uncovered a divide-by-zero case, because there can be zero ngrams extracted from a single space character. We have several ways to handle this:
@piskvorky Which one do you think is best?
No idea. What does FB's FT do?
word
word 0.039659 0.094314 0.057308 0.060724 0.026905
word space
word 0.039659 0.094314 0.057308 0.060724 0.026905
space -0.039249 0.032824 0.047025 0.065525 0.055325
^C
The command-line utility ignores the request for a vector for a blank space. So if you say "give me the vector for a blank space", the utility just stares back at you. If you give it a term with spaces, it first splits the term into subterms by spaces, and returns the vectors for each subterm.
OK, thanks. And what does their API (as opposed to CLI) do?
Also: with the default min_n=4, any single-character string, even with the '<' and '>' end-bumpers added to distinguish leading/ending n-grams, will decompose into zero n-grams to look up.
So what does FT CLI do for single-character OOV words with no n-grams to look up?
We should probably do similar for OOV tokens with zero relevant n-grams, like '' (empty string), ' ' (single space), 'j' (single character). (An OOV token like 'qz' would be padded to '<qz>', which would yield one 4-char n-gram '<qz>' capable of being looked up, and get whatever n-gram vector happens to be at that bucket, trained or not.)
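A sketch of which tokens produce lookups under these rules (the n-gram helper and bucket mapping below are illustrative, not gensim's code; the FNV-1a-style hash mirrors FastText's n-gram hashing for ASCII input, where FastText's signed-byte cast makes no difference):

```python
def ngrams(token, min_n=4, max_n=6):
    # FastText-style '<'...'>' padding before n-gram extraction
    padded = "<" + token + ">"
    return [padded[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(padded) - n + 1)]

def ft_bucket(ngram, num_buckets=2_000_000):
    # FNV-1a-style hash mapping an n-gram to a bucket index
    h = 2166136261
    for byte in ngram.encode("utf-8"):
        h = ((h ^ byte) * 16777619) % 2**32
    return h % num_buckets

for token in ("", " ", "j", "qz"):
    grams = ngrams(token)
    print(repr(token), grams, [ft_bucket(g) for g in grams])
# '', ' ' and 'j' yield no n-grams at all; 'qz' yields exactly one: '<qz>'
```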
Regarding roadmap, I found this: https://github.com/RaRe-Technologies/gensim/wiki/Roadmap-(2018) (our planned roadmap for 2018).
Everything in there still stands (not much progress in 2018). Especially the clean up and discoverability.
Let me just rename it to "Roadmap / priorities for 2019".
Fixed by #2411
Problem:
FastText in gensim and the official version still produce different output on an FB pretrained model (an issue with OOV words that have no n-grams).
Prepare data:
Code:
The exception message is correct, but the behaviour is wrong (it should return a zero vector, like the FB implementation, instead of raising an exception). BTW, when we load & use an FB model, we shouldn't raise an exception at all.
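The desired behaviour can be sketched like this (a minimal illustration, not gensim's implementation; `StubModel` and its `ngram_vector` method are hypothetical stand-ins for the real model's n-gram lookup):

```python
import numpy as np

class StubModel:
    """Hypothetical stand-in for a loaded FastText model."""
    vector_size = 5

    def ngram_vector(self, ngram):
        # deterministic fake vector per n-gram, for demonstration only
        rng = np.random.default_rng(abs(hash(ngram)) % 2**32)
        return rng.standard_normal(self.vector_size)

def word_vec(token, model, min_n=4, max_n=6):
    padded = "<" + token + ">"
    grams = [padded[i:i + n]
             for n in range(min_n, max_n + 1)
             for i in range(len(padded) - n + 1)]
    if not grams:
        # FB behaviour: zero vector for OOV tokens with no n-grams, no exception
        return np.zeros(model.vector_size, dtype=np.float32)
    return sum(model.ngram_vector(g) for g in grams) / len(grams)

m = StubModel()
print(word_vec(" ", m))          # all zeros, no crash
print(word_vec("word", m).shape) # (5,)
```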