piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1
15.72k stars 4.38k forks source link

Inference issue using FB pretrained model if word have no ngrams #2415

Closed menshikh-iv closed 5 years ago

menshikh-iv commented 5 years ago

Problem: FastText in gensim and official version still produce different output on FB pretrained model (issue with oov word without ngrams).

Prepare data:

curl https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz --output cc.en.300.bin.gz
gunzip cc.en.300.bin.gz

Code:

from gensim.models import FastText
from fastText import load_model
import numpy as np

fname = "cc.en.300.bin"

fb_model = load_model(fname)
gensim_model = FastText.load_fasttext_format(fname)

word = "`"  # you can't generate subwords from this word

assert len(fb_model.get_subwords(word)[0]) == 0  # really, no subwords

fb_vector = fb_model.get_word_vector("`")  # fb_vector exist, though no subwords and not in vocab
np.testing.assert_allclose(fb_vector, np.zeros(fb_vector.shape[0]))  # and fb vector is zero vector

gensim_vector = gensim_model["`"]  # raise an exception
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-13-243ea594a1c4> in <module>
      8 np.testing.assert_allclose(fb_vector, np.zeros(fb_vector.shape[0]))
      9 
---> 10 gensim_vector = gensim_model["`"]

~/.local/lib/python3.6/site-packages/gensim/utils.py in new_func1(*args, **kwargs)
   1445                     stacklevel=2
   1446                 )
-> 1447                 return func(*args, **kwargs)
   1448 
   1449             return new_func1

~/.local/lib/python3.6/site-packages/gensim/models/fasttext.py in __getitem__(self, words)
    926 
    927         """
--> 928         return self.wv.__getitem__(words)
    929 
    930     @deprecated("Method will be removed in 4.0.0, use self.wv.__contains__() instead")

~/.local/lib/python3.6/site-packages/gensim/models/keyedvectors.py in __getitem__(self, entities)
    345         if isinstance(entities, string_types):
    346             # allow calls like trained_model['office'], as a shorthand for trained_model[['office']]
--> 347             return self.get_vector(entities)
    348 
    349         return vstack([self.get_vector(entity) for entity in entities])

~/.local/lib/python3.6/site-packages/gensim/models/keyedvectors.py in get_vector(self, word)
    463 
    464     def get_vector(self, word):
--> 465         return self.word_vec(word)
    466 
    467     def words_closer_than(self, w1, w2):

~/.local/lib/python3.6/site-packages/gensim/models/keyedvectors.py in word_vec(self, word, use_norm)
   2096                 return word_vec / max(1, ngrams_found)
   2097             else:  # No ngrams of the word are present in self.ngrams
-> 2098                 raise KeyError('all ngrams for word %s absent from model' % word)
   2099 
   2100     def init_sims(self, replace=False):

KeyError: 'all ngrams for word ` absent from model'

Exception is correct, but behaviour is wrong (should return zero vector as FB implementation. instead of raising an exception). BTW - when we load & use FB model - we shouldn't raise an exception at all.

piskvorky commented 5 years ago

Thanks for the report @menshikh-iv !

Please don't assign people to tickets, we may have different priorities. If you feel something is urgent, feel free to open a PR with a fix for review.

menshikh-iv commented 5 years ago

@piskvorky

Please don't assign people to tickets, we may have different priorities.

Ok, no problem. BTW, what're open-source priorities right now? Any plan/roadmap for this year (maybe I miss something?)

piskvorky commented 5 years ago

Clean up + tightening + docs + web. Nothing major, very little capacity. We're still discussing priorities and concrete objectives (also for grants).

gojomo commented 5 years ago

Was this seen in develop or latest-release? Because I think @mpenkov's latest fixes (including in #2370) may have finally corrected this behavior to match FB FT.

mpenkov commented 5 years ago

With develop, I get this:

(devel.env) mpenkov@hetrad2:~/data/2415$ python bug.py 
INFO:gensim.models._fasttext_bin:loading 2000000 words for fastText model from cc.en.300.bin
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.word2vec:Updating model with new vocabulary
INFO:gensim.models.word2vec:New added 2000000 unique words (50% of original 4000000) and increased the count of 2000000 pre-existing words (50% of original 4000000)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 2000000 items
INFO:gensim.models.word2vec:sample=1e-05 downsamples 6996 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 390315457935 word corpus (70.7% of prior 552001338161)
INFO:gensim.models.fasttext:loaded (4000000, 300) weight matrix for fastText model from cc.en.300.bin
/home/mpenkov/git/gensim/gensim/models/keyedvectors.py:2103: RuntimeWarning: invalid value encountered in true_divide
  return word_vec / len(ngram_hashes)

So the bug @menshikh-iv described is fixed, but the fix uncovered a divide-by-zero case, because there can be zero ngrams extracted from a single space character. We have several ways to handle this:

  1. Do nothing
  2. Avoid the division, return the origin vector
  3. Raise an exception

@piskvorky Which one do you think is best?

piskvorky commented 5 years ago

No idea. What does FB's FT do?

mpenkov commented 5 years ago
word
word 0.039659 0.094314 0.057308 0.060724 0.026905 

word space
word 0.039659 0.094314 0.057308 0.060724 0.026905 
space -0.039249 0.032824 0.047025 0.065525 0.055325 
^C

The command-line utility ignores the request for a vector for a blank space. So if you say "give me the vector for a blank space", the utility just stares back at you. If you give it a term with spaces, they first split the term into subterms by spaces, and return the vectors for each subterm.

piskvorky commented 5 years ago

OK, thanks. And what does their API (as opposed to CLI) do?

gojomo commented 5 years ago

Also: with the default min_n=4, any single-character string, even with the '<' and '>' end-bumpers added to distinguish leading/ending n-grams, will decompose into zero n-grams to look up.

So what does FT CLI do for single-character OOV words with no n-grams to look up?

We should probably do similar for OOV tokens with zero relevant n-grams, like '' (empty-string), ' ' (single space), 'j' (single-character). (An OOV token like 'qz' would be padded to '\<qz>', which would yield one 4-char n-gram '\<qz>' capable of being looked up, and get whatever n-gram vector happens to be at that bucket, trained or not.)

piskvorky commented 5 years ago

Regarding roadmap, I found this: https://github.com/RaRe-Technologies/gensim/wiki/Roadmap-(2018) (our planned roadmap for 2018).

Everything in there still stands (not much progress in 2018). Especially the clean up and discoverability.

Let me just rename it to "Roadmap / priorities for 2019".

mpenkov commented 5 years ago

Fixed by #2411