piskvorky / gensim-data

Data repository for pretrained NLP models and NLP corpora.
https://rare-technologies.com/new-api-for-pretrained-nlp-models-and-datasets-in-gensim/
GNU Lesser General Public License v2.1

Pretrained FastText doesn't handle OOV words #34

Closed. lambdaofgod closed this 5 months ago

lambdaofgod commented 5 years ago

Loading FastText using gensim.downloader returns a KeyedVectors object. Why is that? The model name (fasttext-wiki-news-subwords-300) suggests it should be able to use the algorithm's ability to encode OOV (out-of-vocabulary) words, but currently it doesn't.

Also, loading the downloaded model (from the path returned by gensim_data_downloader) using gensim.models.FastText doesn't work.

piskvorky commented 5 years ago

Thanks for reporting. That does sound like a bug to me. CC @mpenkov can you please have a look?

mpenkov commented 5 years ago

I agree that it sounds like a bug.

@lambdaofgod Could you please provide a reproducible example?

lambdaofgod commented 5 years ago
>>> import gensim.downloader
>>> model = gensim.downloader.load('fasttext-wiki-news-subwords-300')
>>> model
<gensim.models.keyedvectors.Word2VecKeyedVectors at 0x7fc5b3964f60>
>>> model.get_vector('dogge')
KeyError: "word 'dogge' not in vocabulary"

That is not what you would expect from a model that uses subword information.
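
For readers unfamiliar with why a subword model is expected to handle 'dogge': FastText-style models compose a vector for an unseen word from the vectors of its character n-grams. The following is only a toy sketch of that idea, not gensim's or fastText's actual implementation. The function names, the `bucket_vectors` table, and the use of `zlib.crc32` as the hash are all assumptions made for illustration (the real library uses a different hash and trained bucket vectors).

```python
import zlib


def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


def oov_vector(word, bucket_vectors, min_n=3, max_n=6):
    """Average the bucket vectors of the word's n-grams (toy version).

    `bucket_vectors` is a hypothetical list of equal-length float lists,
    standing in for the trained n-gram table of a real model.
    """
    grams = char_ngrams(word, min_n, max_n)
    if not grams:
        raise KeyError(f"word {word!r} is too short to have any n-grams")
    dim = len(bucket_vectors[0])
    total = [0.0] * dim
    for gram in grams:
        # crc32 is a stand-in hash chosen for this sketch only.
        bucket = zlib.crc32(gram.encode("utf-8")) % len(bucket_vectors)
        for i, value in enumerate(bucket_vectors[bucket]):
            total[i] += value
    return [t / len(grams) for t in total]


if __name__ == "__main__":
    print(char_ngrams("dogge", 3, 3))
```

A plain word-to-vector lookup table has no such n-gram table to fall back on, which is consistent with the KeyError shown above.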

mpenkov commented 5 years ago

Thank you for providing the reproducible example. Can you please include the full stack trace?

lambdaofgod commented 5 years ago
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-17-c007c0b2c10b> in <module>()
----> 1 fasttext_w2v_format.get_vector('dogge')

1 frames
/usr/local/lib/python3.6/dist-packages/gensim/models/keyedvectors.py in word_vec(self, word, use_norm)
    450             return result
    451         else:
--> 452             raise KeyError("word '%s' not in vocabulary" % word)
    453 
    454     def get_vector(self, word):

KeyError: "word 'dogge' not in vocabulary"

I think it's clear from what I posted before that the model uses the wrong wrapper: it is loaded as gensim.models.keyedvectors.Word2VecKeyedVectors instead of gensim.models.FastText.
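
One possible route to OOV support, sketched here under assumptions rather than as a confirmed fix: if you have a model file in Facebook's native binary format (obtained separately, not the file shipped by this repository), gensim versions that provide `gensim.models.fasttext.load_facebook_vectors` can load it together with its subword information. The wrapper name `load_subword_vectors` and the path in the usage note are hypothetical.

```python
from pathlib import Path


def load_subword_vectors(bin_path):
    """Load vectors, including n-gram information, from a native fastText .bin file.

    Assumes the file is in Facebook's binary format and that the installed
    gensim provides `load_facebook_vectors`.
    """
    path = Path(bin_path)
    if not path.is_file():
        raise FileNotFoundError(f"no fastText binary found at {path}")
    # Imported lazily so the path check above works even without gensim installed.
    from gensim.models.fasttext import load_facebook_vectors
    return load_facebook_vectors(str(path))
```

Usage would look like `vectors = load_subword_vectors("/path/to/model.bin")` followed by `vectors["dogge"]`. Whether the result covers the same vocabulary as fasttext-wiki-news-subwords-300 depends on which binary you load.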

GladiatorX commented 5 years ago

Will it be resolved in future release?

piskvorky commented 5 years ago

Ping @mpenkov -- this is the same issue as on that mailing list (I knew I already saw it somewhere!). Really confusing behaviour.

lambdaofgod commented 5 years ago

@piskvorky @mpenkov could you help me pinpoint the problem? I may be willing to fix it, but for now I don't know where to start, because I don't see what code gets called when the downloader creates the object.

piskvorky commented 5 years ago

AFAIR, it's this code in __init__.py inside the fasttext-wiki-news-subwords-300 release: https://github.com/RaRe-Technologies/gensim-data/releases/tag/fasttext-wiki-news-subwords-300
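
To look at that loader code locally, a small sketch like the one below lists what a downloaded release contains. It assumes the downloader caches each release in a folder named after the model under `~/gensim-data` (or under the directory named by the `GENSIM_DATA_DIR` environment variable, if set). The helper name `list_release_files` is made up for this example.

```python
import os
from pathlib import Path


def list_release_files(name, base_dir=None):
    """Return the sorted file names inside the cached folder for release `name`.

    `base_dir` defaults to $GENSIM_DATA_DIR or ~/gensim-data, which is an
    assumption about where the downloader stores its data.
    """
    if base_dir is not None:
        base = Path(base_dir)
    else:
        base = Path(os.environ.get("GENSIM_DATA_DIR", str(Path.home() / "gensim-data")))
    folder = base / name
    if not folder.is_dir():
        return []  # nothing has been downloaded under this name
    return sorted(p.name for p in folder.iterdir())


if __name__ == "__main__":
    print(list_release_files("fasttext-wiki-news-subwords-300"))
```

If an `__init__.py` shows up in the listing, that is the file to read to see which class the release's loader instantiates.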

@mpenkov can you confirm?

a66as commented 5 years ago

Hello from a random user. I am trying to get vectors from this model, but it raises an error saying that the word is not in the vocabulary. What should I do? Wait for a fix? Thanks
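
As a stopgap while the issue is open, lookups can be guarded so that unknown words are skipped instead of raising. This is a minimal sketch, not a fix for the missing subword support: it assumes only that the loaded object supports `vectors[word]` and raises KeyError for unknown words, as the traceback above shows. The helper name `vector_or_none` is hypothetical.

```python
def vector_or_none(vectors, word):
    """Return vectors[word], or None if the word is not in the vocabulary."""
    try:
        return vectors[word]
    except KeyError:
        return None


def known_vectors(vectors, words):
    """Map each in-vocabulary word to its vector, silently skipping the rest."""
    found = {}
    for word in words:
        vec = vector_or_none(vectors, word)
        if vec is not None:
            found[word] = vec
    return found
```

For example, `known_vectors(model, ["dog", "dogge"])` would return an entry for "dog" only, assuming "dog" is in the model's vocabulary.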

a66as commented 4 years ago

Any updates on this issue?

piskvorky commented 4 years ago

I'm curious too. @mpenkov can you please have a look? I know you reworked and clarified our FastText recently, thanks.

mpenkov commented 4 years ago

Sorry, I'm a little overwhelmed at the moment with work, travel and general end-of-year life-stuff. I'll have a look at this when I can, but I hope no-one out there is holding their breath :)

cephcyn commented 3 years ago

Has there been any update on this issue since 2019?