How to use fastText for out of sample words?

piskvorky / gensim-data

Data repository for pretrained NLP models and NLP corpora.

https://rare-technologies.com/new-api-for-pretrained-nlp-models-and-datasets-in-gensim/

GNU Lesser General Public License v2.1


How to use fastText for out of sample words? #26

Closed: shgidi closed this issue 6 years ago

shgidi commented 6 years ago

When downloading fastText with this method, we get a folder with a file in standard word2vec format, which can be loaded with:

model = KeyedVectors.load_word2vec_format(path, binary=False)

But not with:

from gensim.models import FastText
model = FastText.load_fasttext_format(path, binary=False)

This disables the ability to get vectors for out-of-vocabulary words. How can this be done correctly?

menshikh-iv commented 6 years ago

@shgidi

Facebook distributes two types of files:

.vec contains ONLY word vectors (no n-grams) and can be loaded with KeyedVectors.load_word2vec_format
.bin contains the n-gram vectors as well and can be loaded with FastText.load_fasttext_format
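As a rough sketch of the two loading paths above: the helper name load_fasttext_vectors is hypothetical, and gensim must be installed for the loaders to run. The .bin branch first tries load_facebook_vectors, which newer gensim releases provide, and falls back to the FastText.load_fasttext_format call mentioned in this thread. Only the .bin path keeps the n-gram vectors that out-of-vocabulary lookups depend on.

```python
from pathlib import Path


def load_fasttext_vectors(path):
    """Pick a gensim loader from the file extension (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".vec":
        # Plain word2vec text format: whole-word vectors only, so looking up
        # a word outside the vocabulary raises KeyError.
        from gensim.models import KeyedVectors
        return KeyedVectors.load_word2vec_format(path, binary=False)
    if suffix == ".bin":
        # Native fastText binary: also stores the n-gram vectors used to
        # build vectors for out-of-vocabulary words.
        try:
            # Newer gensim releases.
            from gensim.models.fasttext import load_facebook_vectors
            return load_facebook_vectors(path)
        except ImportError:
            # Older gensim releases, as used in this thread.
            from gensim.models import FastText
            return FastText.load_fasttext_format(path).wv
    raise ValueError("expected a .vec or .bin file, got: %r" % str(path))
```

With a .bin file loaded this way, indexing the result with a word that never appeared in training should return a vector composed from its character n-grams, whereas the same lookup on a .vec file fails.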

Next time, please ask on the mailing list.

piskvorky commented 6 years ago

@menshikh-iv is this clear from our documentation?

I see people confused about these formats, how to load them and what can be done with them, all the time.

A clear, authoritative docs section would help us with support too (we could just point to it with a hyperlink).

menshikh-iv commented 6 years ago

@piskvorky I agree, this situation comes up from time to time. It is worth making a tutorial.

piskvorky commented 6 years ago

A tutorial would be ideal, but a simple paragraph in the docs would go a long way. Can you add it?

scottlittle commented 6 years ago

This is not working for me with gensim 3.5, python 3.6, and a FB model:

from gensim.models import FastText
model_yelp = FastText.load_fasttext_format('yelp_review_full.bin')

I get: NotImplementedError: Supervised fastText models are not supported

menshikh-iv commented 6 years ago

@scottlittle please read the exception again: we really don't support supervised fastText models
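To check up front whether a .bin file is a supervised model, one option is to peek at its header. The sketch below is only an illustration under assumptions: it assumes the layout written by the fastText C++ code (a magic number and a version, followed by twelve 32-bit integer arguments with the model type in the eighth position, where 3 means supervised), little-endian byte order, and the magic value shown. The function name fasttext_model_type is hypothetical, and older files without the magic number are rejected rather than parsed.

```python
import struct

# Assumed values, recalled from the fastText C++ source; verify before relying on them.
FASTTEXT_MAGIC = 793712314
MODEL_NAMES = {1: "cbow", 2: "skipgram", 3: "supervised"}


def fasttext_model_type(path):
    """Return the model type recorded in a fastText .bin header (hypothetical helper)."""
    with open(path, "rb") as fh:
        header = fh.read(8 + 12 * 4)
    if len(header) < 56:
        raise ValueError("file too short to contain a fastText header")
    magic, _version = struct.unpack("<2i", header[:8])
    if magic != FASTTEXT_MAGIC:
        raise ValueError("no fastText magic number found (old or foreign format)")
    # Assumed order: dim, ws, epoch, minCount, neg, wordNgrams, loss, model, ...
    args = struct.unpack("<12i", header[8:56])
    return MODEL_NAMES.get(args[7], "unknown")
```

If this returns "supervised", the gensim loader is expected to raise the NotImplementedError shown above, and the official fastText Python bindings are the alternative.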

scottlittle commented 6 years ago

@shgidi https://github.com/facebookresearch/fastText/tree/master/python worked for me.

romass12 commented 6 years ago

What is meant by supervised fastText models, and how do I train an unsupervised one?

menshikh-iv commented 6 years ago

@romass12

"supervised fasttext models"

Exactly what supervised learning means. The FB implementation supports a supervised mode (gensim supports only unsupervised).

"how to train for unsupervised"

Just read the Gensim FastText documentation.
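For orientation, a minimal sketch of unsupervised training with gensim might look like the following. The function name train_unsupervised and the toy corpus are made up for illustration, and the hyperparameters are arbitrary. The vector size is left at its default because the name of that parameter differs between gensim releases.

```python
def train_unsupervised(sentences):
    """Train an unsupervised (skip-gram) FastText model and return its word vectors."""
    from gensim.models import FastText

    # sentences: a list of token lists. No labels are involved, which is what
    # makes the training unsupervised.
    model = FastText(sentences, sg=1, min_count=1, window=3, min_n=3, max_n=6)
    return model.wv


# Toy corpus, invented for this example.
TOY_CORPUS = [
    ["the", "night", "was", "quiet"],
    ["a", "quiet", "night", "in", "the", "city"],
    ["the", "city", "sleeps", "at", "night"],
]
```

Calling train_unsupervised(TOY_CORPUS) and then indexing the result with a word such as "nights", which is not in the corpus but shares character n-grams with "night", should return a vector rather than raise KeyError.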