oborchers / Fast_Sentence_Embeddings

Compute Sentence Embeddings Fast!
GNU General Public License v3.0

Usif does not work with small data? #31

Closed lfomendes closed 4 years ago

lfomendes commented 4 years ago

I'm trying to test uSIF, but I'm getting an error in the SVD step about NaN values in the vector. I took the Average example and changed it to use uSIF:

```
from gensim.models import FastText
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
ft = FastText(sentences, min_count=1, size=10)

from fse.models import uSIF
from fse import IndexedList
model = uSIF(ft, components=1)
model.train(IndexedList(sentences))
```

The following error occurs during the fit() of the TruncatedSVD:

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

I'm using Python 3.6.5 :: Anaconda, Inc.

oborchers commented 4 years ago

Yes! The lang_freq argument is missing. The model should be created like this:

model = uSIF(ft, components=1, lang_freq="en")

I've added an error that is raised when you forget to specify the argument:

RuntimeError: Encountered nan values. This likely happens because the word frequency information is wrong/missing. Consider restarting using lang_freq argument to infer frequency.

With lang_freq specified, training should complete:

>>> model = uSIF(ft, components=1, lang_freq="en")
>>> model.train(IndexedList(sentences))
(2, 6)