I left out a key point: your tutorial.ipynb fails if you use uSIF instead of SIF because of this. (See error dump below)
Hi @mwade625,
About your first point: could you try again with the following argument: uSIF(model=word_vectors, lang_freq="en")? Pre-trained models often don't come with frequency information; lang_freq induces word-frequency information into a loaded model.
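For reference, a minimal end-to-end sketch of that suggestion (assuming gensim's downloader API and the fse package are installed):

import gensim.downloader as api
from fse.models import uSIF

# Pre-trained vectors ship without corpus frequency counts
word_vectors = api.load('glove-wiki-gigaword-100')

# lang_freq="en" injects pre-computed English word frequencies
# into the loaded model so uSIF can weight words properly
model = uSIF(model=word_vectors, lang_freq="en")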
I have to check the second posting though.
Thanks, I was able to work around this by using the pre-calculated English language frequencies. I was just surprised that the tutorial failed.
@mwade625 Oh yes you are right! I can replicate the error! Much appreciated. Will look into this
@mwade625 I've implemented a fix for this. In the future you will be notified if your model lacks valid word-frequency information, and fse will raise a RuntimeError telling you to infer the frequencies via the lang_freq argument. The tutorial now works as well. Pushed to the develop branch.
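In case it helps, the detection boils down to something like this (a rough sketch of the idea, not the actual fse code; gensim 3.x attribute names):

def has_plausible_frequencies(wv):
    # Collect the stored counts in index order
    counts = [wv.vocab[word].count for word in wv.index2word]
    # A model loaded without real frequencies instead shows a
    # descending ramp: vocab_size, vocab_size - 1, ..., 1
    return counts != list(range(len(counts), 0, -1))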
It appears that when I download any model from the downloader API in gensim, or save a Word2Vec model and re-load it in KeyedVectors format, the vocab object stores a reverse index in the "count" attribute. So for example, if I have 10 words in the model, the first word has a count of 10 and an index of 0.
Using the following code:

import gensim.downloader as api
from fse.models import uSIF

word_vectors = api.load('glove-wiki-gigaword-100')
sif_model = uSIF(model=word_vectors)
word_vectors.wv.vocab shows the first word to be "the" with count = 400000 and index = 0. For each succeeding word in the model, the count goes down by one and the index goes up by one.
Clearly this is not the frequency information.
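To illustrate, inspecting the first few entries (gensim 3.x attribute names; the exact words after "the" may differ):

import gensim.downloader as api

word_vectors = api.load('glove-wiki-gigaword-100')
for word in word_vectors.index2word[:3]:
    entry = word_vectors.vocab[word]
    print(word, entry.count, entry.index)
# Each line shows count == vocab_size - index, e.g. "the 400000 0",
# i.e. a reverse rank rather than a corpus frequency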
I took this example from your Jupyter notebook, so I am assuming that something has changed with the models themselves? Any guidance on this would be helpful. I CAN create my own word2vec models, and those have the frequency values as expected, so the pre-calculation works as expected.
Thanks for any thoughts or guidance on this. Perhaps it is normal that none of these models retain the word frequencies.
Thanks,
Michael Wade