oborchers / Fast_Sentence_Embeddings

Compute Sentence Embeddings Fast!
GNU General Public License v3.0
618 stars 83 forks

GENSIM KeyedVectors and downloadable Models #16

Closed · ghost closed 4 years ago

ghost commented 5 years ago

It appears that when I download any model from the gensim downloader API, or save a Word2Vec model and re-load it in KeyedVectors format, the vocab object stores a reverse index in the "count" attribute. For example, if the model has 10 words, the first word has a count of 10 and an index of 0.

Using the following code:

```python
word_vectors = api.load('glove-wiki-gigaword-100')
sif_model = uSIF(model=word_vectors)
```

In word_vectors.wv.vocab, the first word is "the" with count = 400000 and index = 0. For each succeeding word in the model, the count goes down by one and the index goes up by one.

Clearly this is not the frequency information.
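For reference, this matches what gensim (3.x) does when a vector file carries no frequency column: each word gets a placeholder count of vocab_size - index. A toy reproduction of the pattern, without downloading a model:

```python
# Toy reproduction of gensim 3.x's placeholder counts: when a vector file
# has no frequency information, each word gets count = vocab_size - index.
words = ["the", "of", "and", "to", "a"]  # most frequent first, as in GloVe
vocab_size = len(words)
vocab = {w: {"index": i, "count": vocab_size - i} for i, w in enumerate(words)}

print(vocab["the"])  # {'index': 0, 'count': 5} -- a reverse index, not a frequency
```

With a 400,000-word GloVe vocabulary this yields exactly the count = 400000 for "the" described above.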

I took this example from your Jupyter notebook, so I am assuming that something has changed with the models themselves? Any guidance on this would be helpful. I CAN create my own word2vec models, and those have the frequency values as expected, and the precalculation works as expected.

Thanks for any thoughts or guidance on this. Perhaps it is normal that none of these models retain word frequencies.

Thanks,

Michael Wade

ghost commented 5 years ago

I left out a key point: your tutorial.ipynb fails if you use uSIF instead of SIF because of this (see the error dump below).

```
in
----> 1 model.train(s)

~/PycharmProjects/fse_test/venv/lib/python3.7/site-packages/fse/models/base_s2v.py in train(self, sentences, update, queue_factor, report_delay)
    640
    641         # Preform post-tain calls (i.e principal component removal)
--> 642         self._post_train_calls()
    643
    644         self._log_train_end(eff_sentences=eff_sentences, eff_words=eff_words, overall_time=overall_time)

~/PycharmProjects/fse_test/venv/lib/python3.7/site-packages/fse/models/usif.py in _post_train_calls(self)
     79         """ Function calls to perform after training, such as computing eigenvectors """
     80         if self.components > 0:
---> 81             self.svd_res = compute_principal_components(self.sv.vectors, components=self.components)
     82             self.svd_weights = (self.svd_res[0] ** 2) / (self.svd_res[0] ** 2).sum().astype(REAL)
     83             remove_principal_components(self.sv.vectors, svd_res=self.svd_res, weights=self.svd_weights, inplace=True)

~/PycharmProjects/fse_test/venv/lib/python3.7/site-packages/fse/models/utils.py in compute_principal_components(vectors, components)
     32     start = time()
     33     svd = TruncatedSVD(n_components=components, n_iter=7, random_state=42, algorithm="randomized")
---> 34     svd.fit(vectors)
     35     elapsed = time()
     36     logger.info(f"computing {components} principal components took {int(elapsed-start)}s")

~/PycharmProjects/fse_test/venv/lib/python3.7/site-packages/sklearn/decomposition/truncated_svd.py in fit(self, X, y)
    139         Returns the transformer object.
    140         """
--> 141         self.fit_transform(X)
    142         return self
    143

~/PycharmProjects/fse_test/venv/lib/python3.7/site-packages/sklearn/decomposition/truncated_svd.py in fit_transform(self, X, y)
    158         """
    159         X = check_array(X, accept_sparse=['csr', 'csc'],
--> 160                         ensure_min_features=2)
    161         random_state = check_random_state(self.random_state)
    162

~/PycharmProjects/fse_test/venv/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    540     if force_all_finite:
    541         _assert_all_finite(array,
--> 542                            allow_nan=force_all_finite == 'allow-nan')
    543
    544     if ensure_min_samples > 0:

~/PycharmProjects/fse_test/venv/lib/python3.7/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan)
     54             not allow_nan and not np.isfinite(X).all()):
     55         type_err = 'infinity' if allow_nan else 'NaN, infinity'
---> 56         raise ValueError(msg_err.format(type_err, X.dtype))
     57     # for object dtype data, we only check for NaNs (GH-13254)
     58     elif X.dtype == np.dtype('object') and not allow_nan:

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
```
oborchers commented 4 years ago

Hi @mwade625,

About your first point: could you try again with the following argument: `uSIF(model=word_vectors, lang_freq="en")`? Pre-trained models often don't ship with frequency information, and `lang_freq` induces word-frequency information into a loaded model.
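For intuition, rank-based frequency induction can be sketched with a Zipf approximation (a hypothetical helper, not fse's actual implementation, which uses real per-language frequency data):

```python
# Hypothetical sketch (not fse's code): derive plausible counts from word
# rank via Zipf's law, count ~ C / (rank + 1), so frequencies decay with rank.
def induce_zipf_counts(words, c=1_000_000):
    return {w: max(1, c // (rank + 1)) for rank, w in enumerate(words)}

counts = induce_zipf_counts(["the", "of", "and"])
# counts == {'the': 1000000, 'of': 500000, 'and': 333333}
```

The point is that any monotonically decreasing mapping from rank to count restores the property uSIF needs: frequent words get large counts, rare words small ones.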

I have to check the second posting though.

ghost commented 4 years ago

Thanks, I was able to work around this by using the pre-calculated English language frequencies. I was just surprised that the tutorial failed.

oborchers commented 4 years ago

@mwade625 Oh yes you are right! I can replicate the error! Much appreciated. Will look into this

oborchers commented 4 years ago

@mwade625 I've implemented a fix for this. In the future, you will be notified to use a model with valid word-frequency information, and if you don't, fse will raise a runtime error telling you to infer the frequencies via the lang_freq argument. The tutorial works as well now. Pushed to the develop branch.
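A minimal sketch of what such a validity check could look like (a hypothetical helper, not fse's actual code): counts that form an exact reverse index of the vocabulary are almost certainly placeholders rather than real frequencies.

```python
# Hypothetical sketch (not fse's implementation): detect the placeholder
# pattern count == vocab_size - index that appears when a model file
# carries no real frequency information.
def looks_like_placeholder_counts(counts):
    """Return True if every count is exactly vocab_size - index."""
    n = len(counts)
    return all(c == n - i for i, c in enumerate(counts))

looks_like_placeholder_counts([5, 4, 3, 2, 1])         # True: reverse index, no real frequencies
looks_like_placeholder_counts([1000, 350, 120, 7, 2])  # False: plausible real counts
```

A check like this is what would let a library fail fast with a clear message instead of producing NaNs deep inside the SVD step.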