oborchers / Fast_Sentence_Embeddings

Compute Sentence Embeddings Fast!
GNU General Public License v3.0

Ordering of sentences trained on matters for the inferred vectors. #46

Closed Filco306 closed 2 years ago

Filco306 commented 3 years ago

Hello,

First of all, thank you for a nice repository. I am however a bit troubled about one thing, which I hope to get answered here.

The order in which the data is inputted seems to matter for the resulting vectors, at least for the uSIF embedding model.

Consider the example below.

import numpy as np

from fse.models import uSIF
from fse import IndexedList
import gensim.downloader as api

def load_w2vec(vecs: str = "word2vec-google-news-300"):
    model = api.load(vecs)
    return model

glove = load_w2vec("glove-wiki-gigaword-100")
data = [["Hello", "there", "John"], ["Hi","everyone", "good", "day"]]
input_1 = IndexedList(data)
model = uSIF(glove, lang_freq="en")
model.train(input_1)
vecs = model.infer(input_1)

# Retrain on the same data in the same order and infer again.
model.train(input_1)
vecs2 = model.infer(input_1)

print(f"All vectors are the same: {np.all(vecs == vecs2)}")

# Feed the model the same data for training but in another order. 
input_2 = IndexedList(data[::-1])

model = uSIF(glove, lang_freq="en")
model.train(input_2)
vecs2 = model.infer(input_1) # Infer the same sentences, in the original order.
print(f"All vectors are the same: {np.all(vecs == vecs2)}")

This gives me the output

All vectors are the same: True
All vectors are the same: False

Should this really be the case? Thank you in advance!

Filco306 commented 3 years ago

I can add that this is the case, even if I set a seed and run these two separately.

grantmwilliams commented 2 years ago

Hey @Filco306, I was curious about this and gave it a look, and it seems to me that this is simply a precision issue. If instead of using np.all(vecs == vecs2) you use assert_allclose(vecs, vecs2, atol=1e-5) from numpy's testing library, you'll see the check passes.

As an example, print(vecs - vecs2) gives me:

[[ 0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   7.4505806e-09  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  3.7252903e-09
   0.0000000e+00  0.0000000e+00  1.4901161e-08  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00]
 [ 0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00 -9.3132257e-10
   0.0000000e+00  0.0000000e+00 -3.7252903e-09  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  7.4505806e-09  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00 -1.8626451e-09  0.0000000e+00
   1.8626451e-09  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   1.8626451e-09  3.7252903e-09  0.0000000e+00 -7.4505806e-09
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   2.2351742e-08  0.0000000e+00  0.0000000e+00  0.0000000e+00
  -9.3132257e-10  0.0000000e+00  0.0000000e+00  0.0000000e+00
  -9.3132257e-10  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  1.8626451e-09  0.0000000e+00 -9.3132257e-10
   0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00  3.7252903e-09  7.4505806e-09
   0.0000000e+00  0.0000000e+00  4.4703484e-08  2.3283064e-10
   3.7252903e-09  1.8626451e-09  0.0000000e+00  0.0000000e+00]] 
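For reference, here is a minimal sketch of that tolerance-based comparison, assuming vecs and vecs2 are the arrays from the snippet in the original post:

import numpy as np
from numpy.testing import assert_allclose

# Exact equality is too strict for float32 results that can differ by ~1e-8.
print(f"Exact match: {np.all(vecs == vecs2)}")
print(f"Max abs difference: {np.abs(vecs - vecs2).max()}")

# Tolerance-based check: raises an AssertionError only if some element
# differs by more than atol.
assert_allclose(vecs, vecs2, atol=1e-5)
print("Vectors are equal within tolerance.")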

I suspect this is because, under the hood, fastText uses asynchronous stochastic gradient descent (Hogwild) as the optimization algorithm. The Gensim documentation notes that setting the seed isn't enough to guarantee perfect reproducibility; you also need to limit the model to a single worker thread and possibly set the PYTHONHASHSEED environment variable:

seed (int, optional) – Seed for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread (workers=1), to eliminate ordering jitter from OS thread scheduling. (In Python 3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED environment variable to control hash randomization).
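Applied to the snippet from the original post, a deterministic setup could look roughly like the sketch below. The workers argument is an assumption carried over from the gensim convention (I have not checked whether fse's uSIF exposes it), and PYTHONHASHSEED must be set before the interpreter starts, e.g. PYTHONHASHSEED=0 python script.py.

# Run with: PYTHONHASHSEED=0 python script.py
# (the variable must be set before the interpreter starts)
import gensim.downloader as api
from fse.models import uSIF
from fse import IndexedList

glove = api.load("glove-wiki-gigaword-100")
data = [["Hello", "there", "John"], ["Hi", "everyone", "good", "day"]]

# workers=1 keeps training on a single thread, eliminating ordering jitter
# from OS thread scheduling (assumption: fse's uSIF accepts a `workers`
# argument like its gensim counterparts).
model = uSIF(glove, workers=1, lang_freq="en")
model.train(IndexedList(data))
vecs = model.infer(IndexedList(data))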

Filco306 commented 2 years ago

Hello there! Very nice, thank you for this!

Filco306 commented 2 years ago

Then I will consider this closed :)