Closed Filco306 closed 2 years ago
I can add that this is the case, even if I set a seed and run these two separately.
Hey @Filco306, I was curious about this and gave it a look, and it looks to me like this is simply a floating-point precision issue. If instead of using np.all(vecs == vecs2)
you try assert_allclose(vecs, vecs2, atol=1e-5)
from NumPy's testing module, you'll see the assertion passes.
As an example, print(vecs - vecs2) gives me:
[[ 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
7.4505806e-09 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 3.7252903e-09
0.0000000e+00 0.0000000e+00 1.4901161e-08 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00]
[ 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 -9.3132257e-10
0.0000000e+00 0.0000000e+00 -3.7252903e-09 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 7.4505806e-09 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 -1.8626451e-09 0.0000000e+00
1.8626451e-09 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
1.8626451e-09 3.7252903e-09 0.0000000e+00 -7.4505806e-09
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
2.2351742e-08 0.0000000e+00 0.0000000e+00 0.0000000e+00
-9.3132257e-10 0.0000000e+00 0.0000000e+00 0.0000000e+00
-9.3132257e-10 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 1.8626451e-09 0.0000000e+00 -9.3132257e-10
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 3.7252903e-09 7.4505806e-09
0.0000000e+00 0.0000000e+00 4.4703484e-08 2.3283064e-10
3.7252903e-09 1.8626451e-09 0.0000000e+00 0.0000000e+00]]
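To illustrate the difference between the two checks, here is a small self-contained sketch. The arrays below are synthetic stand-ins for vecs and vecs2 (not the actual embeddings from the issue): exact elementwise equality fails on nano-scale differences like those printed above, while a tolerance-based comparison passes.

```python
import numpy as np
from numpy.testing import assert_allclose

# Synthetic stand-ins for vecs / vecs2: identical except for a tiny
# ~2e-6 perturbation, mimicking the differences printed above.
rng = np.random.default_rng(42)
vecs = rng.standard_normal((2, 100)).astype(np.float32)
vecs2 = vecs + np.float32(2e-6)

print(np.all(vecs == vecs2))             # False: exact equality is too strict
assert_allclose(vecs, vecs2, atol=1e-5)  # passes: differences are below atol
```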
I suspect this is because, under the hood, fastText uses asynchronous stochastic gradient descent (Hogwild) as its optimization algorithm. The Gensim documentation notes that setting the seed isn't enough to guarantee perfect reproducibility: you also need to limit the model to a single worker thread and possibly set the PYTHONHASHSEED environment variable.
seed (int, optional) – Seed for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread (workers=1), to eliminate ordering jitter from OS thread scheduling. (In Python 3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED environment variable to control hash randomization).
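The "ordering jitter" mentioned above matters because floating-point addition is not associative: the same gradient updates applied in a different thread interleaving can produce bitwise-different results. A minimal illustration of the underlying effect:

```python
import numpy as np

# Floating-point addition is not associative: summing the same terms
# in a different order can give a bitwise-different result.
a = np.float32(1e8)
b = np.float32(-1e8)
c = np.float32(0.5)

left = (a + b) + c   # a and b cancel first, then c survives -> 0.5
right = a + (b + c)  # c is absorbed when added to the large b -> 0.0

print(left, right, left == right)
```

This is why Hogwild-style updates from multiple threads are only reproducible up to small numerical differences, not bit-for-bit.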
Hello there! Very nice, thank you for this!
Then I will consider this closed :)
Hello,
First of all, thank you for a nice repository. I am, however, a bit troubled about one thing, which I hope can be answered here.
The order in which the data is input seems to matter for the resulting vectors, at least for the uSIF embedding function.
Consider the example below.
It gives me the output
Should this really be the case? Thank you in advance!