stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0
6.86k stars 1.51k forks source link

Ability to control the randomization #128

Closed mebrunet closed 4 years ago

mebrunet commented 6 years ago

These changes add some simple features to allow the user to control the randomization. Helpful for repeatability of experiments.

AngledLuffa commented 4 years ago

Weirdly the calculations are not exactly the same when running the same random seed on the same data. I would have thought it would be, since even considering that floating point math is not exact, the same operations done in the same order should have the same result.

mebrunet commented 4 years ago

You need to limit the computation to a single thread for complete reproducibility. Because at the moment there is nothing managing the order in which to apply the updates from different threads. My understanding is that this is a common issue when trying to reproduce multi threaded/process computations. Worker management / ordering might be worth looking into, since the embedding process runs at a crawl with -threads 1. But at least with this PR reproducibility is possible.

I'll see about whitespace.

AngledLuffa commented 4 years ago

Thank you... I missed that it was running multiple threads. Of course that makes sense. I'll recheck with that in mind.

On Tue, Jan 21, 2020 at 5:06 PM mebrunet notifications@github.com wrote:

You need to limit the computation to a single thread for complete reproducibility. Because at the moment there is nothing managing the order in which to apply the updates from different threads. My understanding is that this is a common issue when trying to reproduce multi threaded/process computations. Worker management / ordering might be worth looking into, since the embedding process runs at a crawl with -threads 1. But at least with this PR reproducibility is possible.

I'll see about whitespace.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/stanfordnlp/GloVe/pull/128?email_source=notifications&email_token=AA2AYWI32A3R46IP2ZY3MQLQ66L2ZA5CNFSM4FR2A2L2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEJR3QWY#issuecomment-576960603, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2AYWOU4PQHCXRIKR7OXADQ66L2ZANCNFSM4FR2A2LQ .

AngledLuffa commented 4 years ago

I checked in the random seed portion, as mentioned, and gave you credit. Also added that seed = 0 means it chooses a seed based on the system time.

AngledLuffa commented 4 years ago

I think there should be some mechanism so that it doesn't save params if they were loaded

I'll add that

AngledLuffa commented 4 years ago

I think there should be some mechanism so that it doesn't save params if they were loaded

I'll add that

or maybe not - it does make it easy to test and the parameters are different, at least

AngledLuffa commented 4 years ago

I love the fact that you included a test script! I took the liberty of editing the test script to use $(mktemp ...) and of merging in other changes from other pull requests. I made sure to give you credit when checking it in, though. Thanks!