Closed mebrunet closed 4 years ago
Weirdly the calculations are not exactly the same when running the same random seed on the same data. I would have thought it would be, since even considering that floating point math is not exact, the same operations done in the same order should have the same result.
You need to limit the computation to a single thread for complete reproducibility. Because at the moment there is nothing managing the order in which to apply the updates from different threads. My understanding is that this is a common issue when trying to reproduce multi threaded/process computations. Worker management / ordering might be worth looking into, since the embedding process runs at a crawl with -threads 1
. But at least with this PR reproducibility is possible.
I'll see about whitespace.
Thank you... I missed that it was running multiple threads. Of course that makes sense. I'll recheck with that in mind.
On Tue, Jan 21, 2020 at 5:06 PM mebrunet notifications@github.com wrote:
You need to limit the computation to a single thread for complete reproducibility. Because at the moment there is nothing managing the order in which to apply the updates from different threads. My understanding is that this is a common issue when trying to reproduce multi threaded/process computations. Worker management / ordering might be worth looking into, since the embedding process runs at a crawl with -threads 1. But at least with this PR reproducibility is possible.
I'll see about whitespace.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/stanfordnlp/GloVe/pull/128?email_source=notifications&email_token=AA2AYWI32A3R46IP2ZY3MQLQ66L2ZA5CNFSM4FR2A2L2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEJR3QWY#issuecomment-576960603, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2AYWOU4PQHCXRIKR7OXADQ66L2ZANCNFSM4FR2A2LQ .
I checked in the random seed portion, as mentioned, and gave you credit. Also added that seed = 0 means it chooses a seed based on the system time.
I think there should be some mechanism so that it doesn't save params if they were loaded
I'll add that
I think there should be some mechanism so that it doesn't save params if they were loaded
I'll add that
or maybe not - it does make it easy to test and the parameters are different, at least
I love the fact that you included a test script! I took the liberty of editing the test script to use $(mktemp ...) and of merging in other changes from other pull requests. I made sure to give you credit when checking it in, though. Thanks!
These changes add some simple features to allow the user to control the randomization. Helpful for repeatability of experiments.