stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

Fix nan during training #149

Closed · ndrewl closed this 4 years ago

ndrewl commented 5 years ago

This PR fixes two issues with NaNs during training.

First, the checks for NaNs in glove.c did not work with the compiler options from the Makefile: -Ofast disables checks for NaNs and +/-Infs. The default is changed to -O3 so that users at least see the error messages during training.

Second, the default learning rate causes exploding gradients on some data and, as a result, +/-Infs and NaNs. People on the internet recommend lowering the learning rate in that case, but that solution is not ideal: users have to get NaNs in their embeddings, search for a fix, and then train again. This PR adds a gradient-component clipping parameter with a reasonable default, which allows out-of-the-box training with arbitrary initial learning rates (including the default) on arbitrary data, with no fiddling required.
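To make the clipping idea concrete, here is a minimal, self-contained sketch; the helper name clip_component is made up, the default of 10 is the value discussed below, and the actual wiring inside glove.c's update loop may differ.

```c
#include <stdio.h>

typedef double real;   /* glove.c uses a similar floating-point typedef */

/* Clip a single gradient component to the range [-limit, limit]. */
static real clip_component(real value, real limit) {
    if (value > limit) return limit;
    if (value < -limit) return -limit;
    return value;
}

int main(void) {
    real grad_clip_value = 10.0;                    /* default discussed in this PR */
    real raw_updates[] = {0.3, -0.7, 42.0, -1e6};   /* made-up example components */
    for (int i = 0; i < 4; i++)
        printf("%g -> %g\n", raw_updates[i], clip_component(raw_updates[i], grad_clip_value));
    return 0;
}
```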

AngledLuffa commented 4 years ago

Mind rebasing the two "Add gradient components" changes?

BTW please excuse the delay

ndrewl commented 4 years ago

@AngledLuffa, done.

AngledLuffa commented 4 years ago

Here's the main problem: having just started looking at this project, I have no idea whether 10 is a reasonable clip value or whether there should even be a default clip at all. Do you have a justification for 10 as the default maximum gradient?

AngledLuffa commented 4 years ago

Thank you for the pull request and for simplifying it at my request, btw.

ndrewl commented 4 years ago

Good point. No, there is no theoretical justification for 10. I chose a finite number that works in my case and doesn't break the GloVe test pipeline. I agree that 10 seems a bit "magical", but is it any more magical than eta=0.05 or vector_size=50?

I think there should be some default clipping, because otherwise GloVe will produce NaNs on some data, and this is the problem I'm trying to address in this PR.

ndrewl commented 4 years ago

I can move the hardcoded 10 out of glove.c and into the demo.sh script, and set the default to -1 in glove.c (meaning "disabled"). That way there would be an option to disable gradient clipping entirely. What do you think?
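For illustration, a rough, hypothetical sketch of the negative-means-disabled convention described above (the helper name is made up and not from glove.c):

```c
typedef double real;

/* Hypothetical helper: a negative grad_clip_value (e.g. -1, as suggested
   above) disables clipping; otherwise clip to [-grad_clip_value, grad_clip_value]. */
static real maybe_clip(real value, real grad_clip_value) {
    if (grad_clip_value < 0) return value;               /* clipping disabled */
    if (value > grad_clip_value) return grad_clip_value;
    if (value < -grad_clip_value) return -grad_clip_value;
    return value;
}
```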

AngledLuffa commented 4 years ago

I think I prefer it as a constant in glove.c. I just want to run it on a smallish dataset and print out the largest gradient, so I have an idea of what kinds of values are normal.
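For anyone who wants to run the same check, here is a minimal standalone sketch of tracking the largest gradient-component magnitude; the sample values are made up, and in glove.c the inputs would be the per-dimension updates computed in the training loop:

```c
#include <stdio.h>
#include <math.h>

typedef double real;

int main(void) {
    /* Made-up gradient components standing in for the per-dimension
       updates computed during training. */
    real sample_updates[] = {0.12, -0.63, 0.70, -0.05};
    real max_abs = 0.0;
    for (int i = 0; i < 4; i++) {
        real a = fabs(sample_updates[i]);
        if (a > max_abs) max_abs = a;
    }
    printf("largest |gradient component| seen: %g\n", max_abs);
    return 0;
}
```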

AngledLuffa commented 4 years ago

On the sample dataset, the max value is ~0.6 or 0.7, so maybe I'll make the default 100, which should make it less likely to clip under normal circumstances.

Thanks for the fix!