Closed: ndrewl closed this pull request 4 years ago
Mind rebasing the two "Add gradient components" changes?
BTW please excuse the delay
@AngledLuffa, done.
Here's the main problem - having just started looking at this project, I have no idea if 10 is a reasonable clip or if there should even be a default clip value. Do you have a justification for 10 as the default maximum gradient?
Thank you for the pull request and for simplifying it at my request, btw.
Good point. No, there is no theoretical justification for 10. I've chosen some finite number that works in my case and doesn't break the GloVe test pipeline. I agree that 10 seems a bit "magical" though. But is it more magical than eta=0.05, or vector_size=50?
I think there should be some default clipping, because otherwise GloVe will produce NaNs on some data, and this is the problem I'm trying to address in this PR.
I can move the hardcoded 10 out of glove.c into the demo.sh script, and set it to -1 in glove.c (meaning "disabled"). That way we'd also have an option to disable gradient clipping entirely. What do you think?
I think I prefer it as a constant in glove.c. I just want to run it on a smallish dataset and print out the largest gradient so I have an idea of what kind of values are normal.
On the sample dataset, the max value is ~0.6 or 0.7, so I'll make the default 100, which should make it unlikely to clip anything under normal circumstances.
Thanks for the fix!
This PR fixes two issues with NaNs during training.

First, the checks for NaNs in glove.c did not work with the compiler options in the Makefile: -Ofast disables checks for NaNs and +-Infs. The default is now -O3, so that users at least see the error messages during training.

Second, the default learning rate causes exploding gradients on some data, and, as a result, +-Infs and NaNs. The usual advice online is to lower the learning rate, but that solution is not ideal: users first get NaNs in their embeddings, then have to search for a fix, then train again. This PR adds a gradient-component clipping parameter with a reasonable default, which allows training out of the box with arbitrary initial learning rates (including the default) on arbitrary data, with no fiddling required.