Closed jimmiebtlr closed 3 years ago
Hi @jimmiebtlr, thanks for opening the issue! One quick test: would you mind running with AdamW (i.e. use_madgrad=False, or just omit the param)? The MADGRAD side of the code is less explored at the moment than the AdamW core, so if the issue reproduces on only one side, that would help narrow down where to investigate.
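For clarity, a minimal sketch of the A/B test being suggested. The optimizer constructor and its other arguments here are hypothetical; only the use_madgrad flag comes from the thread:

```python
def make_optimizer_config(use_madgrad=False, lr=1e-3):
    """Hypothetical config helper mirroring the use_madgrad flag.

    False (the default) selects the AdamW core path; True selects the
    less-explored MADGRAD path that is under investigation here.
    """
    core = "madgrad" if use_madgrad else "adamw"
    return {"core": core, "lr": lr}

# Run the same training script twice, once per core, and see which
# side (if either) reproduces the NaN.
adamw_cfg = make_optimizer_config()                 # AdamW core
madgrad_cfg = make_optimizer_config(use_madgrad=True)  # MADGRAD core
```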
I'll give that a shot. It takes some time to reproduce, so I'll post back once I find out one way or the other. Thanks!
It's still coming up, though rarely, without the use_madgrad param as well. I'll make a local copy and see if I can track it down further.
It's definitely happening somewhere in the gradient calculation.
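One way to pin down which training step first produces a bad gradient is to scan the gradients for NaN/Inf each step. A minimal, framework-free sketch of that check (the helper name and the list-of-lists input format are assumptions for illustration):

```python
import math

def first_bad_step(grads_per_step):
    """Return the index of the first step whose gradients contain a
    NaN or Inf, or None if all gradients are finite.

    grads_per_step: a list where each element is the flattened list of
    gradient values observed at that training step.
    """
    for step, grads in enumerate(grads_per_step):
        if any(not math.isfinite(g) for g in grads):
            return step
    return None
```

In PyTorch specifically, `torch.autograd.set_detect_anomaly(True)` can serve the same purpose by raising at the op that produced the NaN, at the cost of slower training.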
Hi @jimmiebtlr - I've just checked in a softplus transform update along with gradient normalization. That should help here, since it boosts very small values, and it also set a new high on our small benchmark dataset.
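To illustrate why a softplus transform helps here, the sketch below shows how softplus lifts a near-zero denominator term away from zero (where a division would otherwise blow up to Inf/NaN) while leaving larger values essentially unchanged. The exact place the checked-in code applies this transform, and the beta value, are assumptions:

```python
import math

def softplus(x, beta=50.0):
    """Numerically stable softplus: log(1 + exp(beta*x)) / beta.

    beta controls sharpness; for large beta*x this is ~= x, so large
    inputs pass through almost unchanged.
    """
    bx = beta * x
    if bx > 20.0:  # avoid overflow; softplus(x) ~= x in this regime
        return x
    return math.log1p(math.exp(bx)) / beta

# A tiny second-moment value gets lifted well away from zero...
lifted = softplus(0.0)       # > 0, so dividing by it stays finite
# ...while a normal-sized value is left essentially untouched.
passthrough = softplus(1.0)  # ~= 1.0
```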
@jimmiebtlr and @lessw2020
I also ran into this issue. Running with SGD works without any problem, so my loss function seems to be fine.
Any update on this?
The issue was in my own code from what I recall.
Not certain this is a bug yet, but I'm getting this rarely after a while of training, and I haven't found an issue on my side. The input to the loss function looks good (no NaNs). I'm working with a fairly complex loss function though, so it's very possible I have a rare bug in my code.
I'm using the following options
I've seen this with batch sizes from 4 to 128 so far, so it doesn't seem to depend on batch size.