rubick1896 · closed 4 years ago
I have been getting "underflow in dt 0.0" after a few epochs of training. I am using Adam and a learning rate of 1e-5 (decreased from 1e-3; still not working). Any idea why this is happening? Do you have any suggestions for avoiding this type of error?
This type of error is caused by "blow up" of the ordinary differential equations. If you use the Adam method, the step size becomes too small to continue, and the code returns the "underflow" error.
What exactly does "blow up" mean? What can I do to prevent this error from happening?
It is a problem of the stability of ordinary differential equations (ODEs for short). If you want to prevent it, here are two common ways. (1) Make the ODEs more stable; sorry, I am busy right now, but if you use "blow up ODEs" as keywords in Google, the results on the first page should give you a rough understanding. (2) Use a fixed-step numerical method (RK4 etc.); even if the numbers become very large, "dt" will not become too small to continue. The side effect is that it will return NaN if you have a large integration interval, so you may want to avoid some points or keep your integration interval from getting too large. We met the same problem in our work, https://arxiv.org/pdf/2005.04849.pdf, where we use small intervals and RK4 to avoid the blow-up and the "dt" underflow. (We don't talk much about stability in the article, so this is just an ad :)
First of all, I just want to make clear that this isn't an issue with the optimiser, or the learning rate, despite what the previous commenter says. I don't think I believe their comments about numerical methods either, but that is at least related to how to fix this.
In terms of what's going wrong - you're solving the CDE with an adaptive solver that accepts very little error. Whatever CDE you have defined, however, is too hard to solve whilst only making that small an error, and this underflow appears as a result.
So, to fix this, you've got a few options.
Option 1 is just to catch the exception, ignore it, and keep training. Very often, as a result of using a different batch, or because the model weights get adjusted, the model becomes solvable again. That said, I wouldn't usually recommend this option.
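As a sketch of what that looks like in a training loop (the exception type is an assumption on my part; at the time of writing torchdiffeq signals this underflow via an assertion, and `model`, `loss_fn`, `dataloader` and `optimizer` here are hypothetical placeholders):

```python
# Option 1 (sketch): skip any batch whose CDE solve underflows.
for batch, label in dataloader:          # hypothetical data loader
    optimizer.zero_grad()
    try:
        pred = model(batch)              # internally calls cdeint with an adaptive solver
    except AssertionError as e:          # torchdiffeq asserts on "underflow in dt"
        if 'underflow in dt' in str(e):
            continue                     # ignore this batch and keep training
        raise                            # any other assertion is a real bug
    loss = loss_fn(pred, label)
    loss.backward()
    optimizer.step()
```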
Option 2 is simply to loosen the tolerances. If you do `cdeint(..., atol=0.01, rtol=0.01)`, for example, then you allow the adaptive solver to make larger errors in the forward pass. Note that this 'larger error' is a larger error with respect to the mathematical description of the differential equation. It need not affect empirical performance at all. This is a perfectly sensible resolution to the problem.
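Concretely, that call might look like the following sketch. The names `X`, `func`, `z0` and `t` are placeholders for whatever you already pass to `cdeint` (the exact argument names depend on the library version); the extra keyword arguments are forwarded to the underlying adaptive solver:

```python
# Option 2 (sketch): relax atol/rtol so the adaptive solver can take
# larger steps without triggering the underflow check.
z = cdeint(X=X, func=func, z0=z0, t=t, atol=1e-2, rtol=1e-2)
```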
Option 3 is to switch to a fixed solver. The way in which it's going to solve the differential equation is fixed in advance, so it never keeps track of error at all. This may make an arbitrarily large error (again, with respect to the mathematical description of the differential equation), but again, that isn't related to empirical performance. This can be done via
```
cdeint(..., method='rk4', options=dict(step_size=<some number>))
```
where you need to pick a step size for the fixed solver. This is also a perfectly sensible resolution to the problem.
Option 3 is actually what we use in the paper, where we take the step size to equal the smallest gap between observations.
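A sketch of option 3 with that step-size choice (again `X`, `func` and `z0` are placeholders, and `t` is assumed to hold the sorted observation times):

```python
import torch

# Suppose t holds the (sorted) observation times, e.g.:
t = torch.tensor([0.0, 0.3, 0.5, 1.1, 1.4])

# Step size heuristic from the paper: the smallest gap between observations.
step_size = (t[1:] - t[:-1]).min().item()

# Option 3 (sketch): a fixed-step RK4 solve never adapts dt, so it cannot underflow.
z = cdeint(X=X, func=func, z0=z0, t=t,
           method='rk4', options=dict(step_size=step_size))
```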
It's worth noting that neither the problem nor the solutions are specific to Neural CDEs. They all apply to the original Neural ODEs as well! If you've used the torchdiffeq library before then the arguments used in options 2 and 3 should seem familiar, as they're what you would do in that library as well.
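For comparison, here are the same two knobs in plain torchdiffeq, using its documented odeint interface and a toy vector field just to make the example self-contained:

```python
import torch
from torchdiffeq import odeint

class Decay(torch.nn.Module):
    """Toy vector field dy/dt = -y, purely illustrative."""
    def forward(self, t, y):
        return -y

func = Decay()
y0 = torch.tensor([1.0])
t = torch.linspace(0., 1., 10)

# Option 2: relax the adaptive solver's tolerances (defaults are ~1e-7/1e-9).
y_loose = odeint(func, y0, t, atol=1e-2, rtol=1e-2)

# Option 3: fixed-step RK4, which never adjusts dt and so cannot underflow.
y_fixed = odeint(func, y0, t, method='rk4', options=dict(step_size=0.1))
```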