darleybarreto opened this issue 3 years ago
Hi there! Glad you like it. So the traffic here is low-volume enough that I'm happy to field these kinds of queries just via issues, rather than creating a whole separate tab. What kind of issue are you having?
The instances of my dataset are multi-dimensional; each feature is a time series sampled at the same rate (at the same times). Given the sampling rate, I know when a piece of data is missing for any instance, and since all features are sampled together, they are all missing for a particular time. I therefore found your example of irregular data very useful, because I have missing data and the instances may have different sizes.
Unfortunately I can't show the code, but it's very similar to irregular_data.py. Before calling

coeffs = torchcde.hermite_cubic_coefficients_with_backward_differences(x)

x has the shape N x Time_idx x (N_features + N_masks), where:

- N is the number of instances
- Time_idx is the index indicating the order of the series (from 0 to 1)
- N_features is the number of features of the instances (collected at the same time)
- N_masks is the masks given by the cumsum of the non-NaNs in the features; in the example you used maska and maskb, so there is one mask for each feature.

Calling cdeint gives:
Traceback (most recent call last):
  File "XXXXXX", line YYYY, in forward
    zt = torchcde.cdeint(X=X, func=self.func, z0=z0, t=X.interval)
  File "lib/site-packages/torchcde/solver.py", line 227, in cdeint
    out = odeint(func=vector_field, y0=z0, t=t, **kwargs)
  File "lib/site-packages/torchdiffeq/_impl/adjoint.py", line 198, in odeint_adjoint
    ans = OdeintAdjointMethod.apply(shapes, func, y0, t, rtol, atol, method, options, event_fn, adjoint_rtol, adjoint_atol,
  File "lib/site-packages/torchdiffeq/_impl/adjoint.py", line 25, in forward
    ans = odeint(func, y0, t, rtol=rtol, atol=atol, method=method, options=options, event_fn=event_fn)
  File "lib/site-packages/torchdiffeq/_impl/odeint.py", line 77, in odeint
    solution = solver.integrate(t)
  File "lib/site-packages/torchdiffeq/_impl/solvers.py", line 30, in integrate
    solution[i] = self._advance(t[i])
  File "lib/site-packages/torchdiffeq/_impl/rk_common.py", line 194, in _advance
    self.rk_state = self._adaptive_step(self.rk_state)
  File "lib/site-packages/torchdiffeq/_impl/rk_common.py", line 228, in _adaptive_step
    assert t0 + dt > t0, 'underflow in dt {}'.format(dt.item())
AssertionError: underflow in dt nan
That's definitely not enough detail, so please let me know anything else you need to know.
Thanks in advance!
EDIT: fixed errors.
It looks like your tensor x has the wrong shape. As per here, the tensor must have shape (..., length, channels) -- in particular, it is the penultimate dimension that corresponds to time.
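For instance (a minimal sketch, with placeholder sizes and NaN positions, not your actual data), the layout torchcde expects looks like this:

import torch
import torchcde

batch, length, channels = 32, 50, 6          # e.g. 3 features + 3 mask channels
x = torch.randn(batch, length, channels)
x[0, 10, :3] = float('nan')                  # missing observations are fine

# time is the penultimate dimension, i.e. x is (batch, length, channels)
coeffs = torchcde.hermite_cubic_coefficients_with_backward_differences(x)
X = torchcde.CubicSpline(coeffs)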
Sorry, there were two errors; it should be N x Time_idx x (N_features + N_masks), where Time_idx is the time index, of size Max_series_size.
There are two main possibilities here. The first is that your dynamics are really stiff, or otherwise maladapted to your choice of numerical solver. Try adjusting tolerances, changing the integration method, etc.
The second possibility is that you're passing nan data in somewhere. I suggest running your code in a debugger (python -m pdb -c continue your_script.py) and stepping through the stack trace to see if that's the case.
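For example (a self-contained sketch; the sizes and the toy vector field below are placeholders, not your model), any extra keyword arguments to cdeint are forwarded to torchdiffeq's odeint, so tolerances and the method can be adjusted there:

import torch
import torchcde

class Func(torch.nn.Module):
    # toy vector field: returns a (hidden_channels, input_channels) matrix per sample
    def __init__(self, input_channels, hidden_channels):
        super().__init__()
        self.input_channels = input_channels
        self.hidden_channels = hidden_channels
        self.linear = torch.nn.Linear(hidden_channels, hidden_channels * input_channels)

    def forward(self, t, z):
        return self.linear(z).tanh().view(*z.shape[:-1], self.hidden_channels, self.input_channels)

x = torch.randn(16, 50, 4)
coeffs = torchcde.hermite_cubic_coefficients_with_backward_differences(x)
X = torchcde.CubicSpline(coeffs)
func = Func(input_channels=4, hidden_channels=8)
z0 = torch.randn(16, 8)

zt = torchcde.cdeint(X=X, func=func, z0=z0, t=X.interval,
                     method='dopri5', rtol=1e-3, atol=1e-5)   # looser tolerances than the defaults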
Using autograd.detect_anomaly() around loss.backward() I see:
RuntimeError: Function 'OdeintAdjointMethodBackward' returned nan values in its 2th output.
Segmentation fault (core dumped)
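(For reference, this is roughly how I'm wrapping it; the tiny linear model and random batch below are just placeholders for my actual CDE model and data.)

import torch

model = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

with torch.autograd.detect_anomaly():
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()   # anomaly detection raises as soon as a backward op produces nan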
So it seems to happen during the backward pass rather than the forward pass; then, when the weights are updated, the nans propagate and affect the dt there. torchdiffeq has these solvers:
SOLVERS = {
    'dopri8': Dopri8Solver,
    'dopri5': Dopri5Solver,
    'bosh3': Bosh3Solver,
    'fehlberg2': Fehlberg2,
    'adaptive_heun': AdaptiveHeunSolver,
    'euler': Euler,
    'midpoint': Midpoint,
    'rk4': RK4,
    'explicit_adams': AdamsBashforth,
    'implicit_adams': AdamsBashforthMoulton,
    # Backward compatibility: use the same name as before
    'fixed_adams': AdamsBashforthMoulton,
    # ~Backwards compatibility
    'scipy_solver': ScipyWrapperODESolver,
}
Would you recommend any? Or should I try all of them? I think the default is dopri5.
Right. So via a debugger or otherwise, try and track down where that nan is coming from. Is it a nan arising from e.g. a division by zero, or is it a nan arising from some nan data accidentally getting passed in, etc?
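As a quick first check, something like this (report_nans is just a hypothetical helper, not part of either library) can rule out nan data being fed in:

import torch

def report_nans(name, tensor):
    n = torch.isnan(tensor).sum().item()
    if n:
        print(f"{name}: {n} nan entries")

# call it on whatever actually goes into cdeint, e.g. the interpolation coefficients and z0
report_nans("example", torch.tensor([0.0, float('nan')]))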
In terms of solvers: dopri8 is the highest-order/most-accurate solver torchdiffeq supports. You can try that and see if it helps resolve nans due to stiffness issues. You can also try any of the fixed-step solvers (euler, midpoint, rk4) -- these won't do any adaptive stepping so the actual solution accuracy might be slightly questionable, but in doing so they just entirely ignore stiffness issues, which can at least help diagnose whether that's the root cause.
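To illustrate concretely (a standalone torchdiffeq toy, not your CDE; the same keyword arguments pass through torchcde.cdeint unchanged):

import torch
from torchdiffeq import odeint

def f(t, y):
    return -y          # toy dynamics, just to show the solver options

y0 = torch.tensor([1.0])
t = torch.linspace(0., 1., 5)

y_dopri8 = odeint(f, y0, t, method='dopri8')                              # adaptive, high order
y_rk4 = odeint(f, y0, t, method='rk4', options=dict(step_size=0.01))      # fixed step, ignores stiffness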
OK, thanks for clarifying that! If I remember correctly, I have to add some hooks in the backward of OdeintAdjointMethod, because the debugger can't reach inside it. I'll add info here as I progress, so it might help others with similar issues in the future.
After a couple of days of debugging, I found where the nans come from.
Testing your example, I added this simple print here:
def _advance(self, next_t):
    """Interpolate through the next time point, integrating as necessary."""
    n_steps = 0
    while next_t > self.rk_state.t1:
        assert n_steps < self.max_num_steps, 'max_num_steps exceeded ({}>={})'.format(n_steps, self.max_num_steps)
        print(f"_advance rk_state.t1: {self.rk_state.t1.item()}, rk_state[1].max(): {self.rk_state[1].max()}")  # <------
        self.rk_state = self._adaptive_step(self.rk_state)
        n_steps += 1
    return _interp_evaluate(self.rk_state.interp_coeff, self.rk_state.t0, self.rk_state.t1, next_t)
But when I run my code, there's a divergence in this loop.
Would you happen to know what hyperparameters might help here?
You're probably doing something like using a neural network vector field without a tanh at the end.
Generally speaking, you want to constrain the rate of change of the hidden state, because of issues like the one you're describing. See Section 6.2 of the original nCDE paper.
You were right; I replaced all activations with tanh. I also added this print here:
def integrate(self, t):
    solution = torch.empty(len(t), *self.y0.shape, dtype=self.y0.dtype, device=self.y0.device)
    solution[0] = self.y0
    t = t.to(self.dtype)
    self._before_integrate(t)
    for i in range(1, len(t)):
        print("LOOP INTEGRATE", self.rk_state[1].max(), "\n")  # <-----
        solution[i] = self._advance(t[i])
    return solution
At the 9th epoch, I see this:
The second time it enters integrate, t1 starts really small and the method breaks even before t1 becomes non-negative. Do you think this happens due to the data dynamics?
To be clear, I don't suggest making all activations tanh. Rather, parameterise your vector field as something like tanh(mlp(...)), where mlp is some MLP using e.g. a softplus activation.
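Something along these lines (a sketch; the layer sizes are arbitrary, and the final reshape follows the usual convention of returning a (hidden_channels, input_channels) matrix per sample):

import torch

class CDEFunc(torch.nn.Module):
    def __init__(self, input_channels, hidden_channels):
        super().__init__()
        self.input_channels = input_channels
        self.hidden_channels = hidden_channels
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(hidden_channels, 128),
            torch.nn.Softplus(),                                          # smooth activation inside the MLP
            torch.nn.Linear(128, hidden_channels * input_channels),
        )

    def forward(self, t, z):
        # a single tanh at the very end bounds the vector field, and hence the rate of
        # change of the hidden state
        return self.mlp(z).tanh().view(*z.shape[:-1], self.hidden_channels, self.input_channels)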
It's certainly possible for this kind of thing to happen due to the data dynamics. Are you normalising your input data? (You should be.)
I'd also suggest trying a fixed solver with a small time step, and just making sure that the output values you get then seem sane. If not, then that's an indication of a possible problem independent of anything to do with the numerical integration.
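For normalisation, a minimal sketch that ignores the NaNs marking missing values (x is assumed to be (batch, length, channels)):

import torch

x = torch.randn(32, 50, 3)
x[0, 10, 1] = float('nan')

mask = ~torch.isnan(x)
filled = torch.where(mask, x, torch.zeros_like(x))
count = mask.sum(dim=(0, 1)).clamp(min=1)
mean = filled.sum(dim=(0, 1)) / count
var = ((filled - mean) ** 2 * mask).sum(dim=(0, 1)) / count
x = (x - mean) / (var + 1e-8).sqrt()   # NaNs stay NaN and are still handled by the interpolation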
Are you normalising your input data? (You should be.)
I wasn't 😞. That did solve my problem! Thank you so much for your attention and help with this!
Hi Patrick,
First, I would like to thank you and everyone involved in this package and the related research 🎉! I've been trying to use it, but I'm seeing some nans, and I would really appreciate your (and/or others') insights on this. I would also like to know your thoughts about opening a GH Discussions tab for non-issue-related conversations, which is basically a forum inside GH.
Best regards,