thaipduong / LieGroupHamDL


Odeint returns nan or error #1

Closed by allegorywrite 4 months ago

allegorywrite commented 5 months ago

Hi! When I run train_pend_SO3_friction.py, odeint returns NaN, and the loss also becomes NaN. Furthermore, when I change the odeint method to dopri5, the following error occurs:

assert t0 + dt > t0, 'underflow in dt {}'.format(dt.item())

Is this a known issue, or is it a bug that can be ignored? It may be due to my environment: I am running torchdiffeq 0.2.3 on Ubuntu 20.04 with Python 3.10. Thank you in advance for your answer.
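For readers wondering what the assertion checks: `t0 + dt > t0` can only fail when the step `dt` is zero, negative, NaN, or so small relative to `t0` that adding it does not change `t0` at the working precision. The snippet below is a minimal sketch of the last case using NumPy scalars. The helper name `step_advances` and the sample values are made up for illustration and are not taken from torchdiffeq.

```python
import numpy as np

def step_advances(t0, dt, dtype):
    """Return True if t0 + dt is strictly greater than t0 at the given precision."""
    t0 = dtype(t0)
    dt = dtype(dt)
    return bool(t0 + dt > t0)

# Hypothetical values: a step of 1e-8 taken at time t0 = 1.0.
# float32 has a spacing of about 1.2e-7 near 1.0, so the step is rounded away.
ok32 = step_advances(1.0, 1e-8, np.float32)
# float64 has a spacing of about 2.2e-16 near 1.0, so the step still registers.
ok64 = step_advances(1.0, 1e-8, np.float64)

print("float32 step advances:", ok32)
print("float64 step advances:", ok64)
```

An adaptive solver that keeps shrinking its step (for example because the dynamics have started producing very large or NaN values) can plausibly end up in this regime, which would be consistent with both symptoms reported here.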

thaipduong commented 5 months ago

Hello, thank you for reaching out. I haven't tried it with dopri5 before, so I will need to investigate that error.

Regarding the NaN error, which torch version did you use? With torch 1.10 or higher, we run into NaN errors due to numerical issues with float32. You might need to switch to float64 by changing the default as follows, though training will be slower:

parser.add_argument('--float', default=64, type=int, help='number of gradient steps')
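As a rough illustration of what such a flag typically controls, here is a minimal sketch that maps a `--float` argument to PyTorch's default dtype. How the repository's training script actually consumes the flag is not shown in this thread, so `make_parser` and `configure_precision` are hypothetical helpers, and the `choices` restriction and help text are assumptions.

```python
import argparse
import torch

def make_parser():
    # Hypothetical parser: only the '--float' flag name and its integer type
    # come from the thread; everything else is an assumption.
    parser = argparse.ArgumentParser()
    parser.add_argument('--float', default=64, type=int, choices=[32, 64],
                        help='floating-point precision in bits')
    return parser

def configure_precision(bits):
    """Set the default dtype for newly created floating-point tensors."""
    dtype = torch.float64 if bits == 64 else torch.float32
    torch.set_default_dtype(dtype)
    return dtype

# Parse an empty argument list so the default (64) is used.
args = make_parser().parse_args([])
dtype = configure_precision(args.float)

# Tensors created without an explicit dtype now use double precision.
x = torch.zeros(3)
print(x.dtype)
```

Note that tensors loaded from a dataset keep their own dtype, so training data may also need an explicit `.double()` conversion for the whole pipeline to run in float64.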

Another option is to switch to torchode (e.g. for quadrotor: https://github.com/thaipduong/LieGroupHamDL/blob/PX4/training/examples/quadrotor_px4/train_quadrotor_SE3_PX4.py). We actually find training with torchode much more stable than with torchdiffeq.

allegorywrite commented 5 months ago

Thank you! Changing the float setting to 64 fixed train_pend_SO3_friction.py, but it did not help for my custom environment (quadrotor), even with float64. So I will try torchode first.

thaipduong commented 5 months ago

For the quadrotor, please make sure that the linear and angular velocities are expressed in the body frame. If your custom environment returns world-frame velocities, you might want to transform them to the body frame, because the Hamiltonian dynamics we use are formulated with body-frame velocities. Otherwise, training might struggle to converge to a correct model, sometimes leading to NaN errors.
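The world-to-body conversion mentioned above can be sketched as follows. This assumes the environment provides a rotation matrix `R` that maps body-frame vectors to world-frame vectors (so `v_world = R @ v_body`); if your environment uses the opposite convention, drop the transpose. The function name `world_to_body` and the sample values are illustrative and are not part of the repository.

```python
import numpy as np

def world_to_body(R, v_world, w_world):
    """Rotate world-frame linear and angular velocity into the body frame.

    Assumes R maps body-frame vectors to the world frame, so the inverse
    rotation is R.T (rotation matrices are orthogonal).
    """
    R = np.asarray(R, dtype=np.float64)
    v_body = R.T @ np.asarray(v_world, dtype=np.float64)
    w_body = R.T @ np.asarray(w_world, dtype=np.float64)
    return v_body, w_body

# Example: the body is yawed 90 degrees, so its x-axis points along world y.
R_yaw90 = np.array([[0.0, -1.0, 0.0],
                    [1.0,  0.0, 0.0],
                    [0.0,  0.0, 1.0]])

# Moving along world y means moving along the body's own x-axis.
v_body, w_body = world_to_body(R_yaw90, [0.0, 1.0, 0.0], [0.0, 0.0, 0.5])
print(v_body, w_body)
```

A quick sanity check for a custom environment is to confirm that rotating the converted velocity back with `R` reproduces the original world-frame values.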