real-stanford / umi-on-legs

UMI on Legs: Making Manipulation Policies Mobile with Manipulation-Centric Whole-body Controllers
https://umi-on-legs.github.io/
MIT License

Issues with Resets and Memory Leak in Default Training #8

Open isaac-racine opened 2 weeks ago

isaac-racine commented 2 weeks ago

Hello,

I'm testing the default training configuration "combo_go2ARX5_pickle_reaching_extreme" and have run into some issues that I could use help with.

Expected Training Outcome: Without modifying the code, should the robot be able to follow the tossing end-effector (EE) trajectory? For me, at around the 200-iteration mark, the robots start resetting instantly, seemingly due to a termination criterion. This behavior continues until the end of the full 20,000 iterations, so the results are not good.

GPU Memory Leak: Also starting at around the 200-iteration mark, GPU memory usage steadily increases over several hundred iterations until training crashes. I made a modification in env.py to address this:

Original:

    self.obs_history = torch.cat((self.obs_history[:, 1:, :], obs.unsqueeze(1)), dim=1)

Modified:

    self.obs_history[:, :-1, :] = self.obs_history[:, 1:, :]
    self.obs_history[:, -1, :] = obs

This change seems to prevent the memory increase, but training results remain the same. Do you have any insights on this issue?
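For reference, here is a self-contained version of the two update patterns above (standalone PyTorch with illustrative shapes, not the repository's env.py):

    import torch

    num_envs, history_len, obs_dim = 8, 5, 3
    obs_history = torch.zeros(num_envs, history_len, obs_dim)
    obs = torch.randn(num_envs, obs_dim)

    # Original pattern: torch.cat allocates a brand-new history tensor every step.
    obs_history = torch.cat((obs_history[:, 1:, :], obs.unsqueeze(1)), dim=1)

    # Modified pattern: shift the window and write the newest observation in place,
    # reusing the buffer that was allocated once up front.
    obs_history[:, :-1, :] = obs_history[:, 1:, :]
    obs_history[:, -1, :] = obs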

System Specs:

OS: Ubuntu 22.04
GPU: NVIDIA GeForce RTX 4090 (24 GB memory)
Environment: Miniconda3 with IsaacGym_Preview_4

Thank you very much for your help!

huy-ha commented 2 weeks ago

I don't expect the instant resets. Just to clarify: you are using the exact code currently on master with no modifications, the combo_go2ARX5_pickle_reaching_extreme configuration without overriding any hyperparameters other than those included in the default command, and our task trajectory dataset?

For clarity, this is the default command with the default overrides I provided in the README:

python scripts/train.py env.sim_device=cuda:0 env.graphics_device_id=0 env.tasks.reaching.sequence_sampler.file_path=data/tossing.pkl

For the GPU memory leak, I've observed that Isaac Gym leaks memory due to contacts. It can happen much, much later in training (say, 20k iterations), but 200 is way too early.
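If you want to narrow down where the growth is coming from, one rough option (just a sketch, not something in this repo's scripts) is to log both what PyTorch's allocator tracks and the device-level usage each iteration; if the device-level number grows while the allocator numbers stay flat, the growth is on the simulation side rather than the policy side:

    import torch

    def log_gpu_memory(step, device="cuda:0"):
        # Memory tracked by PyTorch's caching allocator (policy, optimizer, buffers).
        allocated_gb = torch.cuda.memory_allocated(device) / 1e9
        reserved_gb = torch.cuda.memory_reserved(device) / 1e9
        # Total memory in use on the device, which also includes Isaac Gym's
        # simulation buffers that the PyTorch allocator does not see.
        free_b, total_b = torch.cuda.mem_get_info(device)
        print(f"iter {step}: torch allocated {allocated_gb:.2f} GB, "
              f"reserved {reserved_gb:.2f} GB, device used {(total_b - free_b) / 1e9:.2f} GB")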

I have trained on a very similar if not identical system setup before, so I don't believe it's a systems issue.

isaac-racine commented 2 weeks ago

Thank you for the fast reply!

Yes, I am using the current code on the main branch with no modifications, with your tossing.pkl dataset, and with the default command you showed. I am going to try it on a different PC with a fresh Ubuntu 22.04 install and let you know if it works properly.

isaac-racine commented 1 week ago

Update

So I tried running the training on a new PC and the results are the same. Without changing the code, the GPU memory starts to increase and the training crashes shortly after 500 iterations. The training seems to go well before that, so it must be the contacts.

yolo01826 commented 1 week ago

[screenshot] I encountered the same issue in the cup placement task without modifying the code. (Ubuntu 20.04, RTX 4090)

huy-ha commented 1 week ago

Ah! When using the cup-in-the-wild trajectories, you should add some z height to the trajectory. For instance, adding 'env.tasks.reaching.sequence_sampler.add_random_height_range=[0.4,0.5]' to the command randomly adds 40 cm to 50 cm to the trajectory's z coordinate, which is more realistic for this particular task. This also prevents the robot from crawling on the ground the entire time, which should keep the contact buffers from taking up so much memory.
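Appended to the default training command, that would look something like the following (the trajectory pickle path is a placeholder for your cup-in-the-wild dataset, and the bracketed override may need quoting depending on your shell):

    python scripts/train.py env.sim_device=cuda:0 env.graphics_device_id=0 \
        env.tasks.reaching.sequence_sampler.file_path=data/<your_cup_trajectories>.pkl \
        'env.tasks.reaching.sequence_sampler.add_random_height_range=[0.4,0.5]'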

yolo01826 commented 1 week ago

Thanks a lot, it works! 🌹