Open pgupta2050 opened 3 months ago
Tagging @zzhou292 @Huzaifg in case notifications are turned off.
Yes, we have noticed the same issue with the gator demo. The memory leak originates in Project Chrono and is not specific to gym-chrono. It only seems to happen when chrono::sensor is enabled, which points to a leak in chrono::sensor.
We are investigating a potential fix. For now, automatically relaunching the run and reloading the neural network might be the only solution.
Thanks @zzhou292. I do have Chrono set up and built with Chrono::Sensor. But in some gym-chrono training environments with Chrono::Vehicle, I do not import Chrono::Sensor, and I still see memory leaking (a larger leak if I use the SCM terrain). Would the memory leak still happen if I don't import the Chrono::Sensor module? I would not have expected that if the sensor module were at fault. What do you think?
@pgupta2050 We have been running some tests again and you are right, the leak is not specific to Chrono::Sensor. However, Chrono::Sensor is the largest contributor. We have also found that it's not just caused by the Python wrapper; the problem exists in the C++ Project Chrono module itself. It only shows up in RL-style workloads, where we repeatedly destroy and reinitialize the Chrono system.
So in short, you will still face leaks even if you don't import Chrono::Sensor, albeit smaller ones.
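To quantify how much memory each destroy/reinitialize cycle leaves behind, something like the following can help. It is only a minimal sketch, assuming Linux (it reads `/proc/self/statm`); `rss_bytes`, `measure_growth`, and the stand-in `leaky_cycle` workload are hypothetical names, not part of Chrono or gym-chrono. Replace the stand-in with code that builds and tears down a Chrono system, or calls `env.reset()`.

```python
import gc
import os


def rss_bytes() -> int:
    """Current resident set size of this process in bytes (Linux only)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])  # second field: resident pages
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def measure_growth(cycle, cycles: int = 50) -> list[int]:
    """Call cycle() repeatedly and return RSS growth (bytes) after each call."""
    gc.collect()
    baseline = rss_bytes()
    growth = []
    for _ in range(cycles):
        cycle()
        # Collect first, so objects Python has merely not freed yet do not
        # show up as a leak; what remains is held by something else.
        gc.collect()
        growth.append(rss_bytes() - baseline)
    return growth


if __name__ == "__main__":
    leaked = []

    def leaky_cycle():
        # Stand-in workload (an assumption for illustration): keeps 1 MB
        # alive per call, mimicking resources that are never released.
        leaked.append(bytearray(b"x" * 1_000_000))

    deltas = measure_growth(leaky_cycle, cycles=20)
    print(f"RSS growth after 20 cycles: {deltas[-1] / 1e6:.1f} MB")
```

A growth curve that keeps climbing with the number of cycles, rather than flattening after the first few, suggests memory that is not released on teardown. Because the numbers come from the OS rather than from Python's allocator, leaks inside native code are included.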
For next steps, as @zzhou292 suggested, our current solution is babysitting the RL runs: save checkpoints and reload every time we are close to the memory limit. @StefanCaldararu made a script for that. Mind adding it here?
Regarding a fix for the memory leak: this will require a lot of work, and I don't think we will be able to pull it off any time soon. Additionally, I don't think we can guarantee a complete fix, since we have noticed that a lot of the Chrono::Sensor leaks come from the third-party ray tracing library we use, OptiX.
OK, got it. Thanks for checking it out at your end!
@pgupta2050 Sorry, I was locked out of my account for a bit. The script I wrote to avoid some of these issues can be found here
You also need to modify the training script to take the loaded checkpoint as an argument, as done here
All it does is fully reset the environment / process being run every few checkpoints.
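For readers without access to the linked script, the general idea can be sketched as below. This is not the script referenced above, only an illustration under stated assumptions: the training script is assumed to accept the hypothetical flags `--iterations`, `--checkpoint-dir`, and `--load`, and checkpoints are assumed to be `.zip` files in a single directory.

```python
import subprocess
import sys
from pathlib import Path


def latest_checkpoint(ckpt_dir: Path, pattern: str = "*.zip"):
    """Return the most recently modified checkpoint, or None if none exist."""
    ckpts = sorted(ckpt_dir.glob(pattern), key=lambda p: p.stat().st_mtime)
    return ckpts[-1] if ckpts else None


def build_command(train_script: str, ckpt_dir: Path, iters: int) -> list[str]:
    """Build the command line for one training launch.

    All flag names here are hypothetical; adapt them to the training script.
    """
    cmd = [sys.executable, train_script,
           "--iterations", str(iters),
           "--checkpoint-dir", str(ckpt_dir)]
    ckpt = latest_checkpoint(ckpt_dir)
    if ckpt is not None:
        cmd += ["--load", str(ckpt)]  # resume from the newest checkpoint
    return cmd


def supervise(train_script: str, ckpt_dir: Path, launches: int, iters: int) -> None:
    """Run training as a sequence of short-lived processes."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    for launch in range(launches):
        cmd = build_command(train_script, ckpt_dir, iters)
        print(f"launch {launch}: {' '.join(cmd)}")
        # When the child exits, the OS reclaims all of its memory,
        # including anything leaked by native code.
        subprocess.run(cmd, check=True)
```

Calling `supervise("train.py", Path("checkpoints"), launches=10, iters=5)` would run ten fresh processes of five iterations each, with `train.py` standing in for the actual training script. Choose `iters` small enough that a single launch stays below the memory limit.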
Hello,
I noticed that if I use vehicle models with the ChronoBaseEnv(), there is a memory leak in the vectorized environments. I have noticed this for offroad_gator.py and an env I created using the hmmwv vehicle. The individual environments keep growing their memory usage. It seems as if some environment resources are not being released at `env.reset()`. The other env models such as `cobra_wpts.py` don't seem to leak memory, however. Is this a known issue? Does anyone have thoughts about debugging this? Maybe something to do with the `gym.Env.close()` implementation?

I am using the `top` utility to monitor the memory usage, and the `/usr/bin/python3 -c from multiprocessing.forkserver ...` processes are the ones that grow in memory usage until the system runs out of memory and crashes. An example of where I monitor this (in this case, just re-ran cobra_wpts_train.py to reproduce results):

My training and system info:
Thanks.