Memory leak in offroad_gator.py?

pgupta2050 commented 3 months ago

Hello,

I noticed that if I use vehicle models with ChronoBaseEnv, there is a memory leak in the vectorized environments. I have seen this with offroad_gator.py and with an env I created using the hmmwv vehicle. The individual environments keep growing their memory usage. It looks as if some environment resources are not being released at env.reset(). Other envs such as cobra_wpts.py don't seem to leak memory, however. Is this a known issue? Does anyone have thoughts about debugging this? Maybe it has something to do with the gym.Env.close() implementation?

I am using the top utility to monitor memory usage. The /usr/bin/python3 -c from multiprocessing.forkserver ... processes are the ones that grow in memory usage until the system runs out of memory and crashes. An example of where I monitor this (in this case, I just re-ran cobra_wpts_train.py to reproduce results):

top - 11:17:39 up 18:46,  1 user,  load average: 18.79, 16.91, 15.91
Tasks: 1366 total,   6 running, 1360 sleeping,   0 stopped,   0 zombie
%Cpu(s): 15.7 us,  0.0 sy,  0.0 ni, 84.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 515580.9 total, 488939.0 free,  15644.0 used,  10997.9 buff/cache
MiB Swap:   2048.0 total,   1606.5 free,    441.5 used. 496004.3 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                                                                                                                                                        
  81225 cra       20   0 1645756   1.5g  15624 S   0.0   0.3   0:26.70 /home/cra/.vscode/extensions/ms-vscode.cpptools-1.19.9-linux-x64/bin/cpptools                                                                                                                                                                                                                  
  66581 cra       20   0   21.1g   1.2g 402612 S   0.3   0.2  25:33.27 python3 cobra_wpts_train.py                                                                                                                                                                                                                                                                    
  69302 cra       20   0 1125.8g 718384  63300 S   0.0   0.1   7:03.47 /snap/code/155/usr/share/code/code /home/cra/.vscode/extensions/ms-python.vscode-pylance-2024.4.1/dist/server.bundle.js --cancellationReceive=file:bc412ecaf76652112fa93809a8f24e90483ce3ef52 --node-ipc --clientProcessId=68584                                                               
  84540 cra       20   0 4117136 591932 321480 S   0.3   0.1   0:37.34 /usr/lib/firefox/firefox -new-window                                                                                                                                                                                                                                                           
  68457 cra       20   0 1136.1g 550088 105772 S   0.0   0.1  12:15.34 /snap/code/155/usr/share/code/code --type=renderer --crashpad-handler-pid=68425 --enable-crash-reporter=c6aaa28c-db04-43a0-a4e3-9668ea18dfb3,no_channel --user-data-dir=/home/cra/.config/Code --standard-schemes=vscode-webview,vscode-file --secure-schemes=vscode-webview,vscode-file --co+ 
  66660 cra       20   0 5590972 534512 234492 S  77.9   0.1 207:05.27 /usr/bin/python3 -c from multiprocessing.forkserver import main; main(52, 54, ['__main__'], **{'sys_path': ['/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono', '/home/cra/chrono-ws/chrono_build/bin', '+ 
  66658 cra       20   0 5590212 533708 234344 S  89.4   0.1 209:01.33 /usr/bin/python3 -c from multiprocessing.forkserver import main; main(52, 54, ['__main__'], **{'sys_path': ['/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono', '/home/cra/chrono-ws/chrono_build/bin', '+ 
  66671 cra       20   0 5590656 533524 233924 S  87.5   0.1 203:24.93 /usr/bin/python3 -c from multiprocessing.forkserver import main; main(52, 54, ['__main__'], **{'sys_path': ['/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono', '/home/cra/chrono-ws/chrono_build/bin', '+ 
  66652 cra       20   0 5590448 533512 233764 S  91.7   0.1 219:50.28 /usr/bin/python3 -c from multiprocessing.forkserver import main; main(52, 54, ['__main__'], **{'sys_path': ['/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono', '/home/cra/chrono-ws/chrono_build/bin', '+ 
  66683 cra       20   0 5591232 533460 234220 R  87.5   0.1 204:19.76 /usr/bin/python3 -c from multiprocessing.forkserver import main; main(52, 54, ['__main__'], **{'sys_path': ['/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono', '/home/cra/chrono-ws/chrono_build/bin', '+ 
  66664 cra       20   0 5590984 533380 233356 S  68.6   0.1 205:28.67 /usr/bin/python3 -c from multiprocessing.forkserver import main; main(52, 54, ['__main__'], **{'sys_path': ['/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono', '/home/cra/chrono-ws/chrono_build/bin', '+ 
  66657 cra       20   0 5590208 533320 233960 S  92.7   0.1 208:44.55 /usr/bin/python3 -c from multiprocessing.forkserver import main; main(52, 54, ['__main__'], **{'sys_path': ['/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono', '/home/cra/chrono-ws/chrono_build/bin', '+ 
  66649 cra       20   0 5591316 533312 232844 R  95.0   0.1 223:04.27 /usr/bin/python3 -c from multiprocessing.forkserver import main; main(52, 54, ['__main__'], **{'sys_path': ['/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono', '/home/cra/chrono-ws/chrono_build/bin', '+ 
  66670 cra       20   0 5591084 533012 232884 S  78.9   0.1 204:00.27 /usr/bin/python3 -c from multiprocessing.forkserver import main; main(52, 54, ['__main__'], **{'sys_path': ['/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono', '/home/cra/chrono-ws/chrono_build/bin', '+ 
  66659 cra       20   0 5590056 532832 233632 S  77.6   0.1 208:21.62 /usr/bin/python3 -c from multiprocessing.forkserver import main; main(52, 54, ['__main__'], **{'sys_path': ['/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono', '/home/cra/chrono-ws/chrono_build/bin', '+ 
  66663 cra       20   0 5589696 532388 233960 S  72.3   0.1 206:55.54 /usr/bin/python3 -c from multiprocessing.forkserver import main; main(52, 54, ['__main__'], **{'sys_path': ['/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono', '/home/cra/chrono-ws/chrono_build/bin', '+ 
  66653 cra       20   0 5589676 532336 234452 S  93.7   0.1 220:54.75 /usr/bin/python3 -c from multiprocessing.forkserver import main; main(52, 54, ['__main__'], **{'sys_path': ['/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono', '/home/cra/chrono-ws/chrono_build/bin', '+ 
  66666 cra       20   0 5590196 532204 232860 S  86.5   0.1 204:40.95 /usr/bin/python3 -c from multiprocessing.forkserver import main; main(52, 54, ['__main__'], **{'sys_path': ['/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono', '/home/cra/chrono-ws/chrono_build/bin', '+ 
  66651 cra       20   0 5590044 532104 232916 S  91.1   0.1 219:27.75 /usr/bin/python3 -c from multiprocessing.forkserver import main; main(52, 54, ['__main__'], **{'sys_path': ['/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono', '/home/cra/chrono-ws/chrono_build/bin', '+ 
  66655 cra       20   0 5589284 531272 233912 R  80.5   0.1 217:58.57 /usr/bin/python3 -c from multiprocessing.forkserver import main; main(52, 54, ['__main__'], **{'sys_path': ['/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono', '/home/cra/chrono-ws/chrono_build/bin', '+ 
  66662 cra       20   0 5588768 531052 233708 S  90.8   0.1 206:56.47 /usr/bin/python3 -c from multiprocessing.forkserver import main; main(52, 54, ['__main__'], **{'sys_path': ['/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono', '/home/cra/chrono-ws/chrono_build/bin', '+ 
  66668 cra       20   0 5589972 530908 231796 S  69.3   0.1 204:51.29 /usr/bin/python3 -c from multiprocessing.forkserver import main; main(52, 54, ['__main__'], **{'sys_path': ['/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono', '/home/cra/chrono-ws/chrono_build/bin', '+ 
  66656 cra       20   0 5590052 530280 231084 S  93.1   0.1 212:47.25 /usr/bin/python3 -c from multiprocessing.forkserver import main; main(52, 54, ['__main__'], **{'sys_path': ['/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono', '/home/cra/chrono-ws/chrono_build/bin', '+ 
  66667 cra       20   0 5588568 530136 233784 S  67.3   0.1 203:21.30 /usr/bin/python3 -c from multiprocessing.forkserver import main; main(52, 54, ['__main__'], **{'sys_path': ['/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono', '/home/cra/chrono-ws/chrono_build/bin', '+ 
  66661 cra       20   0 5588812 529720 233124 R  86.8   0.1 206:36.71 /usr/bin/python3 -c from multiprocessing.forkserver import main; main(52, 54, ['__main__'], **{'sys_path': ['/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono', '/home/cra/chrono-ws/chrono_build/bin', '+ 
  66654 cra       20   0 5588412 529376 233188 S  80.2   0.1 218:59.53 /usr/bin/python3 -c from multiprocessing.forkserver import main; main(52, 54, ['__main__'], **{'sys_path': ['/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono', '/home/cra/chrono-ws/chrono_build/bin', '+ 
  66669 cra       20   0 5590308 528620 229164 S  79.5   0.1 204:37.67 /usr/bin/python3 -c from multiprocessing.forkserver import main; main(52, 54, ['__main__'], **{'sys_path': ['/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono', '/home/cra/chrono-ws/chrono_build/bin', '+ 
  66665 cra       20   0 5588840 527992 230576 S  76.6   0.1 204:15.16 /usr/bin/python3 -c from multiprocessing.forkserver import main; main(52, 54, ['__main__'], **{'sys_path': ['/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono', '/home/cra/chrono-ws/chrono_build/bin', '+ 
  66650 cra       20   0 5585680 526760 233088 R  92.7   0.1 221:35.81 /usr/bin/python3 -c from multiprocessing.forkserver import main; main(52, 54, ['__main__'], **{'sys_path': ['/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono/gym_chrono/train', '/home/cra/chrono-ws/gym-chrono', '/home/cra/chrono-ws/chrono_build/bin', '+ 
   3793 cra       20   0 4637840 379480  98272 S   1.7   0.1   8:24.79 /usr/bin/gnome-shell
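
To track the RES column over time instead of watching top, the same number can be read from /proc on Linux. The following is only a minimal sketch: the helper names (parse_rss_kib, read_rss_kib, growth_kib) are made up for illustration and are not part of gym-chrono, and the worker PIDs would have to be collected separately.

```python
import re
from typing import Dict, List, Optional

# /proc/<pid>/status reports resident memory on a line like "VmRSS:   533708 kB".
_VMRSS = re.compile(r"^VmRSS:\s+(\d+)\s+kB", re.MULTILINE)


def parse_rss_kib(status_text: str) -> Optional[int]:
    """Return the VmRSS value in KiB from the text of /proc/<pid>/status."""
    match = _VMRSS.search(status_text)
    return int(match.group(1)) if match else None


def read_rss_kib(pid: int) -> Optional[int]:
    """Read the resident memory of a live process; None if it cannot be read."""
    try:
        with open(f"/proc/{pid}/status", encoding="utf-8", errors="replace") as handle:
            return parse_rss_kib(handle.read())
    except OSError:
        # Process has exited, or /proc is not available on this platform.
        return None


def growth_kib(samples: Dict[int, List[int]]) -> Dict[int, int]:
    """Last-minus-first RSS per PID, for PIDs with at least two samples."""
    return {
        pid: values[-1] - values[0]
        for pid, values in samples.items()
        if len(values) >= 2
    }
```

Calling read_rss_kib for each forkserver worker PID every few minutes and passing the collected samples to growth_kib gives a per-worker growth figure that is easier to compare between environments than successive top screenshots.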

My training and system info:

- OS: Linux-5.15.0-101-generic-x86_64-with-glibc2.29 # 111~20.04.1-Ubuntu SMP Mon Mar 11 15:44:43 UTC 2024
- Python: 3.8.10
- Stable-Baselines3: 2.2.1
- PyTorch: 2.2.1+cu121
- GPU Enabled: True
- Numpy: 1.24.0
- Cloudpickle: 3.0.0
- Gymnasium: 0.29.1

Thanks.

pgupta2050 commented 3 months ago

Tagging @zzhou292 @Huzaifg in case notifications are turned off.

zzhou292 commented 3 months ago

Yes, we have noticed the same issue with the gator demo. The memory leak originates in Project Chrono and is not specific to gym-chrono. It only seems to happen when chrono::sensor is enabled, which points to a memory leak in chrono::sensor.

We are investigating a potential fix. For now, automatically relaunching the training and reloading the neural network might be the only solution.

pgupta2050 commented 3 months ago

Thanks @zzhou292. I do have Chrono set up and built with Chrono::Sensor. But in some gym-chrono training environments with Chrono::Vehicle, I do not import Chrono::Sensor, and I still see memory leaking (with a larger leak if I use the SCM terrain). Would the memory leak still happen if I don't import the Chrono::Sensor module? I would not have expected that if the sensor module were at fault. What do you think?

Huzaifg commented 3 months ago

@pgupta2050 We have been running some tests again and you are right, the leak is not specific to Chrono::Sensor. However, Chrono::Sensor is the one that contributes the most. We have also found that it's not just because of the Python wrapper; it is an existing problem in the C++ Project Chrono module itself. This kind of problem only shows up in RL-type workloads, where we repeatedly destroy and reinitialize the Chrono system.

So in short, you will still face leaks even if you don't import Chrono::Sensor, albeit smaller ones.

For next steps, as @zzhou292 suggested, our current solution is babysitting the RL runs: save checkpoints and reload every time we are close to the memory limit. @StefanCaldararu made a script for that - mind adding it here?
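
For the save-and-reload part, a training entry point might look roughly like the sketch below. It is not the script mentioned above. It assumes Stable-Baselines3 PPO (the thread does not say which algorithm is used), and the --resume, --save-to and --steps flags as well as the make_env factory are hypothetical names chosen for illustration.

```python
import argparse
from typing import Callable, List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the (hypothetical) flags for one training segment."""
    parser = argparse.ArgumentParser(description="Train one segment, then exit.")
    parser.add_argument("--resume", default=None,
                        help="checkpoint to load before training, if any")
    parser.add_argument("--save-to", required=True,
                        help="where to write the checkpoint when the segment ends")
    parser.add_argument("--steps", type=int, default=100_000,
                        help="timesteps to train in this segment")
    return parser.parse_args(argv)


def train_segment(args: argparse.Namespace, make_env: Callable[[], object]) -> None:
    """Train for args.steps timesteps, starting from args.resume when given."""
    # Imported lazily so that argument parsing works without SB3 installed.
    from stable_baselines3 import PPO  # assumption: PPO is the algorithm in use

    env = make_env()  # hypothetical factory returning the (vectorized) env
    if args.resume is None:
        model = PPO("MlpPolicy", env)
    else:
        model = PPO.load(args.resume, env=env)

    # Keep the timestep counter running across segments instead of restarting at 0.
    model.learn(total_timesteps=args.steps, reset_num_timesteps=False)
    model.save(args.save_to)
    env.close()
```

Because each segment ends with the process exiting, whatever memory the simulation leaked during that segment is returned to the operating system before the next segment starts.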

With regard to a fix for the memory leak: this will require a lot of work and I don't think we will be able to pull it off any time soon. Additionally, I don't think we can guarantee a complete fix, since we have noticed that a lot of the Chrono::Sensor leaks come from the third-party ray tracing library we use, OptiX.

pgupta2050 commented 3 months ago

OK, got it. Thanks for checking it out at your end!

StefanCaldararu commented 3 months ago

@pgupta2050 Sorry, I was locked out of my account for a bit. The script I wrote to avoid some of these issues can be found here.

You also need to modify the training script to take the loaded checkpoint as an argument, as done here.

All it does is fully reset the environment / process being run every few checkpoints.
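
The general idea can be sketched as follows. This is not the linked script, only an illustration of the relaunch loop under assumptions: the training script is assumed to accept hypothetical --save-to, --resume and --steps flags, and the checkpoint file names are made up.

```python
import subprocess
import sys
from typing import Callable, List, Optional


def segment_command(train_script: str, save_to: str,
                    resume: Optional[str], steps: int) -> List[str]:
    """Build the command line for one training segment (flags are hypothetical)."""
    cmd = [sys.executable, train_script, "--save-to", save_to, "--steps", str(steps)]
    if resume is not None:
        cmd += ["--resume", resume]
    return cmd


def run_segments(train_script: str, n_segments: int, steps: int,
                 launch: Optional[Callable[[List[str]], object]] = None) -> Optional[str]:
    """Run training as a chain of short-lived processes.

    Each segment runs in a fresh process, so memory leaked by the simulation
    is released when that process exits. Returns the last checkpoint path.
    """
    if launch is None:
        # check=True stops the chain if a segment exits with an error.
        def launch(cmd: List[str]) -> object:
            return subprocess.run(cmd, check=True)

    resume: Optional[str] = None
    for index in range(n_segments):
        save_to = f"checkpoint_{index:03d}.zip"
        launch(segment_command(train_script, save_to, resume, steps))
        resume = save_to  # the next segment starts from this checkpoint
    return resume
```

The segment length is the knob to tune: it should be short enough that a single segment never reaches the memory limit, and long enough that process startup and checkpoint loading stay a small fraction of the run.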

projectchrono / gym-chrono

Memory leak in offroad_gator.py? #14