leggedrobotics / legged_gym

Isaac Gym Environments for Legged Robots

Out of memory when training #5

Open chengxuxin opened 2 years ago

chengxuxin commented 2 years ago

The training command does not work on my laptop with --sim_device=cuda; it works with --sim_device=cpu. I tried using only one environment, but nothing changed.

OS: Ubuntu 21.04
NVIDIA driver: 470.82.00
GPU: RTX 3060 Laptop
PyTorch: 1.10.0+cu113

(issac) cxx@cxx:~/Documents/Isaac/legged_gym$ python legged_gym/scripts/train.py --task=anymal_c_flat --num_envs 1 --sim_device=cuda --rl_device=cuda
Importing module 'gym_38' (/home/cxx/Documents/Isaac/isaacgym/python/isaacgym/_bindings/linux-x86_64/gym_38.so)
Setting GYM_USD_PLUG_INFO_PATH to /home/cxx/Documents/Isaac/isaacgym/python/isaacgym/_bindings/linux-x86_64/usd/plugInfo.json
PyTorch version 1.10.0+cu113
Device count 1
/home/cxx/Documents/Isaac/isaacgym/python/isaacgym/_bindings/src/gymtorch
Using /home/cxx/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Emitting ninja build file /home/cxx/.cache/torch_extensions/py38_cu113/gymtorch/build.ninja...
Building extension module gymtorch...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module gymtorch...
Setting seed: 1
Not connected to PVD
+++ Using GPU PhysX
Physics Engine: PhysX
Physics Device: cuda:0
GPU Pipeline: enabled
/home/cxx/anaconda3/envs/issac/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
[Error] [carb.gym.plugin] Gym cuda error: out of memory: ../../../source/plugins/carb/gym/impl/Gym/GymPhysX.cpp: 1718
[Error] [carb.gym.plugin] Gym cuda error: invalid resource handle: ../../../source/plugins/carb/gym/impl/Gym/GymPhysX.cpp: 6003
[Error] [carb.gym.plugin] Gym cuda error: out of memory: ../../../source/plugins/carb/gym/impl/Gym/GymPhysXCuda.cu: 991
[Error] [carb.gym.plugin] Gym cuda error: invalid resource handle: ../../../source/plugins/carb/gym/impl/Gym/GymPhysX.cpp: 5859
Traceback (most recent call last):
  File "legged_gym/scripts/train.py", line 47, in <module>
    train(args)
  File "legged_gym/scripts/train.py", line 41, in train
    env, env_cfg = task_registry.make_env(name=args.task, args=args)
  File "/home/cxx/Documents/Isaac/legged_gym/legged_gym/utils/task_registry.py", line 97, in make_env
    env = task_class(   cfg=env_cfg,
  File "/home/cxx/Documents/Isaac/legged_gym/legged_gym/envs/anymal_c/anymal.py", line 49, in __init__
    super().__init__(cfg, sim_params, physics_engine, sim_device, headless)
  File "/home/cxx/Documents/Isaac/legged_gym/legged_gym/envs/base/legged_robot.py", line 75, in __init__
    self._init_buffers()
  File "/home/cxx/Documents/Isaac/legged_gym/legged_gym/envs/anymal_c/anymal.py", line 63, in _init_buffers
    super()._init_buffers()
  File "/home/cxx/Documents/Isaac/legged_gym/legged_gym/envs/base/legged_robot.py", line 505, in _init_buffers
    self.gravity_vec = to_torch(get_axis_params(-1., self.up_axis_idx), device=self.device).repeat((self.num_envs, 1))
  File "/home/cxx/Documents/Isaac/isaacgym/python/isaacgym/torch_utils.py", line 16, in to_torch
    return torch.tensor(x, dtype=dtype, device=device, requires_grad=requires_grad)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
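(For reference, the debugging hint on the last line of the log can be applied by prefixing the same command, assuming a bash-like shell; it makes kernel launches synchronous so the reported stack trace points at the call that actually failed:)

CUDA_LAUNCH_BLOCKING=1 python legged_gym/scripts/train.py --task=anymal_c_flat --num_envs 1 --sim_device=cuda --rl_device=cuda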
fanshi14 commented 2 years ago

You could reduce num_envs in legged_gym/envs/anymal_c/flat/anymal_c_flat_config.py. The default is num_envs = 4096; you could try 2048 or 1024 to save memory. Just set num_envs = 1024 there.
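A minimal sketch of the override; the exact base-class names are my assumption, following the config pattern used elsewhere in the repo:

# legged_gym/envs/anymal_c/flat/anymal_c_flat_config.py
from legged_gym.envs.anymal_c.mixed_terrains.anymal_c_rough_config import AnymalCRoughCfg

class AnymalCFlatCfg(AnymalCRoughCfg):
    class env(AnymalCRoughCfg.env):
        num_envs = 1024  # default is 4096; lower it to fit in GPU memory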

chengxuxin commented 2 years ago

You could reduce num_envs in legged_gym/envs/anymal_c/flat/anymal_c_flat_config.py. The default is num_envs = 4096; you could try 2048 or 1024 to save memory. Just set num_envs = 1024 there.

I have tried decreasing num_envs to a very small number like 1 or 2, but it still did not work. To see how much memory the simulation takes, I set --sim_device=cpu so that the allocation happens in system RAM instead of on the GPU. It takes about 4.5 GB, which is too much for my GPU, yet increasing num_envs from 1 to 4096 only adds about 300 MB. So I am wondering what the 4.5 GB is for; it seems to have no relation to num_envs.
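(One way to probe this directly on the GPU is a sketch like the following; torch.cuda.mem_get_info requires a reasonably recent PyTorch, and get_args/make_env are the same helpers train.py uses:)

import isaacgym  # must be imported before torch
import torch
from legged_gym.utils import get_args, task_registry

# Report device-wide memory before and after env creation; the difference
# includes PhysX allocations, not just PyTorch tensors.
args = get_args()
free_before, total = torch.cuda.mem_get_info()
env, env_cfg = task_registry.make_env(name=args.task, args=args)
free_after, _ = torch.cuda.mem_get_info()
print(f"env creation used ~{(free_before - free_after) / 2**30:.2f} GiB "
      f"of {total / 2**30:.2f} GiB total")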

fanshi14 commented 2 years ago

I see. I think that problem is caused by Isaac Gym itself: if you run the default ANYmal example in Isaac Gym, it also takes more than 4.5 GB of GPU memory. Sorry, I do not have an answer; maybe you could ask the developers at NVIDIA. And please let me know as well if you find a solution.

HrvojeBogadi commented 2 years ago

I have managed to find a partial solution to this problem, at least for my case.

To anyone still interested in what happened...

I tried running the same examples, got the same problems, snooped around a bit, and concluded that training works perfectly well in headless mode: python legged_gym/scripts/train.py --task=anymal_c_flat --sim_device=cuda --rl_device=cuda --pipeline=gpu --num_envs=2048 --headless

A larger number of envs would probably work, but I am heavily limited by my 4 GB GTX 1050 Ti.

It obviously does not show the simulation, but it trains everything perfectly well. After training is done, I can run the trained policy with the play script on --sim_device=cpu with a small set of envs and watch how the robots behave.
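(The play invocation I mean is roughly the following, with the script path matching the train command above:)

python legged_gym/scripts/play.py --task=anymal_c_flat --sim_device=cpu --num_envs=32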

HOWEVER: setting --pipeline=cpu while keeping --sim_device=cuda seems to do the trick! (The whole command being: python legged_gym/scripts/train.py --task=anymal_c_flat --sim_device=cuda --rl_device=cuda --pipeline=cpu --num_envs=256.) I still cannot run a full simulation with 4096 robots, but 256, for example, works perfectly fine. I am not sure why this happens, but the idea of setting the pipeline came from here.

It is worth noting that, among other things, Isaac Gym does appear to have memory-management problems. After rerunning the script a few times, I noticed it fails to start even with 32 envs with an "out of memory" error, so memory obviously is not being freed when it should be. NVIDIA apparently knows there are memory-management issues but does not want to focus its development on this at this time.

Even though training with the viewer does work this way, I find that the best approach is to train the policy in --headless mode with the GPU as both sim and RL device, and then play it back with --sim_device=cuda and --pipeline=gpu. This gives the fastest training times, and memory usage then depends only on the number of envs, not on the existence of the viewer itself.

Running the simulation purely on the CPU also does the trick, but it tends to get laggy with low frame rates, which slows down training.

sujitvasanth commented 2 years ago

Hi, I have an RTX 3060 desktop and had similar problems until I upgraded to the latest PyTorch with CUDA and to the proprietary Ubuntu driver 515. Here is my setup, which fully works with --num_envs=1024 or when allowing ninja to choose.

To do this on Ubuntu, go to Settings -> Additional Drivers and select: NVIDIA driver metapackage from nvidia-driver-515 (proprietary, tested).

In a web browser, open https://pytorch.org/get-started/locally/ and select Linux, Python, CUDA 11.6, which generates an install command:

pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116 (tested)

or

conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge (untested)

Modified setup:
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3060
Nvidia driver version: 515.48.07
OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Libc version: glibc-2.31
Python version: 3.8.10 (default, Mar 15 2022, 12:22:08) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.13.0-52-generic-x86_64-with-glibc2.29
[pip3] numpy==1.19.5
[pip3] torch==1.12.0+cu116
[pip3] torchaudio==0.12.0+cu116
[pip3] torchvision==0.13.0+cu116
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

You can now simply try, e.g.:

cd Desktop/legged_gym-master/legged_gym/scripts
python train.py --task=cassie

Tzxtoiny commented 4 months ago

I met the same problem when trying to run: python play.py --task=anymal_c_flat --sim_device=cuda --rl_device=cuda --pipeline=gpu --num_envs=1

So how did you solve the error?

When I add --headless it works fine, so is the problem that rendering takes up too much memory?

@chengxuxin

P.S. I saw your work on parkour at the recent ICRA 2024, great work!