saimwani / multiON

Code for reproducing the results of the NeurIPS 2020 paper "MultiON: Benchmarking Semantic Map Memory using Multi-Object Navigation"

RuntimeError: CUDA error: out of memory #3

Closed ZhuFengdaaa closed 3 years ago

ZhuFengdaaa commented 3 years ago

I get a CUDA out-of-memory error when I run:

python habitat_baselines/run.py --exp-config habitat_baselines/config/multinav/ppo_multinav.yaml --agent-type oracle-ego --run-type train

The error log is:

2021-01-16 13:48:18,512 Initializing task MultiNav-v1
Traceback (most recent call last):
  File "habitat_baselines/run.py", line 65, in <module>
    main()
  File "habitat_baselines/run.py", line 18, in main
    run_exp(**vars(args))
  File "habitat_baselines/run.py", line 60, in run_exp
    trainer.train()
  File "/multiON/habitat_baselines/rl/ppo/ppo_trainer.py", line 1154, in train
    self._setup_actor_critic_agent(ppo_cfg)
  File "/multiON/habitat_baselines/rl/ppo/ppo_trainer.py", line 936, in _setup_actor_critic_agent
    self.actor_critic.to(self.device)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 425, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 201, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 201, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 201, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 223, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 423, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: CUDA error: out of memory
Exception ignored in: <bound method VectorEnv.__del__ of <habitat.core.vector_env.VectorEnv object at 0x7f9a0a86b6a0>>
Traceback (most recent call last):
  File "/multiON/habitat/core/vector_env.py", line 469, in __del__
    self.close()
  File "/multiON/habitat/core/vector_env.py", line 351, in close
    write_fn((CLOSE_COMMAND, None))
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

I reduced NUM_PROCESSES to 4, but it still does not work. My nvidia-smi output is:

Sat Jan 16 13:55:15 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN V             Off  | 00000000:1A:00.0 Off |                  N/A |
| 28%   38C    P8    25W / 250W |      0MiB / 12066MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN V             Off  | 00000000:1B:00.0 Off |                  N/A |
| 28%   37C    P8    24W / 250W |      0MiB / 12066MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN V             Off  | 00000000:3D:00.0 Off |                  N/A |
| 28%   35C    P8    26W / 250W |      0MiB / 12066MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN V             Off  | 00000000:3E:00.0 Off |                  N/A |
| 28%   37C    P8    25W / 250W |      0MiB / 12066MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

How much CUDA memory is required? Thanks in advance for the help.

A Temporary Solution:

I tried setting TORCH_GPU_ID=1 so that the network forward pass runs on a different GPU device. It looks fine now ;)
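
A minimal sketch of that override done programmatically, assuming the standard habitat-baselines config keys TORCH_GPU_ID (policy network) and SIMULATOR_GPU_ID (simulator threads) are present in this repo's config; the exact keys may differ here:

```python
# Sketch only: place the policy network and the simulator on different GPUs.
# Assumes the standard habitat-baselines keys TORCH_GPU_ID and SIMULATOR_GPU_ID;
# the config path is the one from the run command above.
from habitat_baselines.config.default import get_config

config = get_config("habitat_baselines/config/multinav/ppo_multinav.yaml")
config.defrost()
config.TORCH_GPU_ID = 1      # run the policy network (actor_critic.to(device)) on GPU 1
config.SIMULATOR_GPU_ID = 0  # keep the simulator threads on GPU 0
config.freeze()
```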

saimwani commented 3 years ago

You could also try to distribute the simulators over different GPUs by changing the GPU_DEVICE_ID for each process here. With two 12GB GPUs (one for simulator threads and another for torch), you should be able to train at least 12 workers (NUM_PROCESSES=12) in parallel.
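
A rough sketch of that idea (not the repo's exact code), assuming each per-process config exposes the standard TASK_CONFIG.SIMULATOR.HABITAT_SIM_V0.GPU_DEVICE_ID key; assign_simulator_gpus and simulator_gpus are illustrative names:

```python
# Illustrative sketch: round-robin the simulator processes over several GPUs.
# `configs` would be the per-process configs built for the vectorized envs;
# `simulator_gpus` is a hypothetical list of GPU IDs reserved for simulators.
def assign_simulator_gpus(configs, simulator_gpus):
    for i, proc_config in enumerate(configs):
        proc_config.defrost()
        proc_config.TASK_CONFIG.SIMULATOR.HABITAT_SIM_V0.GPU_DEVICE_ID = (
            simulator_gpus[i % len(simulator_gpus)]
        )
        proc_config.freeze()
    return configs


# e.g. with GPU 1 reserved for torch and the rest for simulator threads:
# configs = assign_simulator_gpus(configs, simulator_gpus=[0, 2, 3])
```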

ZhuFengdaaa commented 3 years ago

This suggestion is helpful! I never thought that the simulator could be distributed to different GPUs. Thank you!