toshikwa / slac.pytorch

PyTorch implementation of Stochastic Latent Actor-Critic (SLAC).
MIT License

RuntimeError inplace operation when training #1

Closed: asolano closed this issue 4 years ago

asolano commented 4 years ago

Hi!

I am trying to run the code with the default configuration, and after a while the program fails with the following error:

Traceback (most recent call last):
  File "code/main.py", line 73, in <module>
    run()
  File "code/main.py", line 66, in run
    agent.run()
  File "/home/ail/slac.pytorch/code/agent.py", line 116, in run
    self.train_episode()
  File "/home/ail/slac.pytorch/code/agent.py", line 200, in train_episode
    self.learn()
  File "/home/ail/slac.pytorch/code/agent.py", line 228, in learn
    self.learn_sac()
  File "/home/ail/slac.pytorch/code/agent.py", line 275, in learn_sac
    self.policy_optim, self.policy, policy_loss, self.grad_clip)
  File "/home/ail/slac.pytorch/code/utils.py", line 39, in update_params
    loss.backward(retain_graph=retain_graph)
  File "/home/ail/miniconda3/envs/slac/lib/python3.6/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ail/miniconda3/envs/slac/lib/python3.6/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [256, 1]], which is output 0 of TBackward, is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

Is this a result of using a newer version of the dependencies? (the requirements.txt does not have pinned versions)

Thanks.

toshikwa commented 4 years ago

Hi, @asolano

Thank you for reporting this. It's an error caused by the update to torch>=1.5.0.
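
For context, starting with torch 1.5 autograd detects when a tensor saved for backward has been modified in place by a later optimizer step. Below is a hypothetical minimal sketch (not this repo's code) of the kind of ordering it rejects: an optimizer step updates a network's weights in place while a second backward pass still needs their old values.

import torch

# Hypothetical minimal reproduction of the torch>=1.5 in-place error (not the repo's code).
net = torch.nn.Linear(3, 1)
opt = torch.optim.SGD(net.parameters(), lr=0.1)

x = torch.randn(5, 3, requires_grad=True)
out = net(x)
loss1 = out.mean()
loss2 = (out ** 2).mean()

loss1.backward(retain_graph=True)
opt.step()        # updates net.weight in place, bumping its version counter
loss2.backward()  # torch>=1.5 raises "... modified by an inplace operation"

The general fix for this class of error is to run every backward pass that needs a network's old weights before that network's optimizer step.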

I will fix it and refactor the code within a few days. Until then, please downgrade to torch<=1.4.0.

Thanks :)

toshikwa commented 4 years ago

Hi @asolano

I quickly fixed the issue. Please check if you can run the code.

Thanks :)

asolano commented 4 years ago

Thanks for the quick response, Toshiki.

The new code seemed to run for a while, but then this error happened.

episode: 39    episode steps: 1000  reward: 6.7  
------------------------------------------------------------
Learning the latent model only...
Finish learning the latent model.
------------------------------------------------------------
------------------------------------------------------------
environment steps: 40000  return: 11.4  +/- 3.9  
------------------------------------------------------------
episode: 40    episode steps: 1000  reward: 16.0 
Traceback (most recent call last):
  File "code/main.py", line 69, in <module>
    run()
  File "code/main.py", line 65, in run
    agent.run()
  File "/home/ail/slac.pytorch/code/agent.py", line 115, in run
    self.train_episode()
  File "/home/ail/slac.pytorch/code/agent.py", line 196, in train_episode
    self.learn()
  File "/home/ail/slac.pytorch/code/agent.py", line 224, in learn
    self.learn_sac()
  File "/home/ail/slac.pytorch/code/agent.py", line 292, in learn_sac
    'stats/alpha', self.alpha.detach().item(),
AttributeError: 'float' object has no attribute 'detach'

It seems there is some tensor/scalar type mismatch, right?
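
Looking at the traceback, self.alpha seems to be a plain Python float at that point rather than a tensor, so .detach() has nothing to call. A tiny illustrative helper (hypothetical, not from the repo) for the tensor/float distinction:

import torch

def scalar(x):
    # Return a plain Python float whether x is a 0-dim tensor or already a float.
    return x.item() if torch.is_tensor(x) else float(x)

print(scalar(torch.tensor(0.2)))  # alpha stored as a learned tensor
print(scalar(0.2))                # alpha stored as a fixed float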

By the way, learning the latent model takes a long time, probably because it is not using the GPUs for those calculations (according to nvidia-smi and htop, it is all on the CPU).

toshikwa commented 4 years ago

Hi @asolano

I'm so sorry, it's a bug and I fixed it. Please try it again.

> By the way, learning the latent model takes a long time, probably because it is not using the GPUs for those calculations (according to nvidia-smi and htop, it is all on the CPU).

Let me confirm: did you specify --cuda? The example command in the README seems to use the GPU correctly on my machine:

$ watch -n 1 nvidia-smi

Every 1.0s: nvidia-smi

Mon Aug 31 09:56:30 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:01:00.0  On |                  N/A |
| 45%   69C    P2    86W / 180W |   1972MiB /  8117MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+

Thanks.

toshikwa commented 4 years ago

If you're using a CUDA version other than 10.2, you need to reinstall the PyTorch build that matches your CUDA version. Please see the instructions for more details:

https://pytorch.org/get-started/locally/
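
A quick way to sanity-check the installed build from Python (a generic check, nothing repo-specific): torch.version.cuda should be a toolkit version your driver supports, and torch.cuda.is_available() should return True.

import torch

print(torch.__version__)           # e.g. 1.6.0
print(torch.version.cuda)          # CUDA toolkit the wheel was built against
print(torch.cuda.is_available())   # should be True if the GPU build is installed correctly
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))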

asolano commented 4 years ago

Thanks for the update. The command I used is straight from the README, including the cuda flag:

$ python code/main.py --env_type dm_control --domain_name cheetah --task_name run --action_repeat 4 --seed 0 --cuda

At first it uses the GPU, but then it quickly slows down here:

episode: 28    episode steps: 1000  reward: 9.5  
episode: 29    episode steps: 1000  reward: 16.2 
episode: 30    episode steps: 1000  reward: 8.1  
episode: 31    episode steps: 1000  reward: 5.1  
episode: 32    episode steps: 1000  reward: 15.9 
episode: 33    episode steps: 1000  reward: 9.7  
episode: 34    episode steps: 1000  reward: 13.0 
episode: 35    episode steps: 1000  reward: 15.9 
episode: 36    episode steps: 1000  reward: 8.9  
episode: 37    episode steps: 1000  reward: 18.7 
episode: 38    episode steps: 1000  reward: 16.4 
episode: 39    episode steps: 1000  reward: 13.3 
------------------------------------------------------------
Learning the latent model only...

And this is the result of watching nvidia-smi:

Every 1.0s: nvidia-smi                                                                          Mon Aug 31 10:21:11 2020

Mon Aug 31 10:21:11 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P5000        Off  | 00000000:18:00.0 Off |                  Off |
| 26%   28C    P8     6W / 180W |    336MiB / 16278MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro P5000        Off  | 00000000:3B:00.0 Off |                  Off |
| 26%   27C    P8     6W / 180W |     11MiB / 16278MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Quadro P5000        Off  | 00000000:86:00.0 Off |                  Off |
| 26%   27C    P8     6W / 180W |     11MiB / 16278MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Quadro P5000        Off  | 00000000:AF:00.0 Off |                  Off |
| 26%   28C    P8     6W / 180W |     11MiB / 16278MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     83667      G   python                                         2MiB |
|    0     91916      G   python                                       257MiB |
|    0    274846      G   python                                        65MiB |
+-----------------------------------------------------------------------------+

It seems the part of the code dealing with the latent model is not running on the GPU on this system.

For confirmation, this is what htop shows:

  1  [|||||||||||||||100.0%]   7  [|||||||||||||||100.0%]   13 [|||||||||||||||100.0%]   19 [|||||||||||||||100.0%]
  2  [|||||||||||||||100.0%]   8  [||||||||||||||||92.7%]   14 [|||||||||||||||100.0%]   20 [|||||||||||||||100.0%]
  3  [|||||||||||||||100.0%]   9  [|||||||||||||||100.0%]   15 [|||||||||||||||100.0%]   21 [|||||||||||||||100.0%]
  4  [|||||||||||||||100.0%]   10 [|||||||||||||||100.0%]   16 [|||||||||||||||100.0%]   22 [|||||||||||||||100.0%]
  5  [||||||||||||||||98.7%]   11 [|||||||||||||||100.0%]   17 [||||||||||||||||99.3%]   23 [|||||||||||||||100.0%]
  6  [|||||||||||||||100.0%]   12 [|||||||||||||||100.0%]   18 [|||||||||||||||100.0%]   24 [|||||||||||||||100.0%]
  Mem[||||||||||||||||||||||||||||||          4.76G/126G]   Tasks: 119, 549 thr; 25 running
  Swp[|||                                     54.0M/977M]   Load average: 28.86 20.72 9.78 
                                                            Uptime: 24 days, 20:33:31

   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command                                                 
274846 ail        20   0 5880M  808M  166M R 2380  0.6  2h28:55 python code/main.py --env_type dm_control --domain_name 
275295 ail        20   0 5880M  808M  166M S 68.4  0.6  3:44.17 python code/main.py --env_type dm_control --domain_name 
275279 ail        20   0 5880M  808M  166M S 67.1  0.6  3:46.76 python code/main.py --env_type dm_control --domain_name 
275293 ail        20   0 5880M  808M  166M S 63.8  0.6  3:47.37 python code/main.py --env_type dm_control --domain_name 
275288 ail        20   0 5880M  808M  166M S 63.8  0.6  3:49.24 python code/main.py --env_type dm_control --domain_name 
275290 ail        20   0 5880M  808M  166M S 63.2  0.6  3:52.18 python code/main.py --env_type dm_control --domain_name 
275284 ail        20   0 5880M  808M  166M S 63.2  0.6  3:51.50 python code/main.py --env_type dm_control --domain_name 
275294 ail        20   0 5880M  808M  166M S 63.2  0.6  3:46.45 python code/main.py --env_type dm_control --domain_name 
275277 ail        20   0 5880M  808M  166M S 63.2  0.6  3:51.07 python code/main.py --env_type dm_control --domain_name 
275276 ail        20   0 5880M  808M  166M S 63.2  0.6  3:51.43 python code/main.py --env_type dm_control --domain_name 
275280 ail        20   0 5880M  808M  166M S 63.2  0.6  3:47.12 python code/main.py --env_type dm_control --domain_name 
275298 ail        20   0 5880M  808M  166M S 63.2  0.6  3:49.06 python code/main.py --env_type dm_control --domain_name 
275286 ail        20   0 5880M  808M  166M S 63.2  0.6  3:46.41 python code/main.py --env_type dm_control --domain_name 
275297 ail        20   0 5880M  808M  166M S 63.2  0.6  3:49.82 python code/main.py --env_type dm_control --domain_name 
275289 ail        20   0 5880M  808M  166M S 62.5  0.6  3:46.80 python code/main.py --env_type dm_control --domain_name 
275283 ail        20   0 5880M  808M  166M S 62.5  0.6  3:49.12 python code/main.py --env_type dm_control --domain_name 
275278 ail        20   0 5880M  808M  166M S 62.5  0.6  3:49.33 python code/main.py --env_type dm_control --domain_name 
275281 ail        20   0 5880M  808M  166M S 62.5  0.6  3:51.67 python code/main.py --env_type dm_control --domain_name 
275287 ail        20   0 5880M  808M  166M S 62.5  0.6  3:47.47 python code/main.py --env_type dm_control --domain_name 
275282 ail        20   0 5880M  808M  166M S 62.5  0.6  3:47.95 python code/main.py --env_type dm_control --domain_name 
275296 ail        20   0 5880M  808M  166M S 62.5  0.6  3:45.06 python code/main.py --env_type dm_control --domain_name 
F1Help  F2Setup F3SearchF4FilterF5Tree  F6SortByF7Nice -F8Nice +F9Kill  F10Quit                                         
[0] 0:bash  1:python  2:python  3:python  4:bash  5:watch- 6:htop*                             "p-gpu-2" 10:24 31-Aug-20

So apparently torch is using all the CPU cores for this?

I will try to update to CUDA 10.2 and see if there is any change.

toshikwa commented 4 years ago

Hi @asolano

I suspect the issue is due to an invalid combination of CUDA and PyTorch versions. Please reinstall PyTorch. For example, use the command below if you use CUDA 9.2 (please see the instructions):

pip install torch==1.6.0+cu92 -f https://download.pytorch.org/whl/torch_stable.html

asolano commented 4 years ago

Thanks. I first tried updating CUDA to 10.2 and PyTorch to 1.6 as instructed on the website:

$ conda install pytorch torchvision cudatoolkit=10.2 -c pytorch

but nothing changed. So I recreated the environment from scratch:

$ pip install -r requirements.txt

and that didn't work, either.

So I reinstalled PyTorch with cudatoolkit 10.1 (the same CUDA version as the host driver):

$ conda install pytorch torchvision cudatoolkit=10.1 -c pytorch

And now it is using one of the GPUs and the CPUs are almost idle, so that's good. I will let you know once it finishes!

toshikwa commented 4 years ago

Yeah... sometimes the nvidia-driver and CUDA configuration doesn't go smoothly. Anyway, I'm glad that everything seems to have worked out in the end!! (Maybe I should have shared the Dockerfile I've been using...)

Thanks :)

toshikwa commented 4 years ago

> Thanks. I first tried updating CUDA to 10.2 and PyTorch to 1.6 as instructed on the website:
>
> $ conda install pytorch torchvision cudatoolkit=10.2 -c pytorch
>
> but nothing changed.

Maybe it is because CUDA 10.2 requires nvidia-driver 440.33 or later. (source)

asolano commented 4 years ago

Sounds reasonable, yes.

So, the default run took more than a day but it finished successfully:

------------------------------------------------------------
environment steps: 2990000  return: 862.7 +/- 10.5 
------------------------------------------------------------
episode: 2990  episode steps: 1000  reward: 867.9
episode: 2991  episode steps: 1000  reward: 872.8
episode: 2992  episode steps: 1000  reward: 862.3
episode: 2993  episode steps: 1000  reward: 746.2
episode: 2994  episode steps: 1000  reward: 848.4
episode: 2995  episode steps: 1000  reward: 853.1
episode: 2996  episode steps: 1000  reward: 871.1
episode: 2997  episode steps: 1000  reward: 866.9
episode: 2998  episode steps: 1000  reward: 874.2
episode: 2999  episode steps: 1000  reward: 817.5
------------------------------------------------------------
environment steps: 3000000  return: 855.3 +/- 24.0 
------------------------------------------------------------
episode: 3000  episode steps: 1000  reward: 883.7
episode: 3001  episode steps: 1000  reward: 859.7

Thanks again for the quick support.

toshikwa commented 4 years ago

> So, the default run took more than a day but it finished successfully:

I'm glad to hear that. Training SLAC takes some time because latent inference involves forward propagation over each sequence, one step at a time. I have some ideas to speed up training a little, but I don't have enough time to apply them.
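
(For the curious, here is a schematic sketch of why the inference is sequential, with made-up shapes and a made-up posterior network rather than the actual latent model: each step's latent sample feeds the next step's posterior, so the loop over the sequence cannot be batched away.)

import torch

T, B, latent_dim, feat_dim = 8, 32, 32, 256
posterior = torch.nn.Linear(latent_dim + feat_dim, 2 * latent_dim)  # outputs mean and log_std

features = torch.randn(T, B, feat_dim)   # encoded image features for a sequence
z = torch.zeros(B, latent_dim)           # initial latent state
for t in range(T):                        # inherently step-by-step over time
    mean, log_std = posterior(torch.cat([z, features[t]], dim=1)).chunk(2, dim=1)
    z = mean + log_std.exp() * torch.randn_like(mean)   # reparameterized sample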

Please close the issue if you don't have any other problems. Thanks!!