Hi, @asolano
Thank you for sharing.
It's an error caused by the update to torch>=1.5.0.
I will fix it and refactor the code within a few days, so please downgrade to torch<=1.4.0 until then.
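For example, with pip the downgrade would look something like this (adjust to your own environment manager):
$ pip install torch==1.4.0 torchvision==0.5.0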
Thanks :)
Hi @asolano
I quickly fixed the issue. Please check if you can run the code.
Thanks :)
Thanks for the quick response, Toshiki.
The new code seemed to run for a while, but then this error happened.
episode: 39 episode steps: 1000 reward: 6.7
------------------------------------------------------------
Learning the latent model only...
Finish learning the latent model.
------------------------------------------------------------
------------------------------------------------------------
environment steps: 40000 return: 11.4 +/- 3.9
------------------------------------------------------------
episode: 40 episode steps: 1000 reward: 16.0
Traceback (most recent call last):
File "code/main.py", line 69, in <module>
run()
File "code/main.py", line 65, in run
agent.run()
File "/home/ail/slac.pytorch/code/agent.py", line 115, in run
self.train_episode()
File "/home/ail/slac.pytorch/code/agent.py", line 196, in train_episode
self.learn()
File "/home/ail/slac.pytorch/code/agent.py", line 224, in learn
self.learn_sac()
File "/home/ail/slac.pytorch/code/agent.py", line 292, in learn_sac
'stats/alpha', self.alpha.detach().item(),
AttributeError: 'float' object has no attribute 'detach'
It seems there is some array/scalar type mismatch, right?
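For reference, a generic workaround for this kind of mismatch (just a sketch, not necessarily how the repository will fix it) is to branch on the type before logging; scalar_value here is only a hypothetical helper name:
import torch

def scalar_value(x):
    # Works whether x is a learned entropy coefficient (a tensor that needs
    # .detach().item()) or a fixed one stored as a plain Python float.
    return x.detach().item() if torch.is_tensor(x) else float(x)
The logging call could then use scalar_value(self.alpha) instead of self.alpha.detach().item().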
By the way, learning the latent model takes a long time, probably because it is not using the GPUs for those calculations (according to nvidia-smi and htop, it is all on the CPU).
Hi @asolano
I'm so sorry, it's a bug and I fixed it. Please try it again.
By the way, learning the latent model takes a long time, probably because it is not using the GPUs for those calculations (according to nvidia-smi and htop, it is all on the CPU).
Could you confirm that you are running with --cuda specified?
The example command in the README seems to use the GPU correctly on my machine:
$ watch -n 1 nvidia-smi
Every 1.0s: nvidia-smi
Mon Aug 31 09:56:30 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1080 Off | 00000000:01:00.0 On | N/A |
| 45% 69C P2 86W / 180W | 1972MiB / 8117MiB | 96% Default |
+-------------------------------+----------------------+----------------------+
Thanks.
If you're using a CUDA version other than 10.2, you need to reinstall the PyTorch build that matches your CUDA version. Please see the instructions for more details.
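For example, for CUDA 10.1 it would be something like this (check the PyTorch site for the exact wheel matching your setup):
$ pip install torch==1.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html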
Thanks for the update. The command I used is straight from the README, including the cuda flag:
$ python code/main.py --env_type dm_control --domain_name cheetah --task_name run --action_repeat 4 --seed 0 --cuda
At first it uses the GPU, but then it quickly slows down here:
episode: 28 episode steps: 1000 reward: 9.5
episode: 29 episode steps: 1000 reward: 16.2
episode: 30 episode steps: 1000 reward: 8.1
episode: 31 episode steps: 1000 reward: 5.1
episode: 32 episode steps: 1000 reward: 15.9
episode: 33 episode steps: 1000 reward: 9.7
episode: 34 episode steps: 1000 reward: 13.0
episode: 35 episode steps: 1000 reward: 15.9
episode: 36 episode steps: 1000 reward: 8.9
episode: 37 episode steps: 1000 reward: 18.7
episode: 38 episode steps: 1000 reward: 16.4
episode: 39 episode steps: 1000 reward: 13.3
------------------------------------------------------------
Learning the latent model only...
And this is the result of watching nvidia-smi:
Every 1.0s: nvidia-smi Mon Aug 31 10:21:11 2020
Mon Aug 31 10:21:11 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro P5000 Off | 00000000:18:00.0 Off | Off |
| 26% 28C P8 6W / 180W | 336MiB / 16278MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Quadro P5000 Off | 00000000:3B:00.0 Off | Off |
| 26% 27C P8 6W / 180W | 11MiB / 16278MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Quadro P5000 Off | 00000000:86:00.0 Off | Off |
| 26% 27C P8 6W / 180W | 11MiB / 16278MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Quadro P5000 Off | 00000000:AF:00.0 Off | Off |
| 26% 28C P8 6W / 180W | 11MiB / 16278MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 83667 G python 2MiB |
| 0 91916 G python 257MiB |
| 0 274846 G python 65MiB |
+-----------------------------------------------------------------------------+
It seems the part of the code dealing with the latent model is not running on the GPU on this system.
For confirmation, this is what htop shows:
1 [|||||||||||||||100.0%] 7 [|||||||||||||||100.0%] 13 [|||||||||||||||100.0%] 19 [|||||||||||||||100.0%]
2 [|||||||||||||||100.0%] 8 [||||||||||||||||92.7%] 14 [|||||||||||||||100.0%] 20 [|||||||||||||||100.0%]
3 [|||||||||||||||100.0%] 9 [|||||||||||||||100.0%] 15 [|||||||||||||||100.0%] 21 [|||||||||||||||100.0%]
4 [|||||||||||||||100.0%] 10 [|||||||||||||||100.0%] 16 [|||||||||||||||100.0%] 22 [|||||||||||||||100.0%]
5 [||||||||||||||||98.7%] 11 [|||||||||||||||100.0%] 17 [||||||||||||||||99.3%] 23 [|||||||||||||||100.0%]
6 [|||||||||||||||100.0%] 12 [|||||||||||||||100.0%] 18 [|||||||||||||||100.0%] 24 [|||||||||||||||100.0%]
Mem[|||||||||||||||||||||||||||||| 4.76G/126G] Tasks: 119, 549 thr; 25 running
Swp[||| 54.0M/977M] Load average: 28.86 20.72 9.78
Uptime: 24 days, 20:33:31
PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
274846 ail 20 0 5880M 808M 166M R 2380 0.6 2h28:55 python code/main.py --env_type dm_control --domain_name
275295 ail 20 0 5880M 808M 166M S 68.4 0.6 3:44.17 python code/main.py --env_type dm_control --domain_name
275279 ail 20 0 5880M 808M 166M S 67.1 0.6 3:46.76 python code/main.py --env_type dm_control --domain_name
275293 ail 20 0 5880M 808M 166M S 63.8 0.6 3:47.37 python code/main.py --env_type dm_control --domain_name
275288 ail 20 0 5880M 808M 166M S 63.8 0.6 3:49.24 python code/main.py --env_type dm_control --domain_name
275290 ail 20 0 5880M 808M 166M S 63.2 0.6 3:52.18 python code/main.py --env_type dm_control --domain_name
275284 ail 20 0 5880M 808M 166M S 63.2 0.6 3:51.50 python code/main.py --env_type dm_control --domain_name
275294 ail 20 0 5880M 808M 166M S 63.2 0.6 3:46.45 python code/main.py --env_type dm_control --domain_name
275277 ail 20 0 5880M 808M 166M S 63.2 0.6 3:51.07 python code/main.py --env_type dm_control --domain_name
275276 ail 20 0 5880M 808M 166M S 63.2 0.6 3:51.43 python code/main.py --env_type dm_control --domain_name
275280 ail 20 0 5880M 808M 166M S 63.2 0.6 3:47.12 python code/main.py --env_type dm_control --domain_name
275298 ail 20 0 5880M 808M 166M S 63.2 0.6 3:49.06 python code/main.py --env_type dm_control --domain_name
275286 ail 20 0 5880M 808M 166M S 63.2 0.6 3:46.41 python code/main.py --env_type dm_control --domain_name
275297 ail 20 0 5880M 808M 166M S 63.2 0.6 3:49.82 python code/main.py --env_type dm_control --domain_name
275289 ail 20 0 5880M 808M 166M S 62.5 0.6 3:46.80 python code/main.py --env_type dm_control --domain_name
275283 ail 20 0 5880M 808M 166M S 62.5 0.6 3:49.12 python code/main.py --env_type dm_control --domain_name
275278 ail 20 0 5880M 808M 166M S 62.5 0.6 3:49.33 python code/main.py --env_type dm_control --domain_name
275281 ail 20 0 5880M 808M 166M S 62.5 0.6 3:51.67 python code/main.py --env_type dm_control --domain_name
275287 ail 20 0 5880M 808M 166M S 62.5 0.6 3:47.47 python code/main.py --env_type dm_control --domain_name
275282 ail 20 0 5880M 808M 166M S 62.5 0.6 3:47.95 python code/main.py --env_type dm_control --domain_name
275296 ail 20 0 5880M 808M 166M S 62.5 0.6 3:45.06 python code/main.py --env_type dm_control --domain_name
So apparently torch is using all the cores for this?
I will try to update to CUDA 10.2 and see if there is any change.
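In the meantime, a quick sanity check from Python (just the standard PyTorch calls, nothing specific to this repository) shows what the installed build thinks is available:
import torch

# PyTorch version, the CUDA version it was built against, whether a GPU is
# visible at runtime, and how many devices can be used.
print(torch.__version__)
print(torch.version.cuda)
print(torch.cuda.is_available())
print(torch.cuda.device_count())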
Hi @asolano
I suspect that the issue is due to an invalid combination of CUDA and PyTorch versions. Please reinstall PyTorch. For example, use the command below if you are using CUDA 9.2 (please see the instructions for more details).
pip install torch==1.6.0+cu92 -f https://download.pytorch.org/whl/torch_stable.html
Thanks, I first tried updating CUDA to 10.2 and PyTorch to 1.6 as instructed on the PyTorch website:
$ conda install pytorch torchvision cudatoolkit=10.2 -c pytorch
but nothing changed. So I recreated the environment from scratch:
$ pip install -r requirements.txt
and that didn't work, either.
So I reinstalled PyTorch with CUDA 10.1 (the same version reported by the host driver):
$ conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
And now it is using one of the GPUs and the CPUs are almost idle, so that's good. I will let you know once it finishes!
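For the record, on a multi-GPU machine like this one the process can be pinned to a single card with the standard CUDA environment variable, e.g.:
$ CUDA_VISIBLE_DEVICES=0 python code/main.py --env_type dm_control --domain_name cheetah --task_name run --action_repeat 4 --seed 0 --cuda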
Yeah... sometimes the NVIDIA driver and CUDA configuration doesn't go smoothly. Anyway, I'm glad that everything seems to have worked out in the end!! (Maybe I should have shared the Dockerfile I've been using...)
Thanks :)
Thanks, I first tried updating CUDA to 10.2 and PyTorch to 1.6 as instructed on the PyTorch website:
$ conda install pytorch torchvision cudatoolkit=10.2 -c pytorch
but nothing changed.
Maybe it is because CUDA 10.2 requires NVIDIA driver 440.33 or later. (source)
Sounds reasonable, yes.
So, the default run took more than a day but it finished successfully:
------------------------------------------------------------
environment steps: 2990000 return: 862.7 +/- 10.5
------------------------------------------------------------
episode: 2990 episode steps: 1000 reward: 867.9
episode: 2991 episode steps: 1000 reward: 872.8
episode: 2992 episode steps: 1000 reward: 862.3
episode: 2993 episode steps: 1000 reward: 746.2
episode: 2994 episode steps: 1000 reward: 848.4
episode: 2995 episode steps: 1000 reward: 853.1
episode: 2996 episode steps: 1000 reward: 871.1
episode: 2997 episode steps: 1000 reward: 866.9
episode: 2998 episode steps: 1000 reward: 874.2
episode: 2999 episode steps: 1000 reward: 817.5
------------------------------------------------------------
environment steps: 3000000 return: 855.3 +/- 24.0
------------------------------------------------------------
episode: 3000 episode steps: 1000 reward: 883.7
episode: 3001 episode steps: 1000 reward: 859.7
Thanks again for the quick support.
So, the default run took more than a day but it finished successfully:
I'm glad to hear that. Training SLAC takes some time because inference involves forward propagation through the whole sequence. I have some ideas to speed up training a little, but I don't have enough time to apply them.
Please close the issue if you don't have any other problems. Thanks!!
Hi!
I am trying to run the code with the default configuration, and after a while the program fails with the following error:
Is this a result of using a newer version of the dependencies? (the requirements.txt does not have pinned versions)
Thanks.