Closed — raghu1121 closed this issue 5 years ago
tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
This is normal and is probably printed by the CPU processes (unfortunately I don't think we can suppress that warning).
I am trying out DQN on the CartPole env with 5 CPUs and a GPU, but it is using 6 CPUs and the GPU is only partially used.
The reason you see 6 CPUs used is that one is also needed for the GPU process.
0 14622 C /home/raghu/anaconda3/envs/ray/bin/python 257MiB
This is because CartPole-v0 isn't very memory intensive. We set TensorFlow to only allocate the minimum amount of GPU memory needed, so that multiple training runs can share the same GPU (if you set fractional GPU requirements). If you increase the batch size for training, memory usage can grow to use all of the GPU memory.
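To illustrate the point above: TensorFlow 1.x lets a process grow its GPU allocation on demand instead of grabbing all memory up front, which is what makes GPU sharing possible. The helper below is only a sketch that builds the relevant option values as a plain dict; actually applying them requires TensorFlow (shown in the commented lines), and the exact options RLlib sets internally may differ.

```python
# Hedged sketch: the kind of per-process GPU options a TF 1.x session can
# be given so it only grows its GPU allocation as needed. The helper just
# assembles the option values; applying them needs TensorFlow itself.

def gpu_session_options(allow_growth=True, memory_fraction=None):
    """Return TF 1.x gpu_options settings as a plain dict."""
    opts = {"allow_growth": allow_growth}
    if memory_fraction is not None:
        # a hard cap instead of (or in addition to) on-demand growth
        opts["per_process_gpu_memory_fraction"] = memory_fraction
    return opts

# With TF 1.x installed, applying these would look roughly like:
#   import tensorflow as tf
#   config = tf.ConfigProto()
#   config.gpu_options.allow_growth = True
#   sess = tf.Session(config=config)

print(gpu_session_options(memory_fraction=0.25))
```

This is why `nvidia-smi` reports only ~257 MiB in use for a small CartPole run: nothing has forced the allocation to grow.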
Thanks, Eric, for the reply. If it is only a warning, it makes sense to ignore it.
I already configured the fractional GPU to 1, but it still doesn't use much of the GPU. I have another custom toy-text environment that uses a lot of data, and it has the same problem: the GPU is not fully utilized. Some other libraries like keras-rl utilize the GPU completely. Although RLlib is far more efficient and flexible, I couldn't make use of the full GPU with RLlib, even after changing the batch size.
A few other things I wanted to know:
1) Why do I see '.nan' in the episodereward* fields? Occasionally it shows the reward values, but mostly '.nan'. Why aren't the rewards of every episode shown continuously?
2) How do I force RLlib to use Torch instead of TensorFlow?
3) What is the keyword to configure the number of CPUs in the YAML config file?
Thanks for your support.
Regards Raghu
I'm not sure why you want to use more memory; less is better, right? Do you see slower performance? If so, can you post a benchmark script?
Re NaN: this will happen if no episodes completed in the iteration. This is fixed in the latest version, I think.
Re CPUs: check out the docs https://ray.readthedocs.io/en/latest/rllib-training.html#specifying-parameters
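For a rough picture of what such a YAML config looks like, a tuned-example entry of that era is sketched below. This is an assumption-laden example, not a verbatim file: the key names (`num_workers`, `num_cpus_per_worker`, `num_gpus`) follow the RLlib convention of the time and may differ across Ray versions. CPU usage is driven mainly by the number of rollout workers.

```yaml
# Hypothetical sketch of an RLlib tuned-example YAML entry.
cartpole-dqn:
    env: CartPole-v0
    run: DQN
    config:
        num_workers: 4          # rollout workers, typically one CPU each
        num_cpus_per_worker: 1
        num_gpus: 1             # learner GPU; fractions allowed, e.g. 0.5
```

Note that the driver/learner process also consumes a CPU on top of the workers, which matches the "6 CPUs for 5 workers" observation above.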
We only have Torch support for A3C.
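In the trainer config, selecting the PyTorch implementation would look roughly like the sketch below. The flag name `use_pytorch` follows the A3C config convention of the time and may differ across Ray versions; treat this as an illustration, not the definitive API.

```python
# Hedged sketch: switching A3C from its TensorFlow implementation to the
# PyTorch one via the trainer config. "use_pytorch" is assumed to be the
# flag name of the era; other algorithms were TF-only at the time.
a3c_config = {
    "env": "CartPole-v0",
    "num_workers": 4,       # CPU rollout workers
    "use_pytorch": True,    # use the Torch policy instead of the TF one
}

print(a3c_config["use_pytorch"])
```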
Thanks, Eric. I just wondered whether Ray was not using the resources properly. My experience with Keras was that it used the whole GPU, which I thought was actually necessary, but after working with Ray and Tune it's clear that Ray is efficient and not as resource-hungry as Keras.
System information
I am trying out DQN on the CartPole env with 5 CPUs and a GPU, but it is using 6 CPUs and the GPU is only partially used. I also get the error
tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
I am confused: either it should not use the GPU at all (because of the above error) or it should use all of it, but it uses only a little, as shown below.
0 14622 C /home/raghu/anaconda3/envs/ray/bin/python 257MiB
$ nvidia-smi
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
Source code / logs
Output from above command
Please let me know if you need any more logs.