ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

CUDA_ERROR_NO_DEVICE but partially utilizing GPU #3250

Closed: raghu1121 closed this issue 5 years ago

raghu1121 commented 6 years ago

System information

ericl commented 6 years ago

> tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

This is normal and is probably printed by the CPU worker processes (unfortunately I don't think we can suppress that warning).

> I am trying out DQN on the CartPole env with 5 CPUs and a GPU, but it is using 6 CPUs and the GPU is only partially utilized.

The reason you see 6 CPUs used is that one is also needed for the GPU process.

> 0 14622 C /home/raghu/anaconda3/envs/ray/bin/python 257MiB

This is because CartPole-v0 isn't very memory intensive. We set TensorFlow to only allocate the minimum amount of GPU memory needed, so that multiple training runs can share the same GPU (if you set fractional GPU requirements). If you increase the training batch size, memory usage can grow to use all of the GPU memory.
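For reference, a minimal sketch of the TensorFlow 1.x option that produces this "allocate only what is needed" behavior; the exact session configuration RLlib builds internally may differ, and `allow_growth` here is only an illustration of the mechanism described above:

```python
import tensorflow as tf

# Ask TensorFlow to allocate GPU memory on demand instead of grabbing it all
# up front, so several processes can share one GPU.
gpu_options = tf.GPUOptions(allow_growth=True)
config = tf.ConfigProto(gpu_options=gpu_options)

with tf.Session(config=config) as sess:
    # ... build and run the graph; GPU memory usage grows only as tensors need it
    pass
```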

raghu1121 commented 6 years ago

Thanks, Eric, for the reply. If it is only a warning, it makes sense to ignore it.

I already set the fractional GPU to 1, and it still doesn't use much of the GPU. I have another custom toy text environment which uses a lot of data, and it has the same problem: the GPU is not being fully utilized. Some other libraries like keras-rl utilize the GPU completely. RLlib is far more efficient and flexible, but I couldn't make use of the full GPU with RLlib, even after changing the batch size.

A few other things I wanted to know:

1) Why do I see '.nan' in the episode_reward* fields? Occasionally it shows the reward values, but mostly '.nan'. Why aren't rewards shown for every episode?
2) How do I force RLlib to use Torch instead of TensorFlow?
3) What is the keyword to configure the number of CPUs in the yaml config file?

Thanks for your support.

Regards, Raghu

ericl commented 6 years ago

I'm not sure why you want to use more memory; less is better, right? Do you see slower performance? If so, can you post a benchmark script?

Re NaN: this happens when no episodes completed within the iteration. I think this is fixed in the latest version.

Re CPUs: check out the docs: https://ray.readthedocs.io/en/latest/rllib-training.html#specifying-parameters
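As a rough sketch of what those docs describe (the yaml file passed to `rllib train -f` uses the same keys under `config:`; the experiment name and exact values here are just placeholders):

```python
import ray
from ray import tune

ray.init()

# `num_workers` controls how many rollout-worker CPUs are used; the trainer
# process itself needs one more CPU, which is why 5 workers show up as 6 CPUs.
tune.run_experiments({
    "cartpole-dqn": {
        "run": "DQN",
        "config": {
            "env": "CartPole-v0",
            "num_workers": 5,   # 5 worker CPUs + 1 trainer CPU = 6 CPUs total
            "num_gpus": 1,      # fractional values (e.g. 0.5) let runs share a GPU
        },
    },
})
```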

We only have Torch support for A3C.
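A minimal sketch of what that looked like at the time, assuming A3C's `use_pytorch` config flag (newer RLlib versions use a `framework` setting instead):

```python
import ray
from ray import tune

ray.init()

# A3C was the only algorithm with a PyTorch implementation at this point;
# the (assumed) switch is the `use_pytorch` flag in the trainer config.
tune.run_experiments({
    "cartpole-a3c-torch": {
        "run": "A3C",
        "config": {
            "env": "CartPole-v0",
            "num_workers": 2,
            "use_pytorch": True,  # select the Torch policy instead of the TF one
        },
    },
})
```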

raghu1121 commented 5 years ago

Thanks, Eric. I just wondered whether Ray was not using the resources properly. My experience with Keras was that it used the whole GPU, which I thought was actually necessary, but after working with Ray and Tune, it's clear that Ray is efficient and not as resource hungry as Keras.