ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

CUDA_ERROR_NO_DEVICE but partially utilizing GPU #3250

Closed: raghu1121 closed this issue 5 years ago

raghu1121 commented 6 years ago

System information

ericl commented 6 years ago

> tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

This is normal and is probably printed by the CPU worker processes (unfortunately I don't think we can suppress that warning).

> I am trying out DQN on the CartPole env with 5 CPUs and a GPU, but it is using 6 CPUs and the GPU is only partially utilized.

The reason you see 6 CPUs used is that one is also needed for the GPU process.

> 0 14622 C /home/raghu/anaconda3/envs/ray/bin/python 257MiB

This is because CartPole-v0 isn't very memory intensive. We set TensorFlow to only allocate the minimum amount of GPU memory needed, so that multiple training runs can share the same GPU (if you set fractional GPU requirements). If you increase the training batch size, memory usage can grow to use all of the GPU memory.
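For reference, a minimal sketch of the TensorFlow 1.x option that produces this "allocate only what is needed" behavior; the exact session configuration RLlib builds internally may differ, and `allow_growth` here is only an illustration of the mechanism described above:

```python
import tensorflow as tf

# Ask TensorFlow to allocate GPU memory on demand instead of grabbing it all
# up front, so several processes can share one GPU.
gpu_options = tf.GPUOptions(allow_growth=True)
config = tf.ConfigProto(gpu_options=gpu_options)

with tf.Session(config=config) as sess:
    # ... build and run the graph; GPU memory usage grows only as tensors need it
    pass
```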

raghu1121 commented 6 years ago

Thanks, Eric, for the reply. If it is only a warning, it makes sense to ignore it.

I already set the fractional GPU to 1, and it still doesn't use much of the GPU. I have another custom toy text environment which uses a lot of data, and it has the same problem: the GPU is not being fully utilized. Some other libraries like keras-rl utilize the GPU completely. RLlib is far more efficient and flexible, but I couldn't make use of the full GPU with RLlib, even after changing the batch size.

A few other things I wanted to know:

1) Why do I see '.nan' in the episode_reward* fields? Occasionally it shows the reward values, but mostly '.nan'. Why aren't rewards shown for every episode?
2) How do I force RLlib to use Torch instead of TensorFlow?
3) What is the keyword to configure the number of CPUs in the yaml config file?

Thanks for your support.

Regards, Raghu

ericl commented 6 years ago

I'm not sure why you want to use more memory; less is better, right? Do you see slower performance? If so, can you post a benchmark script?

Re NaN: this happens when no episodes completed within the iteration. I think this is fixed in the latest version.

Re CPUs: check out the docs: https://ray.readthedocs.io/en/latest/rllib-training.html#specifying-parameters
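As a rough sketch of what those docs describe (the yaml file passed to `rllib train -f` uses the same keys under `config:`; the experiment name and exact values here are just placeholders):

```python
import ray
from ray import tune

ray.init()

# `num_workers` controls how many rollout-worker CPUs are used; the trainer
# process itself needs one more CPU, which is why 5 workers show up as 6 CPUs.
tune.run_experiments({
    "cartpole-dqn": {
        "run": "DQN",
        "config": {
            "env": "CartPole-v0",
            "num_workers": 5,   # 5 worker CPUs + 1 trainer CPU = 6 CPUs total
            "num_gpus": 1,      # fractional values (e.g. 0.5) let runs share a GPU
        },
    },
})
```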

We only have Torch support for A3C.
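A minimal sketch of what that looked like at the time, assuming A3C's `use_pytorch` config flag (newer RLlib versions use a `framework` setting instead):

```python
import ray
from ray import tune

ray.init()

# A3C was the only algorithm with a PyTorch implementation at this point;
# the (assumed) switch is the `use_pytorch` flag in the trainer config.
tune.run_experiments({
    "cartpole-a3c-torch": {
        "run": "A3C",
        "config": {
            "env": "CartPole-v0",
            "num_workers": 2,
            "use_pytorch": True,  # select the Torch policy instead of the TF one
        },
    },
})
```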

raghu1121 commented 5 years ago

Thanks, Eric. I just wondered whether Ray was not using the resources properly. My experience with Keras was that it used the whole GPU, which I thought was actually necessary, but after working with Ray and Tune, it's clear that Ray is efficient and not as resource hungry as Keras.