ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

User has no control over TorchPolicy migrating model to GPU [rllib] #8579

Closed - roireshef closed this issue 4 years ago

roireshef commented 4 years ago

What is the problem?

In the constructor in /usr/local/lib/python3.6/dist-packages/ray/rllib/policy/torch_policy.py, the torch model is cast to the GPU whenever CUDA is available, regardless of the user's preference not to use a GPU. I believe this should be controlled by num_gpus and num_gpus_per_worker. The current setup ends up with incompatibilities when rollout samples are on the CPU while the model is on the GPU, etc.
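
For illustration, here is a minimal, self-contained sketch of the device-selection pattern being described (a paraphrase of the reported behavior, not the exact Ray source):

import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for the policy's model

# Pattern described in the report: the device is chosen purely from CUDA
# availability, with no reference to num_gpus / num_gpus_per_worker.
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

# Rollout samples are built on the CPU, so a GPU-resident model produces a
# device-mismatch error when the two meet:
obs = torch.zeros(1, 4)  # CPU tensor
# model(obs)  # fails with a device mismatch if the model ended up on the GPU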

Ray version and other system information (Python version, TensorFlow version, OS): Ray version is 0.8.5, Torch 1.3.0

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

Use any environment with PyTorch (I use custom Flow environment + A3C), and set num_gpus to 0.

If we cannot run your script, we cannot fix your issue.

ericl commented 4 years ago

Ray sets CUDA_VISIBLE_DEVICES based on those configs to assign GPUs, so this should already work the way you describe.

To clarify, is the report that this restriction is not working as expected on a GPU machine (device being detected despite ray setting the env var)?
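
As a quick way to verify the env-var behavior described here (a standalone sketch; run it in a fresh Python process so CUDA has not been initialized before the variable is set):

import os

# Hide all GPUs before torch touches CUDA; this mirrors what Ray does for
# actors and tasks that were not allocated any GPUs.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import torch

print(torch.cuda.is_available())   # expected: False
print(torch.cuda.device_count())   # expected: 0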

roireshef commented 4 years ago

@ericl - the condition in this line is wrong https://github.com/ray-project/ray/blob/master/rllib/policy/torch_policy.py#L77

Even if CUDA is available, the user doesn't always want to use it. The config param that controls GPU usage, AFAIK, is num_gpus (or num_gpus_per_worker). If it is set to 0, there is no reason to move the model tensors to the CUDA device here.

ericl commented 4 years ago

Hmm, shouldn't that return False if CUDA_VISIBLE_DEVICES is set to an empty string? Ray sets that variable to selectively enable GPUs only for the actors that requested them.


roireshef commented 4 years ago

@ericl - Sorry it took me so long to respond. Are you suggesting I manually unset an environment variable named CUDA_VISIBLE_DEVICES? Or is it a Tune/RLlib config parameter?

Anyway, I thought it would be more natural to change https://github.com/ray-project/ray/blob/14405b90d5457863d71168c613b4961d34f19cc5/rllib/policy/torch_policy.py#L80 so that the device only falls back to CUDA when num_gpus is positive, roughly:

self.device = (torch.device("cuda")
               if torch.cuda.is_available() and config["num_gpus"] > 0
               else torch.device("cpu"))

since it doesn't make sense to discover that a GPU is being used when you have already set num_gpus to 0.

Also, what happens in the case of IMPALA? Was it tested to work? AFAIK, in IMPALA the model should live on the CPU (RAM) for the workers, while it lives on the GPU for the master/optimizer. There might be a need to condition on both num_gpus and num_gpus_per_worker?
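
To make the suggestion concrete, a hedged sketch of the kind of dual condition meant here (illustrative only; a real fix would also have to distinguish the learner from the workers, which this snippet does not do):

import torch

# Illustrative config values; in RLlib these come from the trainer config.
config = {"num_gpus": 0, "num_gpus_per_worker": 0}

# Gate the device on the GPU settings rather than on raw CUDA availability.
use_gpu = torch.cuda.is_available() and (
    config["num_gpus"] > 0 or config["num_gpus_per_worker"] > 0)
device = torch.device("cuda" if use_gpu else "cpu")
print(device)  # cpu when both settings are 0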

ericl commented 4 years ago

since it doesn't make sense to discover that a GPU is being used when you have already set num_gpus to 0.

The reason is that num_gpus only applies to the learner. The workers are configured separately. Ray will ensure the GPUs are made visible as configured, so cuda.is_available() should be the right check wherever the policy is created (copies of the policy are created on both the learner and the workers).

Basically: always use the RLlib config to assign GPUs to the learner or workers, and Ray will do the right thing.
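
For reference, a sketch of the config split being described (keys as they existed in 0.8.x-era RLlib, where use_pytorch was the torch switch; exact keys may differ in later versions):

from ray import tune

# num_gpus applies to the learner (driver) process,
# num_gpus_per_worker to each rollout worker.
config = {
    "env": "CartPole-v0",
    "use_pytorch": True,        # select the torch policy (0.8.x-era key)
    "num_workers": 2,
    "num_gpus": 0,              # GPUs for the learner
    "num_gpus_per_worker": 0,   # GPUs for each rollout worker
}
tune.run("A3C", config=config, stop={"training_iteration": 1})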

roireshef commented 4 years ago

Basically: always use the RLlib config to assign GPUs to the learner or workers, and Ray will do the right thing.

@ericl - I take it that by "RLlib config" you are referring to num_gpus and num_gpus_per_worker. If so, then this is not the case today - if you set both of them to 0, you still get the model parameters moved to the GPU, since cuda.is_available() returns True, and IMHO this is not the desired behavior. It sounds natural to me that num_gpus and num_gpus_per_worker should override cuda.is_available() returning True, since those are the hooks the user has for controlling GPU usage in the config.

Do you agree?

ericl commented 4 years ago

Are you saying that torch.cuda.is_available() returns True for you even if both are set to zero? That's very unexpected. To verify, what does this print for you:

import os
import ray
import torch

@ray.remote(num_gpus=0)
def f():
    print(torch.cuda.is_available())
    print(os.environ.get("CUDA_VISIBLE_DEVICES"))

ray.init()
ray.get(f.remote())

roireshef commented 4 years ago

Are you saying that torch.cuda.is_available() returns True for you even if both are set to zero? That's very unexpected.

Short answer is yes. I can re-verify tomorrow with the newest commit on 0.9dev, but last time I checked (about a month ago), torch.cuda.is_available() returned True even when I set num_gpus=0.

And this makes sense, right? Setting a Ray config param (num_gpus, for instance) shouldn't override PyTorch internals, AFAIK. When you call torch.cuda.is_available(), torch uses its own internal information, which is not exposed to the Ray config. Or did I miss anything...?

ericl commented 4 years ago

The Ray param overrides CUDA_VISIBLE_DEVICES, which should be respected by PyTorch (it is respected by TensorFlow, etc.). There might be some bug where this isn't the case.

ericl commented 4 years ago

This doesn't seem to be the case as of torch 1.4. I'm going to close this; feel free to reopen if you can still reproduce it with a recent torch version.

>>> import torch
>>> torch.__version__
'1.4.0'
>>> torch.cuda.is_available()
True
>>> import torch
>>> import os
>>> @ray.remote(num_gpus=0)
... def f():
...    print(torch.cuda.is_available())
...    print(os.environ.get("CUDA_VISIBLE_DEVICES"))
... 
>>> 
>>> f.remote()
ObjectID(f66d17bae2b0e765ffffffff010000c001000000)
>>> (pid=61816) False