Closed — roireshef closed this issue 4 years ago
Ray sets CUDA_VISIBLE_DEVICES based on those config values to assign GPUs, so this should already work as you describe.
To clarify: is the report that this restriction is not working as expected on a GPU machine (i.e., the device is detected even though Ray has set the env var)?
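As a toy illustration of the masking Ray relies on here: the CUDA runtime reads CUDA_VISIBLE_DEVICES to decide which physical GPUs a process can see, and Ray sets it per actor. The helper below (`visible_gpu_ids` is a hypothetical name, not part of Ray) is a minimal sketch of how that variable is interpreted:

```python
import os

def visible_gpu_ids(env=None):
    """Mimic how the CUDA runtime interprets CUDA_VISIBLE_DEVICES.

    An unset variable means "all devices" (represented here as None);
    an empty string hides every device, which is what Ray sets for
    actors that were allocated zero GPUs.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None  # no restriction: all physical devices visible
    value = value.strip()
    if not value:
        return []    # empty string: no devices visible
    return [int(tok) for tok in value.split(",")]

# A worker that Ray gave one GPU (physical id 2) vs. zero GPUs:
print(visible_gpu_ids({"CUDA_VISIBLE_DEVICES": "2"}))  # [2]
print(visible_gpu_ids({"CUDA_VISIBLE_DEVICES": ""}))   # []
```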
@ericl - the condition on this line is wrong: https://github.com/ray-project/ray/blob/master/rllib/policy/torch_policy.py#L77
Even if CUDA is available, a user doesn't always want to use it. AFAIK, the config parameter that controls whether to use the GPU is `num_gpus` (or `num_gpus_per_worker`). If it is set to 0, there is no reason to move model tensors to a CUDA device here.
Hmm, shouldn't that return False if CUDA_VISIBLE_DEVICES is set to an empty string? Ray sets it that way to selectively enable GPUs only for the actors that requested them.
@ericl - Sorry it took me so long to respond. Are you suggesting I manually unset an environment variable named CUDA_VISIBLE_DEVICES? Or is it a Tune/RLlib config parameter?
Anyway, I thought it would be more natural to change https://github.com/ray-project/ray/blob/14405b90d5457863d71168c613b4961d34f19cc5/rllib/policy/torch_policy.py#L80 to:

```python
if torch.cuda.is_available() and config["num_gpus"] > 0 else torch.device("cpu"))
```

since it doesn't make sense to end up on a GPU when you have already set `num_gpus` to 0.
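The gating being proposed can be sketched as a small standalone function (`pick_device` is a hypothetical name used only for illustration, not RLlib code): the CUDA device is chosen only when CUDA is available *and* the user actually asked for GPUs in the config.

```python
def pick_device(config, cuda_available):
    """Sketch of the proposed condition: use CUDA only when the
    user requested at least one GPU via the RLlib-style config."""
    if cuda_available and config.get("num_gpus", 0) > 0:
        return "cuda"
    return "cpu"

print(pick_device({"num_gpus": 0}, cuda_available=True))   # cpu
print(pick_device({"num_gpus": 1}, cuda_available=True))   # cuda
print(pick_device({"num_gpus": 1}, cuda_available=False))  # cpu
```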
Also, what happens in the case of IMPALA? Was it tested to work? AFAIK, in IMPALA the model should be kept on CPU (RAM) for the workers, while the master/optimizer keeps it on GPU. There might be a need to condition on both `num_gpus` and `num_gpus_per_worker`.
> since it didn't make sense to find out a GPU is used if you already set num_gpus to 0.
The reason this is the case is that `num_gpus` only applies to the learner; the workers are configured separately. Ray will ensure the GPUs are made visible as configured, so `cuda.is_available()` should be the right check wherever the policy is created (copies of the policy are created on both the learner and the workers).
Basically: always use the RLlib config to assign GPUs to the learner or workers, and Ray will do the right thing.
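For example, a config along these lines expresses the split ericl describes (the specific values are illustrative; `num_gpus`, `num_gpus_per_worker`, and `num_workers` are standard RLlib common-config keys of that era):

```python
# Learner vs. worker GPU allocation is expressed entirely in the
# RLlib config; Ray then sets CUDA_VISIBLE_DEVICES per process.
config = {
    "num_gpus": 1,             # GPUs for the learner (driver) process
    "num_gpus_per_worker": 0,  # rollout workers stay on CPU
    "num_workers": 4,          # number of rollout workers
}
```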
> Basically: always use the RLlib config to assign GPUs to the learner or workers, and Ray will do the right thing.
@ericl - I take it that you are referring to `num_gpus` and `num_gpus_per_worker` when you write "RLlib config". If so, then this is not the case today: if you set both of them to 0, the model parameters are still moved to the GPU, since `cuda.is_available()` returns True, and IMHO this is not the desired behavior. It seems natural to me that `num_gpus` and `num_gpus_per_worker` should override `cuda.is_available()` returning True, since those are the hooks the config gives the user to control GPU usage.
Do you agree?
Are you saying that `torch.cuda.is_available()` returns True for you even if both are set to zero? That's very unexpected. To verify, what does this print for you:

```python
import os

import ray
import torch

@ray.remote(num_gpus=0)
def f():
    print(torch.cuda.is_available())
    print(os.environ.get("CUDA_VISIBLE_DEVICES"))

ray.init()
ray.get(f.remote())
```
Short answer is yes. I can re-verify tomorrow with the newest commit on 0.9dev, but the last time I checked (about a month ago), `torch.cuda.is_available()` returned True even when I set `num_gpus=0`.
And this makes sense, right? Setting a Ray config param (`num_gpus`, for instance) shouldn't override PyTorch internals, AFAIK. When you call `torch.cuda.is_available()`, torch uses its own internal information, which is not exposed to the Ray config. Or did I miss anything?
The Ray param overrides CUDA_VISIBLE_DEVICES, which should be respected by PyTorch (it is respected by TensorFlow, etc.). There might be some bug where this isn't the case.
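One class of bug that could explain a True result: CUDA runtimes typically snapshot CUDA_VISIBLE_DEVICES when they first initialize, so a mask set *after* initialization has no effect. The toy class below (not real torch or Ray code; `FakeCudaRuntime` is invented for illustration) sketches that caching behavior:

```python
import os

class FakeCudaRuntime:
    """Toy stand-in for a CUDA runtime that snapshots
    CUDA_VISIBLE_DEVICES the first time it initializes, the way the
    real driver does. Changing the env var afterwards has no effect."""

    def __init__(self):
        self._visible = None
        self._initialized = False

    def is_available(self):
        if not self._initialized:
            # Snapshot the mask on first use; "0" here just means
            # "some device visible" when the variable is unset.
            self._visible = os.environ.get("CUDA_VISIBLE_DEVICES", "0")
            self._initialized = True
        return self._visible != ""

os.environ["CUDA_VISIBLE_DEVICES"] = ""
rt = FakeCudaRuntime()
print(rt.is_available())  # False: the mask was set before first init

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
print(rt.is_available())  # still False: too late, already initialized
```

This is why the env var has to be in place before the worker process first touches CUDA, which is exactly what Ray arranges by setting it at actor startup.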
This doesn't seem to be the case as of torch 1.4. I'm going to close this; feel free to reopen if you can reproduce otherwise with a recent torch version.
```python
>>> import torch
>>> torch.__version__
'1.4.0'
>>> torch.cuda.is_available()
True
>>> import torch
>>> import os
>>> @ray.remote(num_gpus=0)
... def f():
...     print(torch.cuda.is_available())
...     print(os.environ.get("CUDA_VISIBLE_DEVICES"))
...
>>>
>>> f.remote()
ObjectID(f66d17bae2b0e765ffffffff010000c001000000)
>>> (pid=61816) False
```
What is the problem?
In the constructor in /usr/local/lib/python3.6/dist-packages/ray/rllib/policy/torch_policy.py, the torch model is cast to GPU whenever CUDA is available, regardless of a user's preference not to use the GPU. I believe this should be controlled by `num_gpus` and `num_gpus_per_worker`. The current setup ends up with incompatibilities, e.g., rollout samples on CPU while the model is on GPU.

Ray version and other system information (Python version, TensorFlow version, OS): Ray 0.8.5, Torch 1.3.0
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
Use any environment with PyTorch (I use a custom Flow environment + A3C), and set `num_gpus` to 0.
If we cannot run your script, we cannot fix your issue.