Closed orcunderscore closed 2 years ago
I changed the title and added explicit dependencies for easier reproducibility. Can someone please confirm if this is a bug or if I am doing something wrong on my end? Thank you!
Hey @mr-abc-xyz, yes this is a bug! I'm taking a look right now!
What happened + What you expected to happen
The issue I have occurs with GPUs. I have to run
export CUDA_VISIBLE_DEVICES=0
before running my code. I get the error ValueError: '0' is not in list.
I provide a small example of how to reproduce this error further below.
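For context, a message of this shape is what Python's list.index raises when a value is looked up in a list that does not contain it, for example a string ID in a list of integer IDs. The sketch below is only an illustration of that kind of type mismatch and one possible way to guard against it; it is not Ray's code, and the helper name visible_index is hypothetical.

```python
def visible_index(gpu_id, cuda_visible_str):
    """Return the position of gpu_id in a CUDA_VISIBLE_DEVICES-style string.

    Hypothetical helper, not part of Ray. Both sides are compared as
    strings, so it does not matter whether gpu_id arrives as 0 or "0".
    """
    visible = [dev.strip() for dev in cuda_visible_str.split(",") if dev.strip()]
    return visible.index(str(gpu_id))


# Looking up a string ID in a list of integer IDs fails, even though
# the "same" device number is present.
raised = False
try:
    [0].index("0")
except ValueError as exc:
    raised = True
    print(f"ValueError: {exc}")

# With both sides normalized to strings, the lookup succeeds.
print(visible_index("0", "0"))
print(visible_index(1, "0,1"))
```

Normalizing to strings rather than integers is a deliberate choice in this sketch, on the assumption that device IDs are not always plain integers.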
I traced this error back to Ray's train_loop_utils.TorchWorkerProfile.get_device function. Here, the line gpu_ids = ray.get_gpu_ids() yields a list of strings. However, the code further down contains a bug (see the code comments):
Note that the error does not occur without specifying CUDA_VISIBLE_DEVICES; however, Ray then just picks all GPUs and not the one I specify.
Versions / Dependencies
Conda env yaml
conda list
pip list
Reproduction script
Example taken from https://docs.ray.io/en/latest/train/examples/torch_fashion_mnist_example.html and barely adjusted (just removed the argparse).
export CUDA_VISIBLE_DEVICES=0
Issue Severity
High: It blocks me from completing my task.