[Train/Tune] Setting `export CUDA_VISIBLE_DEVICES=0` leads to error `ValueError: '0' is not in list`. List of GPUs is made of integers but checks for a string member. #28467

Closed: orcunderscore closed this issue 2 years ago

orcunderscore commented 2 years ago

What happened + What you expected to happen

The issue occurs with GPUs. I have to run `export CUDA_VISIBLE_DEVICES=0` before running my code, and when I do, I get the error `ValueError: '0' is not in list`.

A small example that reproduces the error is provided further below.

I traced this error back to Ray's `train_loop_utils.TorchWorkerProfile.get_device` function. There, the line `gpu_ids = ray.get_gpu_ids()` yields a list of strings. Further down, however, the code contains a bug (see the comments in the code):

                gpu_id = gpu_ids[0]  # This is a string

                cuda_visible_str = os.environ.get("CUDA_VISIBLE_DEVICES", "")
                if cuda_visible_str and cuda_visible_str != "NoDevFiles":
                    cuda_visible_list = [
                        int(dev) for dev in cuda_visible_str.split(",")
                    ]  # This is a list of integers
                    device_id = cuda_visible_list.index(gpu_id)  # Looking for the position of a string in an array full of integers --> ValueError: '0' is not in list
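
To illustrate the mismatch in isolation, here is a minimal sketch that needs neither Ray nor a GPU. It first triggers the same `ValueError` by searching for a string in a list of integers. It then shows one possible way to avoid it, by comparing both sides as strings. The helper name `logical_device_index` and its parameters are hypothetical and only serve the illustration; this is not Ray's code and not necessarily how the bug was fixed upstream.

```python
import os
from typing import Optional, Union


def logical_device_index(
    gpu_id: Union[int, str], cuda_visible_str: Optional[str] = None
) -> int:
    """Map a GPU ID to its position within CUDA_VISIBLE_DEVICES.

    Hypothetical helper for illustration only; not part of Ray's API.
    """
    if cuda_visible_str is None:
        cuda_visible_str = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    if not cuda_visible_str or cuda_visible_str == "NoDevFiles":
        # Assumption: without a restriction, the ID is used as the index directly.
        return int(gpu_id)
    # Keep the entries as strings and convert the ID to a string as well,
    # so both sides of the lookup have the same type.
    visible = [dev.strip() for dev in cuda_visible_str.split(",")]
    return visible.index(str(gpu_id))


# The mismatch from the snippet above, reduced to one line:
try:
    [int(dev) for dev in "0".split(",")].index("0")  # str searched in list of int
except ValueError as exc:
    error_message = str(exc)

print(error_message)
print(logical_device_index("0", "0"))
print(logical_device_index(2, "1,2,3"))
```

With this approach it does not matter whether the GPU ID arrives as `"0"` or `0`, since both are normalized to strings before the lookup.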

Note that the error does not occur without specifying `CUDA_VISIBLE_DEVICES`; however, all GPUs are then picked up instead of only the one I specify.

Versions / Dependencies

Conda env yaml

# To create a new environment from this file run : "conda env create --file conda_environment_gpu.yaml".
name: ray_test
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - python=3.9.5
  - cudatoolkit=11.3
  - pip
  - pytorch::pytorch=1.12.1
  - pip:
      - ray==2.0.0
      - pandas
      - pyarrow
      - tabulate
      - torchvision
      - pandas

Reproduction script

Example taken from https://docs.ray.io/en/latest/train/examples/torch_fashion_mnist_example.html and barely adjusted (just removed the argparse). Run `export CUDA_VISIBLE_DEVICES=0` before executing the script.

import argparse
from typing import Dict
from ray.air import session

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

import ray.train as train
from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig

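# Download training data from open datasets.
# NOTE: the call arguments were lost from the post; root="~/data" is an assumed path.
training_data = datasets.FashionMNIST(
    root="~/data",
    train=True,
    download=True,
    transform=ToTensor(),
)

# Download test data from open datasets.
test_data = datasets.FashionMNIST(
    root="~/data",
    train=False,
    download=True,
    transform=ToTensor(),
)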

# Define model
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28 * 28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

def train_epoch(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset) // session.get_world_size()
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

def validate_epoch(dataloader, model, loss_fn):
    size = len(dataloader.dataset) // session.get_world_size()
    num_batches = len(dataloader)
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    print(
        f"Test Error: \n "
        f"Accuracy: {(100 * correct):>0.1f}%, "
        f"Avg loss: {test_loss:>8f} \n"
    )
    return test_loss

def train_func(config: Dict):
    batch_size = config["batch_size"]
    lr = config["lr"]
    epochs = config["epochs"]

    worker_batch_size = batch_size // session.get_world_size()

    # Create data loaders.
    train_dataloader = DataLoader(training_data, batch_size=worker_batch_size)
    test_dataloader = DataLoader(test_data, batch_size=worker_batch_size)

    train_dataloader = train.torch.prepare_data_loader(train_dataloader)
    test_dataloader = train.torch.prepare_data_loader(test_dataloader)

    # Create model.
    model = NeuralNetwork()
    model = train.torch.prepare_model(model)

    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    loss_results = []

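    for _ in range(epochs):
        train_epoch(train_dataloader, model, loss_fn, optimizer)
        loss = validate_epoch(test_dataloader, model, loss_fn)
        loss_results.append(loss)
        # Report the loss so that it shows up in result.metrics.
        session.report(dict(loss=loss))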

    # return required for backwards compatibility with the old API
    # TODO(team-ml) clean up and remove return
    return loss_results

def train_fashion_mnist(num_workers=2, use_gpu=False):
    trainer = TorchTrainer(
        train_loop_per_worker=train_func,
        train_loop_config={"lr": 1e-3, "batch_size": 64, "epochs": 4},
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
    )
    result = trainer.fit()
    print(f"Results: {result.metrics}")

if __name__ == "__main__":
    train_fashion_mnist(num_workers=1, use_gpu=True)

Issue Severity

High: It blocks me from completing my task.

orcunderscore commented 2 years ago

I changed the title and added explicit dependencies for easier reproducibility. Can someone please confirm if this is a bug or if I am doing something wrong on my end? Thank you!

amogkam commented 2 years ago

Hey @mr-abc-xyz, yes this is a bug! I'm taking a look right now!