ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

[Core] - local_gpu_idx 0 is not a valid GPU id or is not available. #43866

Open simonsays1980 opened 8 months ago

simonsays1980 commented 8 months ago

What happened + What you expected to happen

What happened

I ran an experiment with 2 T4 GPUs on GCP using PB2 for 500 iterations. Roughly halfway through the experiment, almost all trials errored out with the following error:

2024-03-09 22:28:49,170 ERROR tune_controller.py:1332 -- Trial task failed for trial PCO_ant_velocity_gym_v1_b9588_00006
Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 1875, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 1976, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 1881, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 1822, in ray._raylet.execute_task.function_executor
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/function_manager.py", line 724, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
    return method(self, *_args, **_kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 533, in __init__
    super().__init__(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 161, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
    return method(self, *_args, **_kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 754, in setup
    self.learner_group = self.config.build_learner_group(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm_config.py", line 1091, in build_learner_group
    learner_group = LearnerGroup(config=self, module_spec=rl_module_spec)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/learner/learner_group.py", line 128, in __init__
    self._learner.build()
  File "/tmp/ray/session_2024-03-09_10-37-04_366606_274/runtime_resources/working_dir_files/_ray_pkg_e1227b7bd77871ca/safe_rllib/algorithms/PCO/PCO_learner.py", line 49, in build
    super().build()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/learner/torch/torch_learner.py", line 313, in build
    assert self._local_gpu_idx < torch.cuda.device_count(), (
AssertionError: local_gpu_idx 0 is not a valid GPU id or is  not available.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/worker.py", line 2662, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/worker.py", line 866, in get_objects
    raise value
  File "python/ray/_raylet.pyx", line 2273, in ray._raylet.task_execution_handler
  File "python/ray/_raylet.pyx", line 2169, in ray._raylet.execute_task_with_cancellation_handler
  File "python/ray/_raylet.pyx", line 1824, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 1825, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 2063, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 1100, in ray._raylet.store_task_errors
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::PCO.__init__() (pid=157399, ip=10.138.0.54, actor_id=3e26d63a538c5e4d29ab4a6602000000, repr=PCO)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 533, in __init__
    super().__init__(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 161, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 754, in setup
    self.learner_group = self.config.build_learner_group(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm_config.py", line 1091, in build_learner_group
    learner_group = LearnerGroup(config=self, module_spec=rl_module_spec)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/learner/learner_group.py", line 128, in __init__
    self._learner.build()
  File "/tmp/ray/session_2024-03-09_10-37-04_366606_274/runtime_resources/working_dir_files/_ray_pkg_e1227b7bd77871ca/safe_rllib/algorithms/PCO/PCO_learner.py", line 49, in build
    super().build()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/learner/torch/torch_learner.py", line 313, in build
    assert self._local_gpu_idx < torch.cuda.device_count(), (
AssertionError: local_gpu_idx 0 is not a valid GPU id or is  not available.

One after another, the runs errored out like this. I attached to the cluster and ran nvidia-smi to check the GPUs, but got

Failed to initialize NVML: Unknown Error

I do not know what to do to reduce the risk of such errors, which effectively throw away an experiment that has been running for many hours.
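
The only mitigation I could come up with so far is a pre-flight check like the sketch below, run before (re)starting the experiment, so that I at least do not schedule onto a node whose container has already lost its GPUs. This is just a sketch; the function name and the fractional num_gpus value are placeholders matching my config:

import subprocess

import ray
import torch


@ray.remote(num_gpus=0.2)
def preflight_gpu_check() -> int:
    """Verify that a GPU-assigned Ray worker actually sees its device."""
    # nvidia-smi exits non-zero when NVML cannot be initialized, which is
    # exactly the failure mode observed on the broken node.
    subprocess.run(["nvidia-smi"], check=True)
    # This is the value the torch Learner asserts on during build().
    return torch.cuda.device_count()


if __name__ == "__main__":
    ray.init()
    num_devices = ray.get(preflight_gpu_check.remote())
    assert num_devices > 0, "No CUDA device visible in the GPU worker."
    print("Pre-flight check passed, visible CUDA devices:", num_devices)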

What you expected to happen

That trials run through when the same code is used for running and scheduling them. I also expected GPU training with Ray to be quite stable.

Versions / Dependencies

Ubuntu 20.04, Python 3.9, Ray image rayproject/ray-ml:2.10.0.d8b3d6-py39-gpu

Autoscaler for GCP

Reproduction script

Here is the autoscaler YAML:

# A unique identifier for the head node and workers of this cluster.
cluster_name: ray-gpu

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 0

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
    image: "rayproject/ray-ml:2.10.0.d8b3d6-py39-gpu"
    #image: "rayproject/ray-ml:2.2.0-py39-gpu"
    # image: rayproject/ray:latest-gpu   # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_nvidia_docker" # e.g. ray_docker

    # # Example of running a GPU head with CPU workers
    # head_image: "rayproject/ray-ml:latest-gpu"

    # worker_image: "rayproject/ray-ml:latest"

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: gcp
    region: us-west1
    availability_zone: us-west1-a
    project_id: <PROJECT_ID> # Globally unique project id

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below. This requires that you have added the key into the
# project wide meta-data.
#    ssh_private_key: /path/to/your/key.pem

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray_head_gpu:
        # The resources provided by this node type.
        resources: {"CPU": 48, "GPU": 2}
        # Provider-specific config for the head node, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: n1-custom-48-319488
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 2000
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/ml-images/global/images/c0-deeplearning-common-cu121-v20231209-debian-11
            # Make sure to set scheduling->onHostMaintenance to TERMINATE when GPUs are present
            guestAccelerators:
              - acceleratorType: nvidia-tesla-t4
                acceleratorCount: 2
            metadata:
              items:
                - key: install-nvidia-driver
                  value: "True"
            scheduling:
              - onHostMaintenance: TERMINATE
            serviceAccounts:
              - email: ray-autoscaler-sa-v1@PROJECT_ID.iam.gserviceaccount.com
                scopes:
                - https://www.googleapis.com/auth/cloud-platform

    ray_worker_gpu:
        # The minimum number of nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The resources provided by this node type.
        resources: {"CPU": 2, "GPU": 1}
        # Provider-specific config for the head node, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: n1-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/ml-images/global/images/c0-deeplearning-common-cu121-v20231209-debian-11
            # Make sure to set scheduling->onHostMaintenance to TERMINATE when GPUs are present
            guestAccelerators:
              - acceleratorType: nvidia-tesla-t4
                acceleratorCount: 1
            metadata:
              items:
                - key: install-nvidia-driver
                  value: "True"
            # Run workers on preemptible instances by default.
            # Comment this out to use on-demand.
            scheduling:
              - preemptible: true
              - onHostMaintenance: TERMINATE

# Specify the node type of the head node (as configured above).
head_node_type: ray_head_gpu

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

initialization_commands:
    # Wait until nvidia drivers are installed
    - >-
      timeout 300 bash -c "
          command -v nvidia-smi && nvidia-smi
          until [ \$? -eq 0 ]; do
              command -v nvidia-smi && nvidia-smi
          done"
    - gcloud auth configure-docker us-docker.pkg.dev --quiet

# List of shell commands to run to set up nodes.
# NOTE: rayproject/ray-ml:latest has ray latest bundled
setup_commands: 
  - python -m pip install gcsfs

    # - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp36-cp36m-manylinux2014_x86_64.whl
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
  - pip install google-api-python-client==1.7.8

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
# Make sure to include the dashboard.
head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml
      --include-dashboard=true

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076

Here is the code:

import random

from ray.air.integrations.wandb import WandbLoggerCallback
from ray import train, tune

from ray.rllib.algorithms.ppo.ppo import PPOConfig
from ray.rllib.connectors.env_to_module.mean_std_filter import MeanStdFilter
from ray.rllib.env.single_agent_env_runner import SingleAgentEnvRunner
from ray.tune.schedulers.pb2 import PB2

pb2_scheduler = PB2(
    time_attr="training_iteration",
    metric="episode_reward_mean",
    mode="max",
    perturbation_interval=15,
    # Copy bottom % with top % weights.
    quantile_fraction=0.25,
    hyperparam_bounds={
        "lr": [1e-5, 1e-3],
        "gamma": [0.95, 0.99],
        "lambda": [0.9, 1.0],
        "entropy_coeff": [0.0, 1e-5],
        "clip_param": [0.1, 0.4],
        "vf_loss_coeff": [0.01, 1.0],
    },
)

config = (
    PPOConfig()
    .environment(
        env="CartPole-v1",
    )
    .framework(
        framework="torch",
    )
    .rollouts(
        rollout_fragment_length=10000,
        num_envs_per_worker=1,
        num_rollout_workers=2,
        ignore_worker_failures=True,
        recreate_failed_workers=True,
        env_runner_cls=SingleAgentEnvRunner,
        env_to_module_connector=(lambda env: MeanStdFilter(multi_agent=False)),
    )
    .resources(
        num_gpus_per_learner_worker=0.2,
        num_cpus_for_local_worker=2,
        num_cpus_per_worker=2,
    )
    .experimental(
        _enable_new_api_stack=True,
    )
    .reporting(
        metrics_num_episodes_for_smoothing=50,
    )
    .training(
        lr=tune.sample_from(lambda spec: random.uniform(1e-5, 1e-2)),
        gamma=tune.sample_from(lambda spec: random.uniform(0.95, 0.99)),
        lambda_=tune.sample_from(lambda spec: random.uniform(0.9, 1.0)),
        entropy_coeff=tune.sample_from(lambda spec: random.uniform(0.0, 1e-5)),
        use_kl_loss=True,
        kl_target=0.02,
        clip_param=tune.sample_from(lambda spec: random.uniform(0.1, 0.4)),
        vf_clip_param=float("inf"),
        vf_loss_coeff=tune.sample_from(lambda spec: random.uniform(0.01, 1.0)),
        train_batch_size_per_learner=20000,
        mini_batch_size_per_learner=64,
        num_sgd_iter=10,
        model={
            "fcnet_activation": "tanh",
            "fcnet_hiddens": [64, 64],
        },
    )
    .debugging(
        log_level="DEBUG",
        seed=0
    )
)

if tune.Tuner.can_restore(path="gs://ray-results/"):

    tuner = tune.Tuner.restore(
        path="gs://ray-results-/",
        trainable="PPO",
        param_space=config,
        resume_errored=True,
        resume_unfinished=True,
    )

else:
    tuner = tune.Tuner(
        "PPO",
        param_space=config,
        tune_config=tune.TuneConfig(
            num_samples=10,
            scheduler=pb2_scheduler,
        ),
        run_config=train.RunConfig(
            storage_path="gs://ray-results-2022-09-01/",
            sync_config=train.SyncConfig(
                # Note: Cloud Storage is the biggest cost driver.
                # Less frequent syncing might work here to save costs.
                sync_period=600,
            ),
            checkpoint_config=train.CheckpointConfig(
                checkpoint_at_end=True,
                # These settings are also intended to save Cloud Storage costs.
                checkpoint_frequency=5,
                num_to_keep=5,
            ),
            stop={"training_iteration": 500},
            name="ppo_with_pb2_hps_search",
        ),
    )

tuner.fit()

Issue Severity

High: It blocks me from completing my task.

anyscalesam commented 7 months ago

@justinvyu how did you confirm that this was not a Tune issue?

justinvyu commented 7 months ago

The logs (Failed to initialize NVML: Unknown Error) point to a setup/hardware issue where torch.cuda.device_count() is returning an incorrect number of devices.

A more minimal repro is required to isolate the problem.
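
For example (just a sketch, with placeholder resource values), a bare GPU task outside of RLlib/Tune that periodically reports what torch and NVML see would already show whether the device disappears independently of the Learner code:

import subprocess
import time

import ray
import torch


# max_calls=1 forces a fresh worker process for every probe so that cached
# CUDA state from a previous call cannot mask a failure.
@ray.remote(num_gpus=0.2, max_calls=1)
def probe_gpu() -> int:
    # The same quantity that torch_learner.py asserts on before building.
    num_devices = torch.cuda.device_count()
    # The same command that returned "Failed to initialize NVML: Unknown Error"
    # on the affected node; check=False so the device count is still returned.
    subprocess.run(["nvidia-smi"], check=False)
    return num_devices


if __name__ == "__main__":
    ray.init()
    # Probe periodically over the lifetime of a long-running experiment.
    for _ in range(12):
        print("torch.cuda.device_count() =", ray.get(probe_gpu.remote()))
        time.sleep(600)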

simonsays1980 commented 7 months ago

@anyscalesam Even though I opened the issue, I do not consider it an RLlib one either, as the error rather points to the setup, and the setup was created with the autoscaler for Google Cloud Platform.

jjyao commented 6 months ago

@stephanie-wang why is this a core issue?