ray-project / ray

[Ray Cluster on Google Cloud] Example YAML script breaks with A100 GPU type #44308

Open jaanphare opened 5 months ago

jaanphare commented 5 months ago

What happened + What you expected to happen

  1. While the example GPU + docker script works (https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/gcp/example-gpu-docker.yaml), a slight modification does not.

  2. The specific modifications to the script:

    • changing the project_id
    • changing the Docker image to a newer one: image: "rayproject/ray:latest-py310-cu121" (from Docker Hub: https://hub.docker.com/r/rayproject/ray/tags)
    • changing the Google Cloud compute image to a newer one, taken from the output of this command:

      gcloud compute images list \
        --project deeplearning-platform-release \
        --format="value(NAME)" \
        --no-standard-images | grep pytorch

      I would expect this to work (the equivalent change works fine on AWS).

  3. Error message:


/usr/bin/nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

/usr/bin/nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

/usr/bin/nvidia-smi
Tue Mar 26 21:04:33 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0              50W / 400W |      4MiB / 40960MiB |     27%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
Connection to 34.83.158.35 closed.
  [5/7] Initializing command runner
Warning: Permanently added '34.83.158.35' (ED25519) to the list of known hosts.
Shared connection to 34.83.158.35 closed.
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Post "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/images/create?fromImage=rayproject%2Fray&tag=latest-py310-cu121": dial unix /var/run/docker.sock: connect: permission denied
Shared connection to 34.83.158.35 closed.
2024-03-26 17:04:40,002 INFO node.py:337 -- wait_for_compute_zone_operation: Waiting for operation operation-1711487079713-61496a36c89a5-13f0ab4d-1a1fe1ad to finish...
2024-03-26 17:04:45,324 INFO node.py:356 -- wait_for_compute_zone_operation: Operation operation-1711487079713-61496a36c89a5-13f0ab4d-1a1fe1ad finished.
  New status: update-failed
  !!!
  Full traceback: Traceback (most recent call last):
  File "/Users/me/projects/.venv/lib/python3.11/site-packages/ray/autoscaler/_private/updater.py", line 159, in run
    self.do_update()
  File "/Users/me/projects/.venv/lib/python3.11/site-packages/ray/autoscaler/_private/updater.py", line 451, in do_update
    self.cmd_runner.run_init(
  File "/Users/me/projects/.venv/lib/python3.11/site-packages/ray/autoscaler/_private/command_runner.py", line 722, in run_init
    self.run(
  File "/Users/me/projects/.venv/lib/python3.11/site-packages/ray/autoscaler/_private/command_runner.py", line 493, in run
    return self.ssh_command_runner.run(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/me/projects/.venv/lib/python3.11/site-packages/ray/autoscaler/_private/command_runner.py", line 383, in run
    return self._run_helper(final_cmd, with_output, exit_on_fail, silent=silent)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/me/projects/.venv/lib/python3.11/site-packages/ray/autoscaler/_private/command_runner.py", line 298, in _run_helper
    raise click.ClickException(fail_msg) from None
click.exceptions.ClickException: SSH command failed.

  Error message: SSH command failed.
  !!!

  Failed to setup head node.
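
For context, the step that actually fails in the log above is the Docker image pull over SSH ("permission denied ... /var/run/docker.sock"); the nvidia-smi checks eventually succeed. Assuming the head node is still reachable over SSH as the "ubuntu" user with the autoscaler-generated key, a minimal check from the workstation would look like this (these commands are my own sketch, not part of the original run):

      # Hypothetical checks against the head node IP shown in the log above
      ssh ubuntu@34.83.158.35 'id -nG'        # is "docker" among the user's groups?
      ssh ubuntu@34.83.158.35 'docker info'   # hits the same permission error if it is not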

Versions / Dependencies

Python 3.10 and CUDA 12.1.

Reproduction script

The YAML cluster configuration, modified from https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/gcp/example-gpu-docker.yaml:

# From https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/gcp/example-gpu-docker.yaml

# A unique identifier for the head node and workers of this cluster.
cluster_name: gpu-docker

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 2

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes, then the autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
    image: "rayproject/ray:latest-py310-cu121"
    # image: rayproject/ray-ml:latest-gpu   # use this one if you need ML dependencies, but it's slower to pull
    container_name: "ray_nvidia_docker" # e.g. ray_docker

    # # Example of running a GPU head with CPU workers
    # head_image: "rayproject/ray-ml:latest-gpu"

    # worker_image: "rayproject/ray-ml:latest"

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: gcp
    region: us-west1
    availability_zone: us-west1-b
    project_id: null # Replace this with your globally unique project id

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below. This requires that you have added the key into the
# project wide meta-data.
#    ssh_private_key: /path/to/your/key.pem

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray_head_gpu:
        # The resources provided by this node type.
        resources: {"CPU": 12, "GPU": 1}
        # Provider-specific config for the head node, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: a2-highgpu-1g
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 140
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/deeplearning-platform-release/global/images/pytorch-latest-cu121-v20240319-ubuntu-2004-py310
            # Make sure to set scheduling->onHostMaintenance to TERMINATE when GPUs are present
            # guestAccelerators:
            #   - acceleratorType: nvidia-tesla-t4
            #     acceleratorCount: 1
            metadata:
              items:
                - key: install-nvidia-driver
                  value: "True"
            scheduling:
              - onHostMaintenance: TERMINATE

    ray_worker_gpu:
        # The minimum number of nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The resources provided by this node type.
        resources: {"CPU": 12, "GPU": 1}
        # Provider-specific config for worker nodes of this type, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: a2-highgpu-1g
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 140
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/deeplearning-platform-release/global/images/pytorch-latest-cu121-v20240319-ubuntu-2004-py310
            # Make sure to set scheduling->onHostMaintenance to TERMINATE when GPUs are present
            # guestAccelerators:
            #   - acceleratorType: nvidia-tesla-t4
            #     acceleratorCount: 1
            metadata:
              items:
                - key: install-nvidia-driver
                  value: "True"
            # Run workers on preemptible instances by default.
            # Comment this out to use on-demand.
            scheduling:
              - preemptible: true
              - onHostMaintenance: TERMINATE

# Specify the node type of the head node (as configured above).
head_node_type: ray_head_gpu

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

initialization_commands:
    # Wait until nvidia drivers are installed
    - >-
      timeout 300 bash -c "
          command -v nvidia-smi && nvidia-smi
          until [ \$? -eq 0 ]; do
              command -v nvidia-smi && nvidia-smi
          done"

# List of shell commands to run to set up nodes.
# NOTE: rayproject/ray-ml:latest has ray latest bundled
setup_commands: []
    # - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
  - pip install google-api-python-client==1.7.8

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076

Issue Severity

High: It blocks me from completing my task.

anyscalesam commented 4 months ago

More context in https://groups.google.com/g/google-dl-platform/c/cYvzQuM5JY0 and https://ray-distributed.slack.com/archives/CN2RGCHRR/p1711487521436029 ... @aslonnie, is there anything we can do to make sure the next batch of Ray OSS images with CUDA has the dependency correctly configured?

aslonnie commented 4 months ago

@anyscalesam, the error seems to be a permission-denied failure on Docker daemon socket access; it does not seem to be related to CUDA.
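
For anyone hitting the same Docker socket error: a workaround sketch (my assumption, not something confirmed in this thread) is to add the SSH user to the docker group before the autoscaler pulls the image, e.g. by extending initialization_commands in the cluster YAML:

      initialization_commands:
          # Workaround sketch (assumption, not verified here): allow the "ubuntu"
          # SSH user to talk to the Docker daemon without sudo. Group membership
          # only applies to new login sessions, so the shared SSH connection may
          # need to be re-established before it takes effect.
          - sudo usermod -aG docker $USER
          # Keep the existing wait-for-NVIDIA-driver command from the example as well.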

aslonnie commented 4 months ago

This is the OSS cluster launcher; it should be routed to the Core team. @jjyao