ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Failing on `ray down` #22341

Open · mataney opened this issue 2 years ago

mataney commented 2 years ago

Hi, thanks for the great work. I ran a job on a few worker nodes and everything worked well, but when I ran `ray down` it failed because of this code. Indeed, my worker nodes don't have `compute` or `tpu` in their machine names.

Is this a bug?

Thanks again.

mataney commented 2 years ago

Changing my Ray version back to the previous version it worked with (1.3.0) fixed this. Am I doing something wrong?

rkooo567 commented 2 years ago

What's the version of Ray that has the issue?

rkooo567 commented 2 years ago

cc @DmitriGekhtman

DmitriGekhtman commented 2 years ago

Hi @mataney, that certainly sounds like a bug. Could you share some details, such as the Ray version and the cluster config you're using?

mataney commented 2 years ago

Hi, I'm using Ray 1.10.0.

This is the config YAML:

cluster_name: gpucluster
max_workers: 50
upscaling_speed: 2.0
idle_timeout_minutes: 10
docker:
   image: "rayproject/ray:latest-gpu"
   container_name: "ray_container"

provider:
    type: gcp
    region: us-west1
    availability_zone: us-west1-b
    project_id: PROJECT_ID
auth:
    ssh_user: ray
available_node_types:
    head_node:
        min_workers: 0
        max_workers: 0
        resources: {"CPU": 16, "GPU": 0}
        node_config:
            machineType: n1-highmem-16
            tags:
              - items: ["allow-all"]
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 100
                  sourceImage: projects/deeplearning-platform-release/global/images/family/common-cu113
            guestAccelerators:
              - acceleratorType: projects/PROJECT_ID/zones/us-west1-b/acceleratorTypes/nvidia-tesla-p100
                acceleratorCount: 1
            metadata:
              items:
                - key: install-nvidia-driver
                  value: "True"
            scheduling:
              - onHostMaintenance: "terminate"
              - automaticRestart: true
    worker_node:
        min_workers: 10
        resources: {"CPU": 4, "GPU": 1}
        node_config:
            machineType: n1-highmem-4
            tags:
              - items: ["allow-all"]
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 100
                  sourceImage: projects/deeplearning-platform-release/global/images/family/common-cu113
            scheduling:
              - preemptible: false
            guestAccelerators:
              - acceleratorType: projects/PROJECT_ID/zones/us-west1-b/acceleratorTypes/nvidia-tesla-p100
                acceleratorCount: 1
            metadata:
              items:
                - key: install-nvidia-driver
                  value: "True"
            scheduling:
              - onHostMaintenance: "terminate"
              - automaticRestart: true

head_node_type: head_node

file_mounts: {
    "/home/ray/A.json": "A.json",
    "/home/ray/B.json": "B.json",
    "/home/ray/C.json": "C.json",
}

cluster_synced_files: []
file_mounts_sync_continuously: False
rsync_exclude:
    - "**/.git"
    - "**/.git/**"
rsync_filter:
    - ".gitignore"
initialization_commands: []
setup_commands:
  - pip3 install ray
head_setup_commands:
  - sudo chown ray ~/ray_bootstrap_key.pem
  - sudo chown ray ~/ray_bootstrap_config.yaml
  - pip3 install google-api-python-client==1.7.8
worker_setup_commands: []
head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml
      --object-store-memory=1000000000
worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      AUTOSCALER_MAX_NUM_FAILURES=9999 ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076
      --object-store-memory=1000000000

I'll try to reproduce it later, but if I recall correctly, it is complaining that 022318ed (see explanation below) is not a valid GCPNodeType.

What happens here?

This is the Python code I linked to in the original message:

GCPNodeType(name.split("-")[-1])

Assume the instance name is `ray-gpucluster-worker-022318ed` (it should really be `ray-gpucluster-worker-022318ed-compute`). Because the name is not suffixed with `-compute`, the last item returned by `name.split("-")` is `022318ed` rather than `compute`, so constructing the `GCPNodeType` fails.
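
A minimal sketch of the failure, assuming `GCPNodeType` is an enum whose values are the instance-name suffixes `compute` and `tpu` (the enum and helper below are illustrative stand-ins, not Ray's actual code):

from enum import Enum

# Illustrative stand-in for Ray's GCPNodeType (assumption: the enum values
# match the expected instance-name suffixes).
class GCPNodeType(Enum):
    COMPUTE = "compute"
    TPU = "tpu"

def node_type_from_instance_name(name: str) -> GCPNodeType:
    # Mirrors the parsing the issue points at: take the last "-"-separated
    # token of the instance name and treat it as the node type.
    return GCPNodeType(name.split("-")[-1])

# Works when the expected suffix is present:
node_type_from_instance_name("ray-gpucluster-worker-022318ed-compute")  # GCPNodeType.COMPUTE

# Fails for the instance names reported here, which lack the suffix:
node_type_from_instance_name("ray-gpucluster-worker-022318ed")
# ValueError: '022318ed' is not a valid GCPNodeType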

DmitriGekhtman commented 2 years ago

@wuisawesome could you look into reproducing the issue?

stale[bot] commented 2 years ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity within the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public Slack channel.

wuisawesome commented 2 years ago

@cadedaniel are you looking into these types of issues right now?

stale[bot] commented 2 years ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity within the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public Slack channel.

anyscalesam commented 4 months ago

@mataney is this still occurring for you?