mataney opened this issue 2 years ago
Changing my ray version back to the previous version it worked for (1.3.0) fixed this.
Am I doing something wrong?
What's the version of ray that has the issue?
cc @DmitriGekhtman
Hi @mataney, certainly sounds like a bug. Could you share some details, such as the Ray version you're using and your cluster config?
Hi, I'm using Ray 1.10.0. This is the config yaml:
cluster_name: gpucluster

max_workers: 50
upscaling_speed: 2.0
idle_timeout_minutes: 10

docker:
    image: "rayproject/ray:latest-gpu"
    container_name: "ray_container"

provider:
    type: gcp
    region: us-west1
    availability_zone: us-west1-b
    project_id: PROJECT_ID

auth:
    ssh_user: ray

available_node_types:
    head_node:
        min_workers: 0
        max_workers: 0
        resources: {"CPU": 16, "GPU": 0}
        node_config:
            machineType: n1-highmem-16
            tags:
              - items: ["allow-all"]
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                    diskSizeGb: 100
                    sourceImage: projects/deeplearning-platform-release/global/images/family/common-cu113
            guestAccelerators:
              - acceleratorType: projects/PROJECT_ID/zones/us-west1-b/acceleratorTypes/nvidia-tesla-p100
                acceleratorCount: 1
            metadata:
                items:
                  - key: install-nvidia-driver
                    value: "True"
            scheduling:
              - onHostMaintenance: "terminate"
              - automaticRestart: true
    worker_node:
        min_workers: 10
        resources: {"CPU": 4, "GPU": 1}
        node_config:
            machineType: n1-highmem-4
            tags:
              - items: ["allow-all"]
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                    diskSizeGb: 100
                    sourceImage: projects/deeplearning-platform-release/global/images/family/common-cu113
            guestAccelerators:
              - acceleratorType: projects/PROJECT_ID/zones/us-west1-b/acceleratorTypes/nvidia-tesla-p100
                acceleratorCount: 1
            metadata:
                items:
                  - key: install-nvidia-driver
                    value: "True"
            scheduling:
              - preemptible: false
              - onHostMaintenance: "terminate"
              - automaticRestart: true

head_node_type: head_node

file_mounts: {
    "/home/ray/A.json": "A.json",
    "/home/ray/B.json": "B.json",
    "/home/ray/C.json": "C.json",
}

cluster_synced_files: []
file_mounts_sync_continuously: False

rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

initialization_commands: []

setup_commands:
    - pip3 install ray

head_setup_commands:
    - sudo chown ray ~/ray_bootstrap_key.pem
    - sudo chown ray ~/ray_bootstrap_config.yaml
    - pip3 install google-api-python-client==1.7.8

worker_setup_commands: []

head_start_ray_commands:
    - ray stop
    - >-
        ulimit -n 65536;
        ray start
        --head
        --port=6379
        --object-manager-port=8076
        --autoscaling-config=~/ray_bootstrap_config.yaml
        --object-store-memory=1000000000

worker_start_ray_commands:
    - ray stop
    - >-
        ulimit -n 65536;
        AUTOSCALER_MAX_NUM_FAILURES=9999 ray start
        --address=$RAY_HEAD_IP:6379
        --object-manager-port=8076
        --object-store-memory=1000000000
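For reference, a quick sketch of how this config relates to the instance names discussed below (assumptions: the config above is saved as cluster.yaml and PyYAML is installed; neither filename is from the thread):

    import yaml  # PyYAML

    with open("cluster.yaml") as f:
        config = yaml.safe_load(f)

    # The worker instances in this report are named
    # "ray-<cluster_name>-worker-<hash>", e.g. "ray-gpucluster-worker-022318ed",
    # so the prefix comes from cluster_name below.
    print(config["cluster_name"])                # gpucluster
    print(list(config["available_node_types"]))  # ['head_node', 'worker_node']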
I'll try to reproduce it later, but if I recall correctly, it is complaining that 022318ed (see explanation below) is not a valid GCPNodeType.
What happens here?
This is the Python code I linked to in the original message:
GCPNodeType(name.split("-")[-1])
Assume the name of the instance is ray-gpucluster-worker-022318ed (when it should really be ray-gpucluster-worker-022318ed-compute). Because the instance name is not suffixed with -compute, name.split("-")[-1] returns 022318ed instead of compute, and constructing a GCPNodeType from "022318ed" raises a ValueError.
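Here is a minimal, self-contained sketch of the failure. The two-member enum below is a stand-in that I'm assuming matches Ray's real GCPNodeType (in ray/autoscaler/_private/gcp/node.py); the parsing line mirrors the one quoted above:

    from enum import Enum

    # Simplified stand-in for Ray's GCPNodeType enum; the real enum
    # uses the same "compute" and "tpu" values.
    class GCPNodeType(Enum):
        COMPUTE = "compute"
        TPU = "tpu"

    def name_to_type(name: str) -> GCPNodeType:
        # The node type is inferred from the last "-"-separated
        # token of the instance name.
        return GCPNodeType(name.split("-")[-1])

    print(name_to_type("ray-gpucluster-worker-022318ed-compute"))  # GCPNodeType.COMPUTE
    name_to_type("ray-gpucluster-worker-022318ed")  # ValueError: '022318ed' is not a valid GCPNodeType

So any node whose name lacks the expected -compute/-tpu suffix (e.g. one launched by an older Ray, as in this report) makes ray down trip over this line.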
@wuisawesome could you look into reproducing the issue?
Hi, I'm a bot from the Ray team :)
To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.
If there is no further activity within the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public Slack channel.
@cadedaniel are you looking into these types of issues right now?
@mataney is this still occurring for you?
Hi, thanks for the great work. I ran a job on a few worker nodes and everything worked well, but when I ran ray down it failed because of this code. Indeed, my worker nodes don't have compute or tpu in their machine names. Is this a bug?
Thanks again.