ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[VM launcher] Ran `ray status` after I SSHed into the head node and it printed "No cluster status" #35017

Open scottsun94 opened 1 year ago

scottsun94 commented 1 year ago

What happened + What you expected to happen

The head node started successfully:

Local node IP: 172.31.62.187

--------------------
Ray runtime started.
--------------------

Next steps
  To add another node to this Ray cluster, run
    ray start --address='172.31.62.187:6379'

  To connect to this Ray cluster:
    import ray
    ray.init()

  To submit a Ray job using the Ray Jobs CLI:
    RAY_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python my_script.py

  See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html
  for more information on submitting Ray jobs to the Ray cluster.

  To terminate the Ray runtime, run
    ray stop

  To view the status of the cluster, use
    ray status

  To monitor and debug Ray, view the dashboard at
    127.0.0.1:8265

  If connection to the dashboard fails, check your firewall settings and network configuration.
Shared connection to 34.223.114.236 closed.
  New status: up-to-date

I ran `ray status` after I SSHed into the head node, and it printed "No cluster status".

Last login: Wed May  3 21:47:54 2023 from {my laptop ip}
ubuntu@ip-172-31-62-187:~$ ray status
No cluster status.
ubuntu@ip-172-31-62-187:~$ exit

The cluster YAML file is attached below.

cluster_name: 0503-3
max_workers: 2
provider:
    type: aws
    region: us-west-2
    cache_stopped_nodes: True
auth:
    ssh_user: ubuntu
available_node_types:
    ray.head.default:
        node_config:
            InstanceType: m5.2xlarge
    ray.worker.default:
        min_workers: 2
        max_workers: 2
        node_config:
            InstanceType: m5.2xlarge
head_node_type: ray.head.default
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --temp-dir=~/ray_temp_logs/
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
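For context, this is roughly how the cluster is launched and inspected with the config above (a minimal sketch; `cluster.yaml` is an assumed file name for the config shown here, not necessarily the exact commands I ran):

    # Launch (or update) the cluster from the laptop using the config above.
    ray up -y cluster.yaml

    # Open an SSH session on the head node, where I then ran `ray status`...
    ray attach cluster.yaml

    # ...or run the same check remotely without an interactive shell.
    ray exec cluster.yaml 'ray status'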

Versions / Dependencies

See the YAML file above.

Reproduction script

See the YAML file above.

Issue Severity

High: It blocks me from completing my task.

scottsun94 commented 1 year ago

cc: @gvspraveen @wuisawesome

wuisawesome commented 1 year ago

In the absence of error messages, I'm assuming this is a race condition where `ray status` is run before the autoscaler is fully up.

I assume this should get fixed in the autoscaler refactor? @scv119
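If it is a startup race, one rough workaround on the head node (a sketch only; the retry count and sleep are arbitrary) is to poll `ray status` until it reports something other than "No cluster status":

    # Poll `ray status` until the autoscaler reports a status; the retry and
    # sleep values below are arbitrary and only illustrate the idea.
    for i in $(seq 1 30); do
        out="$(ray status 2>&1)"
        if ! printf '%s' "$out" | grep -q "No cluster status"; then
            printf '%s\n' "$out"
            break
        fi
        sleep 2
    done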