skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[Core] `grpcio` from conda-forge on remote VM can cause failure in starting ray cluster #2605

Open Michaelvll opened 1 year ago

Michaelvll commented 1 year ago

When a VM has grpcio==1.51.3 installed in a custom image, starting the Ray cluster can fail with the error below. After downgrading grpcio to 1.51.1, the Ray cluster starts normally.

$ ray stop; RAY_SCHEDULER_EVENTS=0 RAY_DEDUP_LOGS=0 ray start --disable-usage-stats --head --port=6380 --dashboard-port=8266 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --temp-dir /tmp/ray_skypilot/
Stopped all 3 Ray processes.
Usage stats collection is disabled.

Local node IP: 10.128.0.17
2023-09-25 16:11:38,229 ERROR services.py:1197 -- Failed to start the dashboard , return code -11
2023-09-25 16:11:38,230 ERROR services.py:1222 -- Error should be written to 'dashboard.log' or 'dashboard.err'. We are printing the last 20 lines for you. See 'https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure' to find where the log file is.
2023-09-25 16:11:38,230 ERROR services.py:1266 -- 
The last 20 lines of /tmp/ray_skypilot/session_2023-09-25_16-11-35_935010_2994/logs/dashboard.log (it contains the error message from the dashboard): 
2023-09-25 16:11:38,077 INFO head.py:239 -- Starting dashboard metrics server on port 44227

--------------------
Ray runtime started.
--------------------

Next steps
  To add another node to this Ray cluster, run
    ray start --address='10.128.0.17:6380'

  To connect to this Ray cluster:
    import ray
    ray.init()

  To terminate the Ray runtime, run
    ray stop

  To view the status of the cluster, use
    ray status
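
The `return code -11` in the log above is worth decoding. Assuming Ray reports the dashboard's exit status using Python's `subprocess` convention (a negative value means the process was killed by that signal number), -11 corresponds to signal 11, i.e. a segmentation fault, which would point at a native library such as grpcio. The helper below is a hypothetical sketch, not part of Ray or SkyPilot:

```python
import signal


def describe_return_code(return_code: int) -> str:
    """Translate a subprocess-style return code into a readable message.

    Assumption: negative values follow Python's subprocess convention,
    where -N means the child was terminated by signal N.
    """
    if return_code < 0:
        try:
            name = signal.Signals(-return_code).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"unknown signal {-return_code}"
        return f"killed by {name}"
    if return_code == 0:
        return "exited normally"
    return f"exited with status {return_code}"


if __name__ == "__main__":
    # The dashboard return code from the log above.
    print(describe_return_code(-11))
```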

This issue seems hard to reproduce by manually installing grpcio==1.51.3 on the VM. We should test whether creating a custom image triggers it.

Related issues: https://github.com/ray-project/ray/issues/35383 and https://github.com/ray-project/ray/issues/34662. It seems the grpcio build from conda-forge causes this issue.

To reproduce:

  1. Add `conda install -c conda-forge -y grpcio=1.51.1;` before https://github.com/skypilot-org/skypilot/blob/61cb5e4314b65003421437ca5f90c43bc46dd7d5/sky/templates/gcp-ray.yml.j2#L309
  2. Run `sky launch --cloud gcp --cpus 2`

Michaelvll commented 1 year ago

Confirmed that adding pip install --force-reinstall grpcio==1.51.3 after https://github.com/skypilot-org/skypilot/blob/61cb5e4314b65003421437ca5f90c43bc46dd7d5/sky/templates/gcp-ray.yml.j2#L310 fixes the issue.

Thus, the issue is related to the problematic grpcio in conda-forge.
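
Since the culprit appears to be where grpcio came from rather than its version alone, it can help to check which installer put the package on the VM. The sketch below uses the standard-library `importlib.metadata` to read the `INSTALLER` record from the package metadata. It assumes that conda-installed packages record `conda` there and pip-installed ones record `pip`, which is typical but not guaranteed; `package_origin` is a hypothetical helper name.

```python
from importlib import metadata
from typing import Optional, Tuple


def package_origin(name: str) -> Optional[Tuple[str, str]]:
    """Return (version, installer) for an installed package, or None.

    Assumption: the INSTALLER file in the package metadata names the
    tool that installed it (e.g. "pip" or "conda"). If the file is
    missing, "unknown" is reported instead.
    """
    try:
        dist = metadata.distribution(name)
    except metadata.PackageNotFoundError:
        return None
    installer = (dist.read_text("INSTALLER") or "unknown").strip()
    return dist.version, installer


if __name__ == "__main__":
    origin = package_origin("grpcio")
    if origin is None:
        print("grpcio is not installed in this environment")
    else:
        version, installer = origin
        print(f"grpcio {version} installed by {installer}")
```

Run it with the same Python interpreter that Ray uses on the VM. If it reports `conda` as the installer, the `pip install --force-reinstall` workaround above is the one confirmed in this thread.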