Open Michaelvll opened 1 year ago
Confirmed that adding `pip install --force-reinstall grpcio==1.51.3` after https://github.com/skypilot-org/skypilot/blob/61cb5e4314b65003421437ca5f90c43bc46dd7d5/sky/templates/gcp-ray.yml.j2#L310 fixes the issue. Thus, the issue is related to the problematic `grpcio` build in conda-forge.
When a VM has `grpcio==1.51.3` installed in a custom image, it can trigger the following issue when starting the ray cluster. After downgrading `grpcio` to `1.51.1`, the ray cluster can be started normally.

This issue seems hard to reproduce by manually installing `grpcio==1.51.3` on the VM. We should test whether creating a custom image triggers the issue.

Related issues: https://github.com/ray-project/ray/issues/35383 https://github.com/ray-project/ray/issues/34662

It seems the `grpcio` build shipped by conda-forge causes this issue.
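Since the suspicion is about where the `grpcio` build came from rather than only its version, it may help to check both on an affected VM. Below is a minimal sketch, not part of SkyPilot: the helper name `describe_install` is hypothetical, and it assumes the package's `INSTALLER` metadata file is present (pip writes `pip` there; conda typically writes `conda`, but that is an assumption worth verifying on the image in question).

```python
from importlib import metadata


def describe_install(package: str = "grpcio"):
    """Return (version, installer) for `package`, or None if it is not installed.

    Hypothetical helper for debugging; not part of SkyPilot or Ray.
    """
    try:
        dist = metadata.distribution(package)
    except metadata.PackageNotFoundError:
        return None
    # INSTALLER is an optional metadata file. read_text() returns None when
    # the file is missing, in which case we report "unknown".
    installer = (dist.read_text("INSTALLER") or "unknown").strip()
    return dist.version, installer


if __name__ == "__main__":
    info = describe_install("grpcio")
    if info is None:
        print("grpcio is not installed in this environment")
    else:
        version, installer = info
        print(f"grpcio {version} (installed by: {installer})")
```

Running this with the same Python interpreter that ray uses, before and after the `pip install --force-reinstall` step, should show whether the installer changes.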
To reproduce:
1. Add `conda install -c conda-forge -y grpcio=1.51.1;` before https://github.com/skypilot-org/skypilot/blob/61cb5e4314b65003421437ca5f90c43bc46dd7d5/sky/templates/gcp-ray.yml.j2#L309
2. Run `sky launch --cloud gcp --cpus 2`