ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
34.23k stars 5.81k forks

Version mismatch error #44063

Open mforhad opened 8 months ago

mforhad commented 8 months ago

What happened + What you expected to happen

I am getting the following version mismatch error:

RuntimeError: Version mismatch: The cluster was started with:
    Ray: 2.9.3
    Python: 3.10.13
This process on node xx.xxx.x.x was started with:
    Ray: 2.9.3
    Python: 3.10.12

I have the following YAML file:

cluster_name: minimal

provider:
    type: gcp
    region: europe-west4
    availability_zone: europe-west4-a
    project_id: project_name_xxx
min_workers: 1
max_workers: 4
min_workers_python: "3.10.12"
max_workers_python: "3.10.12"

I am using Google Cloud to create this cluster. Despite this error, the node was added to the cluster, but I cannot submit a job to it: submitting the job fails with the same error.

I really appreciate your help.

Versions / Dependencies

RuntimeError: Version mismatch: The cluster was started with:
    Ray: 2.9.3
    Python: 3.10.13
This process on node xx.xxx.x.x was started with:
    Ray: 2.9.3
    Python: 3.10.12
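To confirm which interpreter each machine actually runs, a small check like the one below could be run on the head node and on each worker. This is only a sketch using the standard library: `report_versions` is a hypothetical helper name, and the Ray lookup assumes Ray was installed under the distribution name `ray`.

```python
import platform
from importlib import metadata


def report_versions():
    """Return the local Python and Ray versions as strings.

    Hypothetical helper for illustration; not part of Ray.
    """
    try:
        ray_version = metadata.version("ray")
    except metadata.PackageNotFoundError:
        # Ray is not installed in this interpreter's environment.
        ray_version = None
    return {"python": platform.python_version(), "ray": ray_version}


if __name__ == "__main__":
    print(report_versions())
```

Comparing the output from each node shows whether the micro (third) component of the Python version differs, as it does in the error above.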

Reproduction script

config.yaml

cluster_name: minimal

provider:
    type: gcp
    region: europe-west4
    availability_zone: europe-west4-a
    project_id: project_id_xxx  # Add your GCP project ID here
min_workers: 1
max_workers: 4
min_workers_python: "3.10.12"  # Specify the Python version for minimum workers
max_workers_python: "3.10.12"  # Specify the Python version for maximum workers

script.py

import ray

@ray.remote
def hello_world():
    return "hello world"

ray.init()
print(ray.get(hello_world.remote()))

Issue Severity

High: It blocks me from completing my task.

davideuler commented 8 months ago

I've come across the same error on Ray 2.10.0. Both the head and the worker run Python 3.10, but their patch versions differ. I run the Ray cluster on Kubernetes.

  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/scripts/scripts.py", line 954, in start
    node.check_version_info()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/node.py", line 396, in check_version_info
    ray._private.utils.check_version_info(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/utils.py", line 1595, in check_version_info
    raise RuntimeError(error_message)
RuntimeError: Version mismatch: The cluster was started with:
    Ray: 2.10.0
    Python: 3.10.13
This process on node 172.16.0.31 was started with:
    Ray: 2.10.0
    Python: 3.10.14

hongchaodeng commented 8 months ago

@architkulkarni Can you take a look and provide guidance?

architkulkarni commented 8 months ago

@rynewang I remember you had some context about this error. Were we going to make 3 digit (micro) version mismatches only print a warning instead of raising an error?
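For illustration, the relaxed behavior asked about above (a micro mismatch warns, a major or minor mismatch still raises) might look roughly like the sketch below. This is not Ray's actual `check_version_info`; the function names are hypothetical, and it assumes purely numeric version strings such as `3.10.13`.

```python
import warnings


def parse_version(version):
    """Split a version string such as '3.10.13' into a tuple of ints.

    Assumes numeric components only (no 'rc1'-style suffixes).
    """
    return tuple(int(part) for part in version.split("."))


def check_python_version(cluster_version, local_version):
    """Raise on a major/minor mismatch; only warn on a micro mismatch.

    Returns True when the versions are identical, False when only the
    micro component differs. Hypothetical sketch, not Ray's code.
    """
    cluster = parse_version(cluster_version)
    local = parse_version(local_version)
    if cluster[:2] != local[:2]:
        raise RuntimeError(
            f"Version mismatch: cluster Python {cluster_version}, "
            f"this process Python {local_version}"
        )
    if cluster != local:
        warnings.warn(
            f"Python micro version differs: cluster {cluster_version}, "
            f"this process {local_version}"
        )
        return False
    return True
```

Under this sketch, `check_python_version("3.10.13", "3.10.12")` would emit a warning and return `False`, while `check_python_version("3.10.13", "3.11.0")` would still raise `RuntimeError`.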

As a workaround, can you specify the Ray Docker image in the cluster YAML spec? See, e.g., https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/gcp/example-gpu-docker.yaml for an example. I think that should make it impossible to have two different Python versions on different nodes.

As for the root cause, I'm not sure why the Python version isn't constant when a GCP instance is launched with default settings.