
[Core] The remote function in the worker no longer runs after the head crashes #32454

Open · YQ-Wang opened this issue 1 year ago

YQ-Wang commented 1 year ago

### What happened + What you expected to happen

The Ray v2 whitepaper mentions:

> Note that any running Ray tasks and actors will remain alive since these components do not require reading or writing the GCS. Similarly, any existing objects will continue to be available.

However, in the experiment below, the remote function on the worker stops running after the head crashes.

### Versions / Dependencies

- Python 3.7
- Ray 2.2.0

Cluster setup:

- 1 head node: 1 CPU
- 1 worker node: 1 CPU
- 1 Redis instance

### Reproduction script

1. Use the following commands to reproduce the environment on a local machine using kind:

```shell
kubectl apply -f https://raw.githubusercontent.com/ray-project/ray/master/doc/source/cluster/kubernetes/configs/static-ray-cluster.with-fault-tolerance.yaml
kubectl apply -f https://raw.githubusercontent.com/ray-project/ray/master/doc/source/cluster/kubernetes/configs/static-ray-cluster-networkpolicy.yaml
```
2. Copy the following script to the head node:

```python
import ray
import time


@ray.remote(num_cpus=1, max_calls=1)
def write_redis():
    import redis

    r = redis.Redis(host='redis', port=6379, decode_responses=True)

    for i in range(120):
        time.sleep(1)
        r.set('counter', i)
        print(r.get('counter'))
    return "hello world"


ray.init()
print(ray.get(write_redis.remote()))
```
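Side note, not part of the repro: `max_calls=1` above only makes Ray start a fresh worker process after the task has run once. The separate `max_retries` option is what re-runs a task whose worker process dies; a minimal sketch of that option (it does not, by itself, cover losing the driver or the GCS):

```python
import ray

ray.init()


# max_retries re-executes the task if the worker process running it dies
# unexpectedly; it does not protect against the driver or GCS going away.
@ray.remote(num_cpus=1, max_retries=3)
def flaky_task():
    return "ok"


print(ray.get(flaky_task.remote()))
```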


3. Port-forward to the head node:

```shell
kubectl port-forward service/service-ray-cluster 8265:8265
```
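Optionally, before submitting anything, sanity-check that the port-forward is live. A minimal sketch, assuming `requests` is installed locally and that the dashboard serves the `/api/version` endpoint (the one the job client itself probes):

```python
import requests

# Expect HTTP 200 with Ray version info if the port-forward to the head
# node's dashboard is working.
resp = requests.get("http://127.0.0.1:8265/api/version")
print(resp.status_code, resp.text)
```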


4. Run the following code on the local machine:

```python
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")

job_id = client.submit_job(
    entrypoint="python script.py",
    runtime_env={"pip": ["redis"]},
)
print(job_id)
```
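To watch the job from the local machine, the same client can poll its status. A minimal sketch using `get_job_status`, reusing `client` and `job_id` from the step above; this is also a way to see whether the job ever leaves the RUNNING state after the head restarts:

```python
import time

from ray.job_submission import JobStatus

# Poll until the job reaches a terminal state. Useful for checking whether
# the job ever leaves RUNNING after the head node comes back.
while True:
    status = client.get_job_status(job_id)
    print(status)
    if status in {JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.STOPPED}:
        break
    time.sleep(5)
```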



What I observed:
After the job is submitted, the remote function writes the counter value to Redis every second.

![r1](https://user-images.githubusercontent.com/20148872/218231037-5cea5171-b997-4c1e-b886-6fd66a482a7d.png)

Then I scaled the head replicas from 1 to 0. The counter value stops at 53, as shown below:
![r2](https://user-images.githubusercontent.com/20148872/218231157-806b6a3e-7229-43dd-82ae-0c1588850737.png)

After that, I scaled the head replicas back to 1, copied the script to the head node, and port-forwarded again. The old job still shows as running, but the Redis value is no longer updated.
![r3](https://user-images.githubusercontent.com/20148872/218231246-46c11d1b-3bd8-4f47-9d9b-b68931e61865.png)

Then I submitted a new job; the previous job immediately failed, and Redis started receiving the new job's values.
![r4](https://user-images.githubusercontent.com/20148872/218231280-432e096b-c3fa-4885-9db5-aa9d16a167d2.png)
![r5](https://user-images.githubusercontent.com/20148872/218231284-1c450346-ba0c-430e-8935-a0b3f12cdf88.png)

I think the above experiment shows that after the head crashes, the function in the worker also gets terminated (though the dashboard still shows the job as running for some reason).

### Issue Severity

Medium: It is a significant difficulty but I can work around it.
cadedaniel commented 1 year ago

cc @architkulkarni what's the expected behavior for jobs when the head node dies?

architkulkarni commented 1 year ago

The head node is supposed to recover the state of all running jobs, so this looks like some issue with that. @YQ-Wang if you happen to have logs for when this happens, it could be useful if you zipped up all the logs and shared them.
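By default, each Ray node writes its logs under /tmp/ray/session_latest/logs, so something like this sketch, run on the head and worker pods, would bundle them (note that `session_latest` points at the current session, so logs from before a restart may live in an older session directory):

```python
import shutil

# Zip up the Ray logs on this node; /tmp/ray/session_latest/logs is the
# default log directory for the current session.
shutil.make_archive("ray-logs", "zip", "/tmp/ray/session_latest/logs")
```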

YQ-Wang commented 1 year ago

> The head node is supposed to recover the state of all running jobs, so this looks like some issue with that. @YQ-Wang if you happen to have logs for when this happens, it could be useful if you zipped up all the logs and shared them.

Sure thing. Also, the steps above should make the issue easy to reproduce.