YQ-Wang opened this issue 1 year ago
cc @architkulkarni, what's the expected behavior for jobs when the head node dies?
The head node is supposed to recover the state of all running jobs, so this looks like some issue with that. @YQ-Wang if you happen to have logs for when this happens, it could be useful if you zipped up all the logs and shared them.
Sure thing. Also, the reproduction steps in this issue should make it easy to reproduce.
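For reference, a minimal sketch of bundling a node's Ray logs for sharing. `/tmp/ray/session_latest/logs` is Ray's default log directory on each node; the function name and output filename are my own choices, not anything Ray ships:

```shell
# Minimal sketch: bundle a node's Ray logs into a tarball to attach to
# the issue. collect_ray_logs and ray-logs.tgz are hypothetical names;
# /tmp/ray/session_latest/logs is Ray's default log location on a node.
collect_ray_logs() {
    local log_dir="${1:-/tmp/ray/session_latest/logs}"
    # Archive the logs directory, keeping only its last path component.
    tar czf ray-logs.tgz -C "$(dirname "$log_dir")" "$(basename "$log_dir")"
}
```

On Kubernetes you would run this inside the head pod (e.g. via `kubectl exec`) and then copy the archive out with `kubectl cp`.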
What happened + What you expected to happen
The Ray v2 whitepaper mentions that worker nodes should be able to continue running tasks while the head node (GCS) recovers. However, it seems that the remote function in the worker is no longer running after the head crashes.
Versions / Dependencies
- Python 3.7
- Ray 2.2.0
- 1 head node: 1 CPU
- 1 worker node: 1 CPU
- 1 Redis instance
Reproduction script
```python
# script.py
import ray

@ray.remote(num_cpus=1, max_calls=1)
def write_redis():
    import redis
    # Connect to the external Redis service inside the cluster.
    r = redis.Redis(host='redis', port=6379, decode_responses=True)

ray.init()
print(ray.get(write_redis.remote()))
```
```shell
kubectl port-forward service/service-ray-cluster 8265:8265
```
```python
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")
job_id = client.submit_job(
    entrypoint="python script.py",
    runtime_env={"pip": ["redis"]},
)
print(job_id)
```
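To observe what happens to the submitted job across a head-node restart, one can poll its status via `JobSubmissionClient.get_job_status`. Below is a small hedged helper; `wait_for_job` and `TERMINAL_STATES` are my own names, and the helper takes any zero-argument status getter so it is not tied to a live cluster:

```python
import time

# Terminal states of a Ray job (string values of ray.job_submission.JobStatus).
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}

def wait_for_job(get_status, timeout_s=60.0, poll_s=1.0):
    """Poll get_status() until the job reaches a terminal state or we time out.

    get_status is any zero-argument callable returning a status string,
    e.g. lambda: client.get_job_status(job_id) against a live cluster.
    Returns the last observed status.
    """
    deadline = time.monotonic() + timeout_s
    status = get_status()
    while status not in TERMINAL_STATES and time.monotonic() < deadline:
        time.sleep(poll_s)
        status = get_status()
    return status
```

Against a live cluster this would be called as `wait_for_job(lambda: client.get_job_status(job_id))`; if job state recovery works after the head node comes back, the job should eventually report a terminal status rather than disappearing.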