ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Ray Serve] | Replicas keep failing health check under high QPS while scaling up #46702

Open danishshaikh556 opened 1 month ago

danishshaikh556 commented 1 month ago

What happened + What you expected to happen

Scaling up or deploying Ray Serve under high traffic causes many replica health check failures.

[Screenshots from 2024-06-26 and serve_controller_152 (1).log attached.]

Versions / Dependencies

Ray 2.20.0, KubeRay 1.0.0

Reproduction script

N/A

Issue Severity

High: It blocks me from completing my task.

danishshaikh556 commented 1 month ago

When the actors fail, they usually report this error:

ERROR 2024-07-18 16:15:15,862 TextSearchRerankingTransformerV10_TFServe j28b73mh af31c081-2e1e-4658-9cc8-7f95acf3ec8e /serve replica.py:359 - Request failed:
ray::ServeReplica:TextSearchRerankingTransformerV10:TFServe.handle_request_with_rejection() (pid=23474, ip=100.82.149.124)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/utils.py", line 168, in wrap_to_ray_error
    raise exception
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 1132, in call_user_method
    await self._call_func_or_gen(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 856, in _call_func_or_gen
    result = await result
  File "/tmp/ray/session_2024-07-15_15-46-35_769995_8/runtime_resources/working_dir_files/s3_sr-search-data-886239521314_sagemaker_deploy_text_search_reranking_trainer_v10_dag-ray_2024-07-15_model/serve.py", line 392, in __call__
    input_data: dict = await starlette_request.json()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/starlette/requests.py", line 251, in json
    body = await self.body()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/starlette/requests.py", line 244, in body
    async for chunk in self.stream():
  File "/home/ray/anaconda3/lib/python3.9/site-packages/starlette/requests.py", line 238, in stream
    raise ClientDisconnect()
starlette.requests.ClientDisconnect
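
The traceback shows the request dying inside `await starlette_request.json()` because the client disconnected before the body was fully read, so these failures are client-side rather than replica-side. A minimal sketch of catching the disconnect so it does not get counted as a replica failure (the class and helper names here are hypothetical, not the actual serve.py):

```python
from ray import serve
from starlette.requests import ClientDisconnect, Request


@serve.deployment
class RerankingDeployment:
    """Hypothetical stand-in for the reporter's deployment class."""

    async def __call__(self, starlette_request: Request):
        try:
            # This is the await that raises in the traceback above: the
            # client hangs up (e.g. a caller-side timeout or retry) before
            # the request body has been fully read.
            input_data: dict = await starlette_request.json()
        except ClientDisconnect:
            # Nobody is left to receive the response; return early instead
            # of letting the exception count as a replica-side failure.
            return {"error": "client disconnected"}
        return await self._run_model(input_data)  # hypothetical helper

    async def _run_model(self, input_data: dict) -> dict:
        ...
```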
ravishtiwari commented 1 month ago

I have seen this happen in high-QPS scenarios, with a similar warning:

WARNING 2024-06-26 14:58:55,964 controller 537 deployment_state.py:775 - Health check for Replica...  failed: 
zcin commented 1 month ago

@danishshaikh556 Is my understanding correct: When the deployment upscales and brings up new replicas, most (or all?) requests that are sent to the replica fail. Then you have a custom health check for this deployment that checks the success rate of requests, and that health check fails because of all the request failures. This causes the controller to kill the replicas, which then causes the p99 latency to spike?
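
A success-rate health check of the kind described here would look roughly like the sketch below, using Serve's `check_health` hook (the controller periodically calls it and marks the replica unhealthy if it raises). The deployment name, counters, and 50% threshold are illustrative assumptions, not the reporter's actual code:

```python
from ray import serve


@serve.deployment(
    # How often the controller calls check_health, and how long it waits
    # for a response before treating the check itself as a failure.
    health_check_period_s=10,
    health_check_timeout_s=30,
)
class RerankingDeployment:  # hypothetical stand-in for the real deployment
    def __init__(self):
        self._total = 0
        self._failed = 0

    async def __call__(self, request):
        self._total += 1
        try:
            return await self._handle(request)
        except Exception:
            self._failed += 1
            raise

    async def _handle(self, request):
        ...  # model inference

    def check_health(self):
        # Raising here marks the replica unhealthy; the controller then
        # kills and replaces it.
        if self._total >= 100 and self._failed / self._total > 0.5:
            raise RuntimeError("Too many failed requests on this replica")
```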

danishshaikh556 commented 1 month ago

Yes, that's correct, with one clarification: the failing replicas aren't necessarily just the newly spun-up ones; the old ones mostly start failing too.

danishshaikh556 commented 1 month ago

@zcin Adding another snapshot of metrics showing what our system looks like when this happens. [Screenshot from 2024-07-24 attached.] You can see that under high QPS, while scale-up is happening, replicas are dying, which adds to latency (see the replica latency graph).