ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.55k stars 5.7k forks source link

[Serve] Look into request got stuck issue #47585

Open GeneDer opened 1 month ago

GeneDer commented 1 month ago

https://ray.slack.com/archives/CNCKBBRJL/p1724281728601689?thread_ts=1723916926.382399&cid=CNCKBBRJL

GeneDer commented 1 month ago

This is actually from an Anyscale customer ofOne and is brought up again at https://anyscaleteam.slack.com/archives/C04UR8J709K/p1726705443604199

We are able to download their logs and have some example requests that got into this state for example request id 85ece69a-4d51-4ab7-ade8-209f3b702ebf has the below logs

(ray) gene@geneanyscale2023 test_grcp_2 % grep -r "85ece69a-4d51-4ab7-ade8-209f3b702ebf" /tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-52c7b1dab011acf14a5359819fe183f513ecf835e3343f99991f3ba5-01000000-5752.err:INFO 2024-09-18 22:21:50,573 app1_Ingress 5gsucce2 85ece69a-4d51-4ab7-ade8-209f3b702ebf / app.py:71 - Ray health check initiated.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-52c7b1dab011acf14a5359819fe183f513ecf835e3343f99991f3ba5-01000000-5752.err:INFO 2024-09-18 22:21:50,574 app1_Ingress 5gsucce2 85ece69a-4d51-4ab7-ade8-209f3b702ebf / replica.py:408 - __CALL__ OK 2.5ms
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:18:52,989 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 50 attempts over 44.07s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:19:43,100 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 100 attempts over 94.18s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:20:33,204 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 150 attempts over 144.29s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:21:23,319 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 200 attempts over 194.40s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:22:13,431 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 250 attempts over 244.52s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:23:03,525 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 300 attempts over 294.61s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:23:53,621 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 350 attempts over 344.71s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:24:43,736 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 400 attempts over 394.82s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:25:53,989 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 50 attempts over 44.11s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:26:44,151 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 100 attempts over 94.27s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:27:34,364 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 150 attempts over 144.48s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:28:24,567 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 200 attempts over 194.68s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:18:52,989 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 50 attempts over 44.07s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:19:43,100 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 100 attempts over 94.18s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:20:33,204 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 150 attempts over 144.29s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:21:23,319 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 200 attempts over 194.40s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-52c7b1dab011acf14a5359819fe183f513ecf835e3343f99991f3ba5-01000000-5752.err","message":"INFO 2024-09-18 22:21:50,573 app1_Ingress 5gsucce2 85ece69a-4d51-4ab7-ade8-209f3b702ebf / app.py:71 - Ray health check initiated."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-52c7b1dab011acf14a5359819fe183f513ecf835e3343f99991f3ba5-01000000-5752.err","message":"INFO 2024-09-18 22:21:50,574 app1_Ingress 5gsucce2 85ece69a-4d51-4ab7-ade8-209f3b702ebf / replica.py:408 - __CALL__ OK 2.5ms"}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:22:13,431 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 250 attempts over 244.52s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:23:03,525 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 300 attempts over 294.61s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:23:53,621 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 350 attempts over 344.71s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:24:43,736 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 400 attempts over 394.82s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:25:53,989 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 50 attempts over 44.11s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:26:44,151 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 100 attempts over 94.27s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:27:34,364 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 150 attempts over 144.48s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/combined-worker.log:{"filename":"worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err","message":"WARNING 2024-09-18 22:28:24,567 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 200 attempts over 194.68s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1."}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/replica_app1_Ingress_5gsucce2.log:{"asctime": "2024-09-18 22:21:50,392", "levelname": "WARNING", "message": "Replica at capacity of max_ongoing_requests=5, rejecting request 85ece69a-4d51-4ab7-ade8-209f3b702ebf.", "filename": "replica.py", "lineno": 556, "log_to_stderr": false, "worker_id": "52c7b1dab011acf14a5359819fe183f513ecf835e3343f99991f3ba5", "node_id": "b780a6ab26f3c037b81dbbb86b3b957f042d083e4a8d834cb125318d", "actor_id": "392919b96895babf9e9cc8fa01000000", "deployment": "app1_Ingress", "replica": "5gsucce2", "component_name": "replica"}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/replica_app1_Ingress_5gsucce2.log:{"asctime": "2024-09-18 22:21:50,572", "levelname": "INFO", "message": "Started executing request to method '__call__'.", "filename": "replica.py", "lineno": 1152, "log_to_stderr": false, "worker_id": "52c7b1dab011acf14a5359819fe183f513ecf835e3343f99991f3ba5", "node_id": "b780a6ab26f3c037b81dbbb86b3b957f042d083e4a8d834cb125318d", "actor_id": "392919b96895babf9e9cc8fa01000000", "route": "/", "request_id": "85ece69a-4d51-4ab7-ade8-209f3b702ebf", "application": "app1", "deployment": "app1_Ingress", "replica": "5gsucce2", "component_name": "replica"}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/replica_app1_Ingress_5gsucce2.log:{"asctime": "2024-09-18 22:21:50,573", "levelname": "INFO", "message": "Ray health check initiated.", "filename": "app.py", "lineno": 71, "route": "/", "request_id": "85ece69a-4d51-4ab7-ade8-209f3b702ebf", "application": "app1", "worker_id": "52c7b1dab011acf14a5359819fe183f513ecf835e3343f99991f3ba5", "node_id": "b780a6ab26f3c037b81dbbb86b3b957f042d083e4a8d834cb125318d", "actor_id": "392919b96895babf9e9cc8fa01000000", "deployment": "app1_Ingress", "replica": "5gsucce2", "component_name": "replica"}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/replica_app1_Ingress_5gsucce2.log:{"asctime": "2024-09-18 22:21:50,574", "levelname": "INFO", "message": "__CALL__ OK 2.5ms", "filename": "replica.py", "lineno": 408, "route": "/", "request_id": "85ece69a-4d51-4ab7-ade8-209f3b702ebf", "application": "app1", "worker_id": "52c7b1dab011acf14a5359819fe183f513ecf835e3343f99991f3ba5", "node_id": "b780a6ab26f3c037b81dbbb86b3b957f042d083e4a8d834cb125318d", "actor_id": "392919b96895babf9e9cc8fa01000000", "deployment": "app1_Ingress", "replica": "5gsucce2", "component_name": "replica"}
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:18:52,989 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 50 attempts over 44.07s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:19:43,100 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 100 attempts over 94.18s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:20:33,204 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 150 attempts over 144.29s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:21:23,319 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 200 attempts over 194.40s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:22:13,431 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 250 attempts over 244.52s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:23:03,525 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 300 attempts over 294.61s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:23:53,621 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 350 attempts over 344.71s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:24:43,736 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 400 attempts over 394.82s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:25:53,989 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 50 attempts over 44.11s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:26:44,151 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 100 attempts over 94.27s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:27:34,364 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 150 attempts over 144.48s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1.
/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/serve/proxy_10.0.255.192.log:WARNING 2024-09-18 22:28:24,567 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 200 attempts over 194.68s. Retrying. Request ID: 79a8944b-89fe-448b-9293-2cb06e4cc8b1.
(ray) gene@geneanyscale2023 test_grcp_2 %

Note that the OK is logged at 2024-09-18 22:21:50,574

/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-52c7b1dab011acf14a5359819fe183f513ecf835e3343f99991f3ba5-01000000-5752.err:INFO 2024-09-18 22:21:50,574 app1_Ingress 5gsucce2 85ece69a-4d51-4ab7-ade8-209f3b702ebf / replica.py:408 - __CALL__ OK 2.5ms

but at 2024-09-18 22:22:13,431 proxy keeps retrying and failed to schedule the request

/tmp/logs/ses_lww6sc8ampkvp1yd664fe74ivk/session_2024-09-16_05-52-32_342420_3262/head-10.0.255.192-i-0ffe4f3c878f14184/worker-dd8650c4b2b143ce5679fd36aba84da81a726323718f3a7f0d1a5af7-01000000-5033.err:WARNING 2024-09-18 22:22:13,431 proxy 10.0.255.192 85ece69a-4d51-4ab7-ade8-209f3b702ebf / pow_2_scheduler.py:698 - Failed to schedule request after 250 attempts over 244.52s. Retrying. Request ID: 85ece69a-4d51-4ab7-ade8-209f3b702ebf.

Still don't have a good way to reproduce, but seems like a race condition between the scheduler and the replica

GeneDer commented 4 weeks ago

This seems to be reproducible by this deployment setup

import asyncio
import logging
from fastapi import FastAPI
from ray import serve
from ray.serve.handle import DeploymentHandle

webapp = FastAPI()
logger = logging.getLogger(__name__)

@serve.deployment(
    autoscaling_config={
        "max_ongoing_requests": 5,
        "min_replicas": 3,
        "max_replicas": 3,
    },
)
@serve.ingress(webapp)
class Ingress:
    def __init__(self, pipeline: DeploymentHandle):
        self.pipeline = pipeline

    @webapp.get("/")
    async def health_check(self) -> str:
        logger.info("Ray health check initiated.")
        return "Healthy!"

    @webapp.post("/conversations")
    async def start_conversation(self):
        logger.info("in start_conversation before sleep")
        await asyncio.gather(
            self.pipeline.start.remote(),
        )
        logger.info("in start_conversation after sleep")

@serve.deployment
class Pipeline:
    async def start(self):
        await asyncio.sleep(310)

app = Ingress.bind(Pipeline.bind())

Deploy to Anyscale service with anyscale service deploy --name gene-test-timeout2 deployment:app

Then use the following code to call the service

import requests
import ray

@ray.remote
def query_get(url, token):
    headers = {"Authorization": f"Bearer {token}"}
    response = requests.get(url, headers=headers)
    print(response.text)

@ray.remote
def query_post(url, token):
    headers = {"Authorization": f"Bearer {token}"}
    response = requests.post(url, headers=headers)
    print(response.text)

service_url = "https://gene-test-timeout2-bxauk.cld-kvedzwag2qa8i5bj.s.anyscaleuserdata.com/conversations"
token = "ae_1me8Z5Mbm-hJ4eFEyzB01S33cvNQd4OmR6jtkrQE"
ray.get([query_post.remote(service_url, token) for _ in range(11)])
zcin commented 3 weeks ago

So there are two issues:

  1. If a request is cancelled while the scheduler is trying to find an available replica for that request, then the scheduler will not remove that request from processing until a replica is available (then it simply bypasses that request). This is an open PR for fixing that issue: https://github.com/ray-project/ray/pull/47847. In this case the request is correctly cancelled, it will just sometimes be delayed.
  2. Recursive cancellation from Ray doesn't cover all cases, so that if the scheduler is in the middle of trying to assign the request to a replica, if an asyncio.CancelledError isn't sent directly to the router (which isn't guaranteed to happen), then the request won't get cancelled at all (and a request could still get forwarded to the downstream deployment). For this we will have to implement full recursive cancellation in the Ray Serve layer.