Open psydok opened 10 months ago
I also get this error when I try to update the app (the cluster shows that the application itself is running, but the controller and proxy on the head node are not running):
WARNING worker.py:2052 -- The autoscaler failed with the following error:
Terminated with signal 15
  File "/usr/local/lib/python3.11/site-packages/ray/autoscaler/_private/monitor.py", line 711, in <module>
    monitor.run()
  File "/usr/local/lib/python3.11/site-packages/ray/autoscaler/_private/monitor.py", line 586, in run
    self._run()
  File "/usr/local/lib/python3.11/site-packages/ray/autoscaler/_private/monitor.py", line 440, in _run
    time.sleep(AUTOSCALER_UPDATE_INTERVAL_S)
@sihanwang41 please triage
What happened + What you expected to happen
serve start --http-host 0.0.0.0 --http-port=8000 --grpc-port=9000 ....
on the head node. And started my application via the ray job submit command. Everything works.
SERVE_CONTROLLER_ACTOR is in status RESTARTING (inside the page: SERVE_CONTROLLER_ACTOR -> PENDING_NODE_ASSIGNMENT, although the remote node has not changed, only the head has changed):
Found this error. But I have no idea how to set soft=True for serve start ...
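For context on what soft=True refers to: Ray's NodeAffinitySchedulingStrategy takes a node_id and a soft flag, and soft controls whether the actor may be placed on another node when the target node no longer exists. The sketch below is a toy model of that behaviour, not Ray code. NodeAffinity, pick_node, and the node names are hypothetical, and it assumes (the issue does not confirm this) that the controller is pinned to the ID of the previous head node.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NodeAffinity:
    """Toy stand-in for a node-affinity request (hypothetical, not a Ray class)."""
    node_id: str
    soft: bool


def pick_node(affinity: NodeAffinity, alive_nodes: list[str]) -> Optional[str]:
    """Return the node the actor would land on, or None if it stays pending."""
    if affinity.node_id in alive_nodes:
        # The requested node exists, so the actor is placed there.
        return affinity.node_id
    if affinity.soft and alive_nodes:
        # Soft affinity: fall back to any other live node.
        return alive_nodes[0]
    # Hard affinity with a missing node: nothing can satisfy the request.
    return None


# Assumed scenario: the head was replaced, the worker stayed the same.
alive = ["new-head", "worker-1"]
print(pick_node(NodeAffinity("old-head", soft=False), alive))  # None
print(pick_node(NodeAffinity("old-head", soft=True), alive))   # new-head
```

Under these assumptions, the hard-affinity case corresponds to an actor that never leaves PENDING_NODE_ASSIGNMENT, while the soft case would let it start on the new head.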
Versions / Dependencies
python==3.11.5 ray==2.8.1 kubectl==v0.23.1 argocd==v2.4.12+41f54aa
Reproduction script
Issue Severity
High: It blocks me from completing my task.