I observed that NumServeEndpoints changes frequently, especially since we started watching Endpoints in #2080. The error message is:
Get "http://10.244.0.6:8000/-/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
The HTTP client's timeout is 20 ms, so the health check request is cancelled before the response headers arrive. This PR increases the timeout to 2 seconds, the same value the dashboard HTTP client uses.
Test
With this PR, CheckProxyActorHealth did not fail during my 30-minute experiment. See this gist for more details.
Without this PR (and without #2080), CheckHealth failed 6 times in my 30-minute experiment. See this gist for more details.
I marked this PR as 'Hotfix' because I think 20 ms should be enough for my very simple setup (single Ray node, local Kind cluster, no requests). Hence, the instability may be a Ray Serve issue rather than a problem with the timeout value itself.
Why are these changes needed?
Related issue number
Checks