Node executor working in tandem with a caching proxy should act more resiliently to the service unavailabilities in Ready state

slipstream / SlipStreamClient

SlipStream Python client

Apache License 2.0


Here is a run that was aborted in the Ready state because of a 502 response:

ss:abort- Exception with detail: ('Failed calling method GET on url https://nuv.la/run/f81e00ff-a18a-43b4-afa3-4c67c9e80a47/ss:state?ignoreabort=true, with reason: 502: Bad Gateway',)

This can be a problem in case of long-running auto-scalable runs.

Solutions might be:

On the node executor, if a service unavailability (5xx) is detected in the Ready state, do not abort the run as the first action once the service comes back.
Increase the TTL of the cached ss:state (when the value is Ready) in the caching proxy while the upstream is unavailable (at the moment it is 10 sec (?)).
Return another status code (e.g. 503 with a Retry-After header) for the ss:state RTP resource, so that the node executor can act more wisely in the Ready state.

This can be a problem in case of long-running auto-scalable runs.

For a mutable (scalable) run there is no limit on the number of retries the node executor will make, so this will not be a problem.

return another status code (e.g. 503 with Retry-After header) for ss:state RTP resource, so that node executor can act more wisely in case of Ready state.

Currently the server does not tell the client how long to wait; instead, the client implements an "exponential backoff" algorithm. Retry-After does not make sense in the case of a server issue (like a 502): if the server has crashed, it cannot determine how long the client has to wait before retrying, and nginx cannot either, because it does not know when the server will come back.
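To illustrate what "exponential backoff" means here, the following is a minimal sketch and not the actual SlipStreamClient code. The function name `backoff_delays` and the `base`, `factor` and `cap` values are assumptions chosen for the example.

```python
import itertools


def backoff_delays(base=1.0, factor=2.0, cap=60.0):
    """Yield an endless sequence of wait times in seconds.

    Each delay is `factor` times the previous one, never exceeding `cap`.
    All three parameters are illustrative assumptions, not values taken
    from the SlipStream client.
    """
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


# The first eight waits the client would observe between retries.
first_delays = list(itertools.islice(backoff_delays(), 8))
print(first_delays)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

Because the client computes the delay itself, it needs no hint from a server or proxy that may be unable to give one.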

To summarize:

If the Run (Deployment) is mutable (scalable), the client will never stop retrying.
If it is not mutable (scalable), it will fail after a certain amount of time and retries.
The exception is when the status code is 503 Service Unavailable (maintenance mode). This is why it is important to set this mode when we do maintenance.
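The summary above can be read as a small decision rule. The sketch below is one possible reading and not the client's real implementation: `should_retry`, `TRANSIENT_STATUSES` and `max_attempts` are hypothetical names, and it assumes that the 503 exception means the client keeps retrying during maintenance even for a non-mutable run.

```python
# Hypothetical set of status codes treated as temporary unavailability.
TRANSIENT_STATUSES = {502, 503, 504}
MAINTENANCE_STATUS = 503


def should_retry(status, mutable_run, attempts, max_attempts=5):
    """Return True if the node executor should retry the request.

    Assumptions (not confirmed by the thread):
    - only the codes in TRANSIENT_STATUSES are retried at all;
    - a 503 signals maintenance mode and is retried without limit;
    - `max_attempts` stands in for "a certain amount of time and retries".
    """
    if status not in TRANSIENT_STATUSES:
        return False
    if mutable_run:
        # Mutable (scalable) runs never stop retrying.
        return True
    if status == MAINTENANCE_STATUS:
        # Maintenance mode: keep waiting instead of aborting the run.
        return True
    # Non-mutable run with a non-maintenance error: give up eventually.
    return attempts < max_attempts
```

Under these assumptions, a 502 seen by a non-mutable run aborts it once the retry budget is spent, while the same outage reported as 503 does not.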


Node executor working in tandem with a caching proxy should act more resiliently to the service unavailabilities in Ready state #219