Description
Add support for a ready endpoint based upon max_inflight, and change timers to use a Context with a deadline.
Motivation and Context
1) Ready endpoint introduced
2) Documentation updated for usage
3) Logic for timing and stopping requests that run for too long has been switched over to use a context.
1) Currently, when a function is overloaded with more requests than its max_inflight limit, e.g. 10 in-flight requests with a limit of 2, a 429 / retry message is returned to the caller.
Kubernetes keeps overloaded Pods in its pool of ready endpoints, so a caller like the queue-worker may have to make many calls before finally hitting a Pod with capacity.
By adding a ready endpoint, a readiness check can be configured for an individual function, so that Kubernetes removes it from the pool of endpoints when it exceeds its max_inflight. The function is added back once it is deemed ready again.
Tested with a large number of requests through the queue-worker; Pods were removed from the pool of endpoints as expected.
2) The documentation was reorganised and updated for the new options, and sections which were no longer relevant were removed.
3) As per https://github.com/openfaas/of-watchdog/commit/47d5c31a5a182999ea3f904d187355023ad75955, there were other modes which had an issue that could be resolved by using a Context with a deadline instead of a timer.
How Has This Been Tested?
Tested with unit tests covering the change to the concurrency limiter, which now exposes a Met() method used by the new ready endpoint.
There is no impact on existing functions which do not use the /_/ready path.
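The kind of unit test described could be table-driven, along these lines. The limiter type and Met() here are again illustrative stand-ins rather than the real of-watchdog types:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// limiter is an illustrative stand-in for the concurrency limiter;
// Met reports whether in-flight requests have reached max_inflight.
type limiter struct {
	inflight    int64
	maxInflight int64
}

func (l *limiter) Met() bool {
	return atomic.LoadInt64(&l.inflight) >= l.maxInflight
}

func main() {
	// Table-driven cases in the style of a Go unit test.
	cases := []struct {
		inflight, max int64
		want          bool
	}{
		{0, 2, false},
		{1, 2, false},
		{2, 2, true},
		{10, 2, true}, // overloaded: 10 in flight with a limit of 2
	}
	for _, c := range cases {
		l := &limiter{inflight: c.inflight, maxInflight: c.max}
		fmt.Println(l.Met() == c.want) // true for every passing case
	}
}
```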
Then tested under some load through the queue-worker. I observed Pods being removed from the readiness count and coming back online again as expected.
This testing revealed a bug in the scale from zero logic of the gateway which was fixed here: https://github.com/openfaas/faas/pull/1753
Types of changes
Checklist:
git commit -s