openfaas / of-watchdog

Reverse proxy for STDIO and HTTP microservices
MIT License
259 stars 115 forks source link

Add support for a ready endpoint, based upon max_inflight and change timers to use a Context with a deadline. #143

Closed alexellis closed 1 year ago

alexellis commented 1 year ago

Description

Add support for a ready endpoint, based upon max_inflight and change timers to use a Context with a deadline.

Motivation and Context

1) Ready endpoint introduced 2) Documentation updated for usage 3) Logic for timing and stopping requests that run for too long has been switched over to use a context.

1) Currently, when a function is overloaded with too many requests vs max_inflight i.e. 10 inflight with a limit of 2, then a 429 / retry message is returned to the caller.

Kubernetes keeps all the Pods that are overloaded in its ready pool, so a caller like the queue-worker may have to call many times to finally hit a pod with capacity.

By implementing a ready endpoint, the readiness check for an individual function can be implemented, so that Kubernetes removes it from the pool of endpoints when it exceeds its max_inflight. It will return again when it's once again deemed to be ready.

Tested with a large amount of requests through the queue- worker, pods were removed from the pool of endpoints as expected.

2) The documentation was reorganise and updated for new options and other were removed which were not relevant.

3) As per https://github.com/openfaas/of-watchdog/commit/47d5c31a5a182999ea3f904d187355023ad75955, there were other modes which had an issue that could be resolved by using a Context with a deadline instead of a timer.

How Has This Been Tested?

Tested with unit tests on the new change to the concurrency limiter to expose a Met() method used in the new Ready endpoint.

No harm to existing functions which won't use the /_/ready path.

Then tested with some load with the queue-worker. I observed Pods being removed from the readiness count and coming back online again as expected.

This testing revealed a bug in the scale from zero logic of the gateway which was fixed here: https://github.com/openfaas/faas/pull/1753

Types of changes

Checklist:

alexellis commented 1 year ago

Thanks for the review @LucasRoesler

kevin-lindsay-1 commented 1 year ago

LGTM