Open samsp-msft opened 2 years ago
It may be interesting to allow different implementations of the Config Service or Health Check service.
For example, we're planning on using Orleans to run health checks in the cluster (with only one instance of each health check) without needing to manage extra services. We haven't implemented it yet, but we could add further segmentation by availability zone, or another failure zone, by including it in the Orleans grain id for the health check.
Ex. all us-west2-a LBs hit the <route>/us-west2-a/HealthCheck grain, and similarly us-west2-b LBs hit <route>/us-west2-b/HealthCheck. If the segmentation (AZ in this example) is flexible, then any segmentation can be provided for a particular LB.
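The idea above can be sketched as follows. This is not the Orleans API, just an illustration of how a flexible segmentation key could be folded into the grain id so that each failure zone gets exactly one logical health-check instance; the "HealthCheck" suffix and zone names are assumptions taken from the example.

```python
# Sketch only: compose a per-segment health-check grain id, so each
# failure zone (AZ here) is served by exactly one logical instance.
def grain_id(route: str, segment: str) -> str:
    """Build the id for the health-check grain serving one segment."""
    return f"{route}/{segment}/HealthCheck"

# Each load balancer addresses the grain for its own availability zone,
# so checks are segmented without extra services to manage.
print(grain_id("shop", "us-west2-a"))  # shop/us-west2-a/HealthCheck
print(grain_id("shop", "us-west2-b"))  # shop/us-west2-b/HealthCheck
```

Because the segment is just a string in the id, any partitioning scheme (AZ, region, rack) works without changing the health-check code.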
An Orleans based implementation may not be provided out of the box, but we'd like to be able to utilize a similar implementation behind an abstraction if possible.
One of the reasons we have integrated with Orleans is that it is also useful for things like the Config Server (have a single point of computation for merged routes/clusters from multiple k8s clusters) and we also use it for rate limiting across a collection of load balancers. If you're running proxies at scale, you have to solve the distributed system problems somehow; Orleans is how we're solving it without needing to implement the whole distributed foundation from scratch.
Consider #267, where destinations have scheduled downtime; the health check system could push that data out to the proxies.
Feedback from a 1P team: for HTTP/2, the health checks ensure that there are warm connections to all destinations.
Edit: that's not HTTP/2-specific; it also helps for HTTP/1.x.
What should we add or change to make your life better?
YARP can do active health checks against backend servers to make sure that they are able to respond successfully to requests. With a number of YARP proxy instances and a large number of backends, each YARP instance will need to ping each backend for health checks. The total number of checks grows with the product of proxies and destinations, so it climbs quickly as either side scales out.
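The fan-out is multiplicative: N proxies each probing M destinations issue N × M probes per check interval. A tiny sketch with illustrative numbers (not measurements from any real deployment):

```python
# Illustrative numbers only: the probe count is proxies x destinations,
# so it grows multiplicatively as either side scales out.
def probes_per_interval(proxies: int, destinations: int) -> int:
    return proxies * destinations

print(probes_per_interval(10, 200))    # 2000 probes per interval
print(probes_per_interval(100, 2000))  # 200000: 10x on each axis => 100x probes
```

Consolidating the checks onto a few dedicated instances reduces this to roughly M probes per interval, independent of the proxy count.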
For scale-out scenarios, YARP should have the ability to run the health check as a separate service. That service should be runnable on a limited number of servers, which will perform the health checks and then provide the resulting data to the other YARP instances.
Concept
YARP includes a consolidated health check service which can be configured to run on a server or two. This service will talk to the configuration server #1710 to learn the cluster and destination definitions, and will perform the active health checks against the destinations based on the URL definitions in config.
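The core loop of such a consolidated checker could look like the sketch below: pull cluster/destination definitions, probe each destination once, and record a health state. The field names ("health_path", "destinations") and the "Healthy"/"Unhealthy" strings are assumptions, not YARP's actual config schema, and the probe is stubbed so the example is self-contained.

```python
# A minimal sketch, assuming a config feed of clusters/destinations and a
# pluggable probe function; the real YARP health-policy names differ.
def run_health_checks(clusters, probe):
    """Probe every destination once; return {destination: "Healthy"/"Unhealthy"}."""
    results = {}
    for cluster in clusters:
        path = cluster.get("health_path", "/healthz")  # per-cluster, from config
        for dest in cluster["destinations"]:
            ok = probe(dest + path)
            results[dest] = "Healthy" if ok else "Unhealthy"
    return results

# Stub probe so the sketch runs anywhere; a real checker would issue an
# HTTP request and apply the cluster's active health-check policy.
clusters = [{"health_path": "/healthz",
             "destinations": ["http://10.0.0.1", "http://10.0.0.2"]}]
report = run_health_checks(clusters, probe=lambda url: url.endswith(".1/healthz"))
print(report)  # {'http://10.0.0.1': 'Healthy', 'http://10.0.0.2': 'Unhealthy'}
```

The resulting report is what the service would then publish back through the configuration server for the proxies to consume.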
The configuration service will act as a broker, enabling the instances to discover each other.
Proposal
Consolidated health checks will depend on having a configuration server. This feature's value is mostly in scale-out scenarios where there are multiple YARP instances. The configuration server provides the orchestration: YARP instances learn about the health check server, and the health check server learns the configuration of the clusters and destinations.
The configuration server will include health check configuration as part of the configuration data exposed via its REST endpoints, along with notifications in either direction about a specific destination's health.
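The broker relationship might work like the sketch below: the health check server reports a destination's state change to the configuration server, which fans it out to every subscribed proxy. Class and method names here are hypothetical, not an API from YARP or #1710.

```python
# Sketch of the broker fan-out: one health-state change reaches all proxies.
class ConfigBroker:
    def __init__(self):
        self.subscribers = []          # proxies that asked for health updates

    def subscribe(self, proxy):
        self.subscribers.append(proxy)

    def publish_health(self, destination, state):
        # e.g. the health check server reports one destination as Unhealthy
        for proxy in self.subscribers:
            proxy.on_health_changed(destination, state)

class Proxy:
    def __init__(self):
        self.health = {}

    def on_health_changed(self, destination, state):
        self.health[destination] = state   # update the local routing view

broker = ConfigBroker()
p1, p2 = Proxy(), Proxy()
broker.subscribe(p1)
broker.subscribe(p2)
broker.publish_health("http://10.0.0.2", "Unhealthy")
print(p1.health)  # {'http://10.0.0.2': 'Unhealthy'}; p2 sees the same
```

Pushing state through the broker keeps all proxies converging on one view of destination health, instead of each proxy probing independently.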