spring-cloud / spring-cloud-commons

Common classes used in different Spring Cloud implementations

Add a Power of Two Choices algorithm implementation for Spring Cloud LoadBalancer #601

Open OlgaMaciaszek opened 5 years ago

OlgaMaciaszek commented 5 years ago

Provide a Power of Two Choices (https://www.nginx.com/blog/nginx-power-of-two-choices-load-balancing-algorithm/) implementation for Spring Cloud LoadBalancer.

Original issue: https://github.com/spring-cloud/spring-cloud-commons/issues/595

OlgaMaciaszek commented 4 years ago

The research underlying this approach.

OlgaMaciaszek commented 4 years ago

The use cases examined within the whitepaper (linked in the comment above) assume that the clients (in our case, possibly client-side load balancer instances) communicate directly with the servers to choose the best server to pass the request to.

The servers interact with the clients informing them about their current load. The communication between the clients and the servers is structured into rounds. The research includes the following (alternative) approaches:

As the paper demonstrates, these approaches yield good results in terms of load distribution (and the resulting maximum loads versus the numbers of requests and servers). However, they are communication-heavy, so the resulting communication costs have to be low enough for the approach to be profitable.

With that in mind, the research centres on the minimum number of servers each client should communicate with (and, thus, the smallest communication cost) that still gives good load distribution between the servers. The author proves that while there is a very substantial benefit from contacting two random servers and choosing the better of them over just randomly picking one server to send requests to, adding any additional number of servers to choose from leads to much less significant improvements.

OlgaMaciaszek commented 4 years ago

In the context of implementing this within Spring Cloud LoadBalancer, there are some issues to consider. The implementation of the ReactiveLoadBalancer itself should not be problematic, given that we could use the data on the current servers' load stored in the ServiceInstance metadata. However, the ServiceInstanceListSupplier implementation, and how this information should be collected, should probably be discussed.
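As a rough illustration of the ReactiveLoadBalancer part, below is a minimal sketch of a power-of-two-choices `ReactorServiceInstanceLoadBalancer` that reads the load from ServiceInstance metadata. The `"load"` metadata key, the direct supplier injection, and the import packages (which differ between Spring Cloud versions) are assumptions made for the example, not a proposed final design.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

import reactor.core.publisher.Mono;

import org.springframework.cloud.client.ServiceInstance;
import org.springframework.cloud.client.loadbalancer.DefaultResponse;
import org.springframework.cloud.client.loadbalancer.EmptyResponse;
import org.springframework.cloud.client.loadbalancer.Request;
import org.springframework.cloud.client.loadbalancer.Response;
import org.springframework.cloud.loadbalancer.core.ReactorServiceInstanceLoadBalancer;
import org.springframework.cloud.loadbalancer.core.ServiceInstanceListSupplier;

// Sketch only: picks two distinct random instances and returns the one reporting the
// lower load in its metadata. Supplier wiring (normally via ObjectProvider) is simplified.
class PowerOfTwoChoicesLoadBalancer implements ReactorServiceInstanceLoadBalancer {

    private final ServiceInstanceListSupplier supplier;

    PowerOfTwoChoicesLoadBalancer(ServiceInstanceListSupplier supplier) {
        this.supplier = supplier;
    }

    @Override
    public Mono<Response<ServiceInstance>> choose(Request request) {
        return supplier.get().next().map(this::chooseFromTwo);
    }

    private Response<ServiceInstance> chooseFromTwo(List<ServiceInstance> instances) {
        if (instances.isEmpty()) {
            return new EmptyResponse();
        }
        if (instances.size() == 1) {
            return new DefaultResponse(instances.get(0));
        }
        ThreadLocalRandom random = ThreadLocalRandom.current();
        int first = random.nextInt(instances.size());
        int second = random.nextInt(instances.size() - 1);
        if (second >= first) {
            second++; // make the two sampled indices distinct
        }
        ServiceInstance a = instances.get(first);
        ServiceInstance b = instances.get(second);
        return new DefaultResponse(load(a) <= load(b) ? a : b);
    }

    // Assumes the current load is published under a hypothetical "load" metadata key;
    // missing or unparsable values are treated as maximally loaded.
    private long load(ServiceInstance instance) {
        try {
            return Long.parseLong(instance.getMetadata().getOrDefault("load", ""));
        }
        catch (NumberFormatException ex) {
            return Long.MAX_VALUE;
        }
    }
}
```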

Some things that come to mind:

  1. While the algorithm centres around the scenario where clients interact directly with end-servers during load-balancing, the most commonly used ServiceInstanceListSupplier that we have is the DiscoveryClientServiceInstanceListSupplier, which interacts with service discovery; we could approach this in various ways:

    • we could select the 2 service instances from SD and then communicate with them to get their load state and choose the better one for the final request - this approach requires an additional round of communication (as compared to the research) and might be too costly over HTTP, but it would probably be most in keeping with the PoTC approach (see the sketch after this list)
    • we could pass load info as service instance metadata while registering service instances in SD and then get this information directly from SD; the issue I see here is that the recommended ServiceInstanceListSupplier is the caching one, so, given that load tends to change quite dynamically, we would have stale data; using a reactive, push-based ServiceInstanceListSupplier implementation instead might be a possible solution, but it could still generate unnecessarily heavy traffic if we update it on each load change on each of the servers; also, if we use this approach, we might as well just pick the best server (or one of the two best servers), as we have the full list there already;
  2. Server instances would have to be instrumented to know their load and have an endpoint where that could be queried (possibly using Micrometer?)

  3. We would have a different, interesting scenario with Spring Cloud RSocket - there the Gateway/LB instances will actually have much more knowledge of the network topology, and we might be able to find the best instances based on that
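For the first option above (probing the two randomly selected instances at request time), a sketch could look like the one below. The `/load` endpoint, its numeric response body, and the timeout are purely hypothetical, and the extra round-trip cost mentioned above still applies.

```java
import java.time.Duration;

import reactor.core.publisher.Mono;

import org.springframework.cloud.client.ServiceInstance;
import org.springframework.web.reactive.function.client.WebClient;

// Sketch: given two candidates already picked at random from service discovery, ask each
// for its current load and route to the less loaded one.
class LoadProbingSelector {

    private final WebClient webClient = WebClient.create();

    Mono<ServiceInstance> selectLessLoaded(ServiceInstance first, ServiceInstance second) {
        return Mono.zip(probeLoad(first), probeLoad(second))
                .map(loads -> loads.getT1() <= loads.getT2() ? first : second);
    }

    // The extra round of communication per request mentioned above; failures or slow
    // responses are treated as "fully loaded" so the other candidate wins.
    private Mono<Long> probeLoad(ServiceInstance instance) {
        return webClient.get()
                .uri(instance.getUri().resolve("/load")) // hypothetical endpoint returning a number
                .retrieve()
                .bodyToMono(Long.class)
                .timeout(Duration.ofMillis(250))
                .onErrorReturn(Long.MAX_VALUE);
    }
}
```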

OlgaMaciaszek commented 4 years ago

I would like to start a discussion on the best implementation of PoTC. I have put some considerations in the comments above.

Please join the discussion and add any ideas and comments.

@spencergibb @ryanjbaxter @marcingrzejszczak @dsyer @Haybu @smaldini @nebhale @TYsewyn @elandau

Haybu commented 4 years ago

2. Server instances would have to be instrumented to know their load and have an endpoint where that could be queried (possibly using Micrometer?)

in non-RSocket cases and given that services are instrumented, two things come to mind to help a client with decision making:

• could a client consult a monitoring toolkit such as Prometheus, or its collected metrics written somewhere via a Prometheus writer (to a time-series db, etc.)? The downside is that this happens at request time and could cause a slight delay
• a Prometheus writer could write instrumented metrics to a stream broker (such as Kafka); leveraging Spring Cloud Stream, a client could act as a listener/sink and proactively maintain service instances' metrics (Kafka KStream/KTable) that could be consulted at the time of target service invocations

I can see both options involve extra components :)
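A rough sketch of the second option, assuming a functional Spring Cloud Stream consumer bound to a metrics topic (the LoadUpdate payload, the loadUpdates function name, and the corresponding loadUpdates-in-0 binding are made up for illustration):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Sketch: keep an in-memory view of each instance's load, fed from a stream broker,
// that the load balancer can consult at choose() time without extra remote calls.
@Configuration
class InstanceLoadListenerConfiguration {

    private final Map<String, Double> loadByInstanceId = new ConcurrentHashMap<>();

    // Hypothetical message payload published by the metrics pipeline.
    record LoadUpdate(String instanceId, double load) {
    }

    // Bind via e.g. spring.cloud.stream.bindings.loadUpdates-in-0.destination=instance-load
    @Bean
    Consumer<LoadUpdate> loadUpdates() {
        return update -> loadByInstanceId.put(update.instanceId(), update.load());
    }

    double loadOf(String instanceId) {
        return loadByInstanceId.getOrDefault(instanceId, Double.MAX_VALUE);
    }
}
```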

ojhughes commented 4 years ago

As the paper demonstrates, these approaches yield good results in terms of load distribution (and the resulting maximum loads versus the numbers of requests and servers). However, they are communication-heavy, so the resulting communication costs have to be low enough for the approach to be profitable.

It could be possible to piggyback on top of the underlying service discovery system, so that when a client requests registration information it also includes data about the service's load.
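For registries that allow it, one possible shape (sketched here with the Eureka client; the scheduling setup, the "load" key, and the currentLoad() signal are all assumptions) would be to periodically refresh the registered metadata with the current load:

```java
import java.util.Map;

import com.netflix.appinfo.ApplicationInfoManager;

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Sketch: push the instance's current load into its registered metadata so that clients
// receive it along with the registry data. Requires @EnableScheduling; the "load" key and
// the refresh period are arbitrary choices for this example.
@Component
class LoadMetadataPublisher {

    private final ApplicationInfoManager applicationInfoManager;

    LoadMetadataPublisher(ApplicationInfoManager applicationInfoManager) {
        this.applicationInfoManager = applicationInfoManager;
    }

    @Scheduled(fixedDelay = 5000)
    void publishLoad() {
        // Merged into the instance's metadata and propagated with subsequent heartbeats.
        applicationInfoManager.registerAppMetadata(Map.of("load", String.valueOf(currentLoad())));
    }

    // Placeholder load signal; a real implementation could read an in-flight request
    // counter or a Micrometer gauge here.
    private long currentLoad() {
        return 0;
    }
}
```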

OlgaMaciaszek commented 4 years ago

• could a client consult a monitoring toolkit such as Prometheus, or its collected metrics written somewhere via a Prometheus writer (to a time-series db, etc.)? The downside is that this happens at request time and could cause a slight delay
• a Prometheus writer could write instrumented metrics to a stream broker (such as Kafka); leveraging Spring Cloud Stream, a client could act as a listener/sink and proactively maintain service instances' metrics (Kafka KStream/KTable) that could be consulted at the time of target service invocations

I like the second idea more because it would probably be more performant; although I think it's worth considering whether we could get the information just with Micrometer, without Prometheus, so that we require as few obligatory external components as possible; wdyt?

dsyer commented 4 years ago

Micrometer can only tell you about metrics for the local process, right? I also think Prometheus is a good idea, but overkill for the simplest use cases. I definitely wouldn’t want it to be a mandatory dependency. We’ll probably end up rewriting Eureka if we take this idea much further.

OlgaMaciaszek commented 4 years ago

Micrometer can only tell you about metrics for the local process, right? I also think Prometheus is a good idea, but overkill for the simplest use cases. I definitely wouldn’t want it to be a mandatory dependency. We’ll probably end up rewriting Eureka if we take this idea much further.

That's true, but for this scenario, it could work:

we could select the 2 service instances from SD and then communicate with them to get their load state and choose the better one for the final request - this approach requires an additional round of communication (as compared to the research) and might be too costly over HTTP, but it would probably be most in keeping with the PoTC approach.

That being said, I'm also not convinced that this would be more efficient than using SD.

jleibund commented 4 years ago

I've picked up tracking this from the ticket @elandau registered. Just a little input from our gRPC P2C impl that might apply here:

  1. we use SD (eureka for us but could be any) for the active list to choose from
  2. we do use metrics, connection state, etc. when picking, including least pending requests on a connection/instance
  3. we track prior failures and quarantine those nodes (locally) as a tie breaker (if one is quarantined)
  4. we do not communicate with the two choices before choosing - there is some overhead mentioned, pathological cases, and diminishing returns over just metrics
  5. we implement exponential backoff in the LB

Our metrics internally are Spectator; I think Prometheus sounds good - but the provider would ideally be pluggable, and ditto for the metrics/status evaluation algorithm, depending on the metrics one has access to. Also, leveraging pluggable SD would certainly be helpful.

ryanjbaxter commented 4 years ago

I'm wondering if it makes sense that when a service sends its heartbeat data to the service discovery server, it also includes its current load data. This data could then be propagated to service discovery clients when they send their heartbeat delta and get updates about newly registered services. Yes, the cached data might be out of date, but we are already fine with the list of services being out of date as well.

Haybu commented 4 years ago

A metrics-aware discovery service would be optimal if we are fine with outdated cached instances; as @dsyer mentioned, it looks like a new, enhanced Eureka.

dsyer commented 4 years ago

I think I would be happy with local metrics to start with (we have statistics about requests we sent to each service instance from the current JVM at least). It might help if we design the API in such a way that it doesn't preclude a remote implementation. Shouldn't be too hard though.
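To make the local-metrics starting point concrete, here is a minimal sketch of a per-instance in-flight counter kept in the current JVM; the onStart/onComplete hook names are invented, and the idea is that a remote or registry-backed load source could later sit behind the same loadOf() lookup without changing the balancer.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

import org.springframework.cloud.client.ServiceInstance;

// Sketch: the number of requests this JVM currently has in flight per instance, used as
// the load signal for the choice of two. The hooks would be called around each request.
class LocalLoadTracker {

    private final Map<String, AtomicInteger> inFlight = new ConcurrentHashMap<>();

    void onStart(ServiceInstance instance) {
        inFlight.computeIfAbsent(instance.getInstanceId(), id -> new AtomicInteger()).incrementAndGet();
    }

    void onComplete(ServiceInstance instance) {
        inFlight.computeIfAbsent(instance.getInstanceId(), id -> new AtomicInteger()).decrementAndGet();
    }

    int loadOf(ServiceInstance instance) {
        AtomicInteger counter = inFlight.get(instance.getInstanceId());
        return counter == null ? 0 : counter.get();
    }
}
```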

OlgaMaciaszek commented 4 years ago

we do not communicate with the two choices before choosing - there is some overhead mentioned, pathological cases, and diminishing returns over just metrics

It also makes sense to me to avoid that communication, since we are communicating with the SR anyway and can get the information from there. However, in this context, why would we do the 2-random-choices approach at all? The main benefit discussed in the paper is getting a good load distribution while reducing communication to only 2 instances (the scenario the algorithm concentrates on is one where we need to contact each instance to get its load status). If we can communicate with only one external app, i.e. the service registry, and get all that data in 1 call, I'm not sure why we should not just pick the best instance instead of randomly picking 2 and choosing one of them?

TYsewyn commented 4 years ago

I'm wondering if it makes sense that when a service sends its heartbeat data to the service discovery server it also includes its current load data.

That's a really good idea, but unfortunately it won't work for all SRs, e.g. Cloud Foundry or Kubernetes.

I'm also not sure what implications that would have for the heartbeats. IIRC the default setting for most SD clients is to send a heartbeat every 30 seconds. If the system is under high load that information would be out of date and invalid sub-second.

jleibund commented 4 years ago

Olga - if the SR is always up-to-date and the propagation delay is acceptable, then you could be right. Under real-world conditions with SR lag, outages, bursty traffic, etc., it was important for us that each instance could adapt in isolation for periods of time, so in that sense we may have a variation on the paper as strictly interpreted. The SR can suffer from production and replication issues, depending on its design, refresh characteristics, caching, etc.

My comments pertain to a gRPC implementation (not straight http/servlets) so we have access to some additional information provided by the gRPC API that may or may not apply here. Our filtering flow looks approximately like this:

  1. use the SR to filter down to a list of active candidates and track this locally (re your comment, this could be the first point at which to remove overly loaded servers, but we do not currently do this)
  2. for each candidate channel we locally (per instance) track the count of failures, the count of backoffs (for an exponential backoff strategy), whether we are in the middle of retrying (we support retries), and the last failure time (for the backoff calculation)
  3. during the choice of two, pick two random candidates from the active candidate pool
  4. if one is quarantined (defined below), pick the other
  5. if both are quarantined or both are OK, then apply the least-loaded logic (defined below)

In the above, our least-loaded logic is based on the gRPC subchannel active count (basically the backlog of queued requests). So step 5 above is a local load decision that can adapt and does not suffer from issues with distributed SR caching/lag.

For the quarantine decision in step 4:

  1. if it's retrying, the candidate is quarantined for this decision
  2. if it's not in READY state (gRPC-specific) and the active count is > 0, then it's quarantined
  3. if its tracked count of failures is > 0, we apply the exponential backoff strategy, based on the count of backoffs, the last failure time and the current time, to decide whether it's quarantined
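Translated out of the gRPC-specific context, a sketch of that quarantine decision might look like the code below; the base delay, the cap, and the CandidateState shape are invented for illustration, and the gRPC READY-state check (point 2) is omitted as transport-specific.

```java
import java.time.Duration;
import java.time.Instant;

// Sketch: a candidate is quarantined while it is retrying, or while it is still inside an
// exponentially growing backoff window after previous failures.
class QuarantinePolicy {

    private static final Duration BASE_DELAY = Duration.ofMillis(500);
    private static final Duration MAX_DELAY = Duration.ofSeconds(30);

    // Hypothetical per-candidate state, mirroring the counters described in point 2 of
    // the filtering flow above.
    record CandidateState(boolean retrying, int failureCount, int backoffCount, Instant lastFailure) {
    }

    boolean isQuarantined(CandidateState state, Instant now) {
        if (state.retrying()) {
            return true; // mid-retry candidates lose the tie-break
        }
        if (state.failureCount() == 0) {
            return false;
        }
        // exponential backoff: quarantined until lastFailure + baseDelay * 2^backoffCount
        long multiplier = 1L << Math.min(state.backoffCount(), 20);
        Duration backoff = BASE_DELAY.multipliedBy(multiplier);
        if (backoff.compareTo(MAX_DELAY) > 0) {
            backoff = MAX_DELAY;
        }
        return now.isBefore(state.lastFailure().plus(backoff));
    }
}
```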

The above is just food for thought. Our gRPC choice-of-2 may have started as a strict reading of the research but has evolved over time to account for a number of failure situations we have encountered under high traffic volume. But it's one of many possible, valid implementations that could apply the choice-of-2 research more strictly as written, or extend beyond a controlled research environment.