nodejs / undici

An HTTP/1.1 client, written from scratch for Node.js
https://nodejs.github.io/undici
MIT License

Pool distributes load unevenly across connections. #3648

Open JustinTRoss opened 4 days ago

JustinTRoss commented 4 days ago

Bug Description

Pool distributes load unevenly across connections, prioritizing them by creation time. See here.

When your connections are load-balanced across instances (pods/containers), this results in some instances being overused and others underused.
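
For reference, the current selection is effectively a first-fit scan in creation order: the first client that does not need draining wins, so the earliest-created connections absorb most of the load. A simplified sketch of that behaviour (not the actual undici source):

// Simplified sketch of the current behaviour, not the actual undici code.
// kNeedDrain is undici's internal "client is saturated" marker.
// Clients are scanned from index 0 on every dispatch, so the earliest-created
// connection that can accept a request is always chosen first.
function pickClient (clients) {
  for (const client of clients) {
    if (!client[kNeedDrain]) {
      return client;
    }
  }
  return null; // no free client; the pool falls back to creating one or queueing
}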

Reproducible By

  1. Create a service, ServiceA, with an endpoint that takes 100ms - 200ms to respond.
  2. Using Kubernetes, deploy 5 pods with instances of ServiceA.
  3. In another service, ServiceB, create a Pool of connections to ServiceA, referencing it by Kubernetes Service address (https://service-name), such that we'll be load-balancing connections through kube-proxy.
  4. Send 1 request to ServiceB every 50ms for 1 minute.
  5. Observe balanced connections across ServiceA pods, but imbalanced request distribution and resource usage across ServiceA pods.

Thinking it through, this may also be observable more directly through internal diagnostics, e.g. the number of requests processed by each Client.

Expected Behavior

Roughly balanced request distribution and resource usage across Clients and ServiceA pods.

Environment

macOS, Node v18

Additional context

There are a number of ways we could resolve this intelligently. Most simply (if crudely), we could track the index of the last-used client and start the next scan from that index, giving a rough round-robin.

Something like:

...
// Start the scan at the client after the one used last time, so requests
// are spread across clients in a rough round-robin.
const totalClients = clients.length;
let clientIndex = (this.lastUsedIndex + 1) % totalClients;

for (let i = 0; i < totalClients; i++) {
  const client = clients[clientIndex];

  // Use the first client from the rotated starting point that can accept a request.
  if (!client[kNeedDrain]) {
    this.lastUsedIndex = clientIndex;
    return client;
  }

  clientIndex = (clientIndex + 1) % totalClients;
}
Uzlopak commented 4 days ago

Imho duplicate of #310

mcollina commented 4 days ago

The problem stems from the use of Keep-Alive connections. Essentially, we prioritize "alive" connections as much as possible because they are the ones we know are live and less likely to time out or error due to network disconnections. This is suboptimal in the case of low traffic but minimizes errors. In a high-traffic scenario, this is equivalent to round-robin.

Have you tried using the BalancedPool? It implements a different algorithm that should work better for your use case.

JustinTRoss commented 4 days ago

Thanks for the quick response.

I see the challenge of balancing this against keeping connections alive, prioritizing sending requests through as few connections as possible to ensure maximum reuse. I don't understand the bit about it being suboptimal for low traffic and equivalent to round-robin for high traffic, though. It seems like the opposite to me.

As I reason about it, with low traffic, long request processing times, and short keep-alives, it will behave like round-robin, though there is an edge case in which we occasionally fail to reuse a connection. With high traffic, we end up with a linear trend of more requests processed by earlier connections and fewer by later ones, with the order of connection index use ending up something like: 1, 2, 3, 1, 4, 2, 1, 5, 2, 3, 1.

I haven't used BalancedPool. I've looked at it, but it seemed to be for the case of multiple different upstream targets, such that using it for one upstream would need to be something like:

const { BalancedPool } = require('undici');

new BalancedPool([
  "http://service-a",
  "http://service-a",
  "http://service-a",
  "http://service-a",
  "http://service-a",
]);

Is my understanding off?

JustinTRoss commented 4 days ago

I think I follow, actually. We'd never let go of connections once created, because all connections would have roughly the same idle time since their previous request. Then, once the pool got large enough and request volume dropped enough, all connections would time out at once.

Does that capture the reasoning?

mcollina commented 4 days ago

> I think I follow, actually. We'd never let go of connections once created, because all connections would have roughly the same idle time since their previous request. Then, once the pool got large enough and request volume dropped enough, all connections would time out at once.
>
> Does that capture the reasoning?

Yes, exactly.


If we can improve BalancedPool to cater for the Kubernetes/load balancer case, please send a PR!

JustinTRoss commented 4 days ago

> If we can improve BalancedPool to cater for the Kubernetes/load balancer case, please send a PR!

I'd be happy to if it makes sense and we can identify a sensible approach. It doesn't seem intended for that case, though, as I shared above.

How are you thinking it could be used to balance requests across a single upstream service?

mcollina commented 4 days ago

I think the algorithm used in balanced pool could be extended to work "in front" of a load balancer.

JustinTRoss commented 4 days ago

We can potentially steal inspiration from the http lib as well. Using Axios, the requests were distributed fairly evenly. We only noticed the change when switching to Undici. Request volume for this service is 500-1000 per second.
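
For context, Node core's http.Agent (which Axios uses under the hood in Node) has a scheduling option for picking the next free socket: 'lifo' (the default) reuses the most recently used socket, while 'fifo' picks the least recently used one and so spreads requests across sockets, much like the round-robin idea above. A rough sketch of what that looks like (the service URL is just a placeholder):

const http = require('node:http');

// 'fifo' selects the least recently used free socket, spreading requests
// across sockets (and across pods behind a load balancer).
// 'lifo' (the default) concentrates traffic on the most recently used sockets.
const agent = new http.Agent({
  keepAlive: true,
  scheduling: 'fifo'
});

// Placeholder URL for illustration only.
http.get('http://service-a/some-endpoint', { agent }, (res) => {
  res.resume();
});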

mcollina commented 4 days ago

A PR would be highly welcomed.