open-telemetry / opentelemetry-collector

OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Change default otlp exporter GRPC load balancer to round robin #10298

Closed taniyourstruly closed 3 months ago

taniyourstruly commented 4 months ago

Is your feature request related to a problem? Please describe.

With the pick_first load balancer, the exporter defaults to sending all data to the same backend. This can cause problems in resource allocation: a single connection to one IP address or backend can be throttled. For scalable services, if that backend is overloaded, it cannot accept any more data, and data can then be dropped instead of being sent to another backend, especially if all backends are at their maximum limit. Also, pick_first does no actual load balancing (link); it just tries each address from the name resolver and connects to the first one that works.

Describe the solution you'd like

Updating the default load balancer to the round_robin policy will allow data to be sent to different backends, and therefore allocate resources more evenly. Round robin only picks ready connections, so it is a better fit for a typical cloud-compute setup, where clients send data into a scalable service with multiple workers. Requests alternate between backends, so when the resources on one backend are in use, data moves on to another available backend. With round_robin, users who want to send data to only one address can create connections against that address more than once, so that multiple connections are made (link), which keeps the connections from being throttled. With pick_first, by contrast, the client resolves to that one address and can only send data when that single connection is available, which causes throttling. With more than one connection to the address, as round_robin allows, throttling is less of an issue because data is spread across several connections that all lead to the same address.
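To make the difference concrete, here is a minimal sketch of the two policies as toy pickers. It is not grpc-go's implementation: the `backend` type, the pod names, and the `pickFirst`, `roundRobin`, and `distribute` helpers are all hypothetical names made up for illustration, and readiness is reduced to a boolean.

```go
package main

import "fmt"

// backend is a hypothetical stand-in for one address returned by the resolver.
type backend struct {
	name  string
	ready bool
}

// pickFirst models pick_first: every request goes to the first ready
// backend in resolver order, so no balancing happens.
func pickFirst(backends []backend) (int, bool) {
	for i, b := range backends {
		if b.ready {
			return i, true
		}
	}
	return 0, false
}

// roundRobin models round_robin: rotate through the list, skipping
// backends that are not ready.
type roundRobin struct{ next int }

func (r *roundRobin) pick(backends []backend) (int, bool) {
	for tries := 0; tries < len(backends); tries++ {
		i := r.next % len(backends)
		r.next++
		if backends[i].ready {
			return i, true
		}
	}
	return 0, false
}

// distribute sends n requests through pick and counts requests per backend.
func distribute(n int, backends []backend, pick func([]backend) (int, bool)) []int {
	counts := make([]int, len(backends))
	for i := 0; i < n; i++ {
		if idx, ok := pick(backends); ok {
			counts[idx]++
		}
	}
	return counts
}

func main() {
	// Hypothetical pods behind one service name.
	backends := []backend{{"pod-a", true}, {"pod-b", true}, {"pod-c", true}}
	rr := &roundRobin{}
	fmt.Println("pick_first: ", distribute(9, backends, pickFirst))
	fmt.Println("round_robin:", distribute(9, backends, rr.pick))
}
```

With three ready backends and nine requests, the toy pick_first sends all nine to the first pod, while the toy round_robin sends three to each. Marking a pod as not ready makes the round-robin picker skip it and rotate over the rest.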

Describe alternatives you've considered

The alternative is to leave pick_first as the default and recommend that users make a choice. This is not an adequate solution because users expect reliable delivery by default, and round_robin is substantially more reliable for minimal additional cost. We have restricted our choices to round_robin and pick_first because these two are registered by default; other, custom load balancers would have to be registered by the user.

Additional context

I tested these two load balancers and compared them in the Lightstep/SNCO dashboard. The setup is Envoy-edge, a proxy that load balances data, sending to our service spaningest, which, as its name implies, ingests OTLP spans using the arrow format.

In the first image, traffic sent from envoy to spaningest arrow looks fairly even. This is with round_robin load balancing, which distributes traffic in rotation across the different k8s pods, so the resource allocation is even.

image

After the change to pick_first is made, the amount of data (spans) each pod receives changes drastically, and some pods get more spans than others.

image

Comparing the two, the difference between the round_robin and pick_first load balancers is apparent. Since pick_first sends to the first pod that is available, data is sent to that pod and all the other pods are left using no resources.

image

tsloughter commented 3 months ago

I was curious why round robin was chosen. Were random or any other algorithms considered?

taniyourstruly commented 3 months ago

These are the two options that gRPC supports by default. Other load balancers do exist, but the exact set varies by language, and anything else would have to be registered by the user as a custom load balancer, which involves implementing a load balancer interface. (link)

tsloughter commented 3 months ago

Right, but only grpc-go matters to the collector. I thought it supported random, but it looks like I was wrong?