open-telemetry / opentelemetry-go

OpenTelemetry Go API and SDK
https://opentelemetry.io/docs/languages/go
Apache License 2.0

otlpmetrichttp: load-balance between multiple endpoint ips #5838

Open sh0rez opened 2 months ago

sh0rez commented 2 months ago

Problem Statement

I want to horizontally scale the OTel collector and have the SDK (somewhat evenly) distribute requests to collector instances.

I have a Headless Service for my collector that returns all instances when querying via DNS:

$ dig otelcol
;; ANSWER SECTION:
otelcol.                600     IN      A       172.22.0.5
otelcol.                600     IN      A       172.22.0.8

However, because the Go HTTP client used by this package keeps the TCP connection alive, the SDK sticks to the first address returned until it becomes unreachable.

This also applies to regular k8s Services, because once the TCP connection is open, no further load balancing takes place on the k8s side.

There is https://github.com/golang/go/issues/34511 requesting this for the standard library, but no real progress has been made since 2019.

Proposed Solution

Instead of relying on the HTTP Client to determine the endpoint out of the DNS list, do the following:

If deemed acceptable, I am happy to contribute this functionality.

Alternatives

Disable Keepalive

By disabling HTTP keep-alives (connection reuse), a new connection is made on every request, which includes a DNS lookup. I confirmed this works by patching SDK internals, but it is inefficient.

Use custom RoundTripper

In the Go issue the use of https://github.com/CAFxX/balancer is suggested.

This, however, leads to a DNS lookup on every request, which is undesirable.

Have users deploy server-side loadbalancers

Of course, this can be fixed server-side by deploying another layer of load-balancing proxies (nginx, etc.) in front of the OTel collector. This greatly complicates the pipeline setup, though, as one might end up with three layers (HTTP load balancing, a stateless collector for sticky OTLP load balancing, and a stateful collector for processing).

dmathieu commented 2 months ago

Keeping a list of multiple endpoints is something that would break the specification requirements for OTLP exporters. https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/exporter.md#configuration-options

Also, if we start doing that, it's a feature we're introducing to a stable component. We won't be able to remove it when/if Go fixes this and it's no longer necessary.

Using a custom round tripper/transport is also not going to be possible for now. See https://github.com/open-telemetry/opentelemetry-go/issues/2632

Disabling keep-alives could be a valid option to add to the HTTP exporter clients.

dmathieu commented 5 days ago

@sh0rez can this be closed?