open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[exporter/loadbalancing] Support consistency between scale-out events #33959


jamesmoessis commented 4 months ago

Component(s)

exporter/loadbalancing

Is your feature request related to a problem? Please describe.

When a scale-out event occurs, the loadbalancing exporter goes from having n endpoints to n+1 endpoints, so traffic is partitioned differently once the scaling event completes.

Consider the case where a trace has 2 spans, and the loadbalancing exporter is configured to route by trace ID. Span (a) arrives and is routed to a given host. The scaling event then occurs, and span (b) arrives and is routed to a different host.

We need a way to scale out our backend effectively, without these inconsistencies and while maintaining performance.
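To make the problem concrete, here is a minimal Go sketch (hypothetical backend names, and a naive hash-mod-N scheme rather than the exporter's actual consistent-hashing ring) of how a membership change can split a single trace across backends:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// route picks a backend by hashing the trace ID modulo the number of
// endpoints. Deliberately naive: the real exporter uses a consistent-hashing
// ring, but the failure mode during a scale-out is the same in spirit.
func route(traceID string, endpoints []string) string {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return endpoints[h.Sum32()%uint32(len(endpoints))]
}

func main() {
	traceID := "4bf92f3577b34da6a3ce929d0e0e4736"

	before := []string{"backend-0", "backend-1"}
	after := []string{"backend-0", "backend-1", "backend-2"} // scale-out: n -> n+1

	// Span (a) is exported before the scale-out, span (b) after it; depending
	// on the hash value they may land on different backends.
	fmt.Println("span (a) ->", route(traceID, before))
	fmt.Println("span (b) ->", route(traceID, after))
}
```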

This is a separate problem from a scale-in event, which in my opinion presents a different set of problems and requires the terminating node to flush any data it is statefully holding onto. That may be worth discussing here too, but I want to focus on the simpler case of a scale-out event.

Describe the solution you'd like

I don't know exactly what the solution should be yet, and I'm hoping that this thread will generate discussion so we can reach a solution that works.

Essentially, any solution I've seen discussed for this problem involves some kind of cache that holds the trace ID as the key and the backend as the value. My question to the community is: how should this cache be implemented?
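For the sake of discussion, a rough sketch of what such a cache could look like in Go (hypothetical names and API; the hard parts, namely eviction of finished traces, a memory bound, and sharing the cache across load-balancer replicas, are exactly what needs input):

```go
// Package lbcache is a hypothetical sketch, not part of the exporter.
package lbcache

import "sync"

// backendCache pins a trace ID to the backend chosen when it was first seen,
// so spans arriving after a scale-out keep going to the original backend.
// A real implementation would need TTL-based eviction, a memory bound, and
// ideally a way to share entries across load-balancer replicas.
type backendCache struct {
	mu      sync.Mutex
	entries map[string]string // trace ID -> backend endpoint
}

func newBackendCache() *backendCache {
	return &backendCache{entries: make(map[string]string)}
}

// endpointFor returns the cached backend for traceID, or records and returns
// the one produced by pick (e.g. the current hash ring) on first sight.
func (c *backendCache) endpointFor(traceID string, pick func(string) string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if ep, ok := c.entries[traceID]; ok {
		return ep
	}
	ep := pick(traceID)
	c.entries[traceID] = ep
	return ep
}
```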

Describe alternatives you've considered

No response

Additional context

No response

github-actions[bot] commented 4 months ago

Pinging code owners:

jpkrohling commented 4 months ago

This is a problem inherent to distributed systems: there's only so much we can do before we simply document the use cases we won't support. When designing the load-balancing exporter, the trade-off was: either scaling events aren't frequent and we need fewer sync stages (periodic intervals for resolvers can be longer), or scaling events are frequent and we need more frequent/shorter refresh intervals.

This still requires no coordination between the nodes, acknowledging that they might end up making a different decision for the same trace ID during the moments when one load balancer is out of sync with the others. Hopefully this is a short period of time, but it will happen. To alleviate some of the pain, I chose an algorithm that is a bit more expensive than the alternatives but brings some stability with it: my recollection from the paper was that changes to the hash ring would affect only about a third of the hashes. I think it was (and still is, for most cases) a good compromise.
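For reference, a toy Go ring with virtual nodes illustrates that stability property (hypothetical code, not the exporter's implementation): only a fraction of trace IDs change owner when a backend is added, while a naive hash-mod-N would reshuffle most of them.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring is a toy consistent-hashing ring with virtual nodes.
type ring struct {
	points []uint32
	owner  map[uint32]string
}

func newRing(backends []string, vnodes int) *ring {
	r := &ring{owner: map[uint32]string{}}
	for _, b := range backends {
		for i := 0; i < vnodes; i++ {
			p := hash(fmt.Sprintf("%s-%d", b, i))
			r.points = append(r.points, p)
			r.owner[p] = b
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// pick returns the backend owning the first ring point at or after the
// trace ID's hash, wrapping around the ring if necessary.
func (r *ring) pick(traceID string) string {
	h := hash(traceID)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]]
}

func hash(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func main() {
	before := newRing([]string{"backend-0", "backend-1", "backend-2"}, 100)
	after := newRing([]string{"backend-0", "backend-1", "backend-2", "backend-3"}, 100)

	moved := 0
	const total = 10000
	for i := 0; i < total; i++ {
		id := fmt.Sprintf("trace-%d", i)
		if before.pick(id) != after.pick(id) {
			moved++
		}
	}
	// Roughly 1/(n+1) of the keys move to the new backend; the rest keep
	// their previous owner, which is the stability property mentioned above.
	fmt.Printf("remapped %d of %d trace IDs\n", moved, total)
}
```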

If we want to make it even better in terms of consistency, I considered a few alternatives in the past, which should be doable as different implementations of the resolver interface. My favorite is to use a distributed key/value store, like etcd or ZooKeeper, which would allow all nodes to get the same data at the same time, including updates. Consensus would be handled there, but we'd still need to handle split-brain scenarios (which is what I was trying to avoid in the first place).

Another thing I considered in the past was to implement a gossip extension and allow load-balancer instances to communicate with each other, and we'd implement the consensus algorithm ourselves (likely Raft, using an external library?). Again, split-brain would probably be something we'd have to handle ourselves.
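To sketch what a resolver-level implementation of the key/value-store idea could look like (the interface and names below are hypothetical, not the exporter's actual resolver API or an etcd/ZooKeeper client):

```go
// Package kvresolver sketches a resolver backed by a shared key/value store.
package kvresolver

import (
	"context"
	"strings"
)

// kvStore abstracts a shared key/value backend (etcd, ZooKeeper, ...).
// Both this interface and the resolver below are illustrative only.
type kvStore interface {
	// Get returns the current value for key, e.g. a comma-separated endpoint list.
	Get(ctx context.Context, key string) (string, error)
	// Watch invokes fn every time the value for key changes.
	Watch(ctx context.Context, key string, fn func(string))
}

// kvResolver keeps every load-balancer replica on the same endpoint list by
// reading it from the shared store instead of resolving DNS or static entries locally.
type kvResolver struct {
	store    kvStore
	key      string
	onChange func([]string) // called with the new endpoint list
}

func (r *kvResolver) start(ctx context.Context) error {
	cur, err := r.store.Get(ctx, r.key)
	if err != nil {
		return err
	}
	r.notify(cur)
	// All replicas watching the same key see the same updates, so they
	// converge on the same ring; split brain in the store itself remains
	// something to reason about.
	r.store.Watch(ctx, r.key, r.notify)
	return nil
}

func (r *kvResolver) notify(raw string) {
	if r.onChange != nil {
		r.onChange(strings.Split(raw, ","))
	}
}
```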

If we want to have consistency even on split-brain scenarios... well, then I don't know :-) I'm not ready to think about a solution for that if we don't have that problem yet.

Anyway: thank you for opening this issue! I finally got those things out of my head :-)

github-actions[bot] commented 2 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.