open-telemetry / opentelemetry-collector-contrib

Load Balancing by Attribute #33660

Open danielbanks opened 5 months ago

danielbanks commented 5 months ago

Component(s)

exporter/loadbalancing

Is your feature request related to a problem? Please describe.

In our project, we want to set up load-balanced sampling based on a session ID attribute.

In our project, we care about RUM and we have the concept of a user session with a session ID set as an attribute on our traces. We sample based on the session ID rather than the trace ID. This way we preserve telemetry of the whole user session.

The sampling is working fine but it is client-side head sampling. As we look to scale our solution we want to move this to the collector and introduce load balancing.

How do we achieve session sampling?

Right now we use the same ID generator for session IDs as for trace IDs, and we set the session ID as an attribute on each trace. Then we have a head sampling strategy that uses the same logic as the probabilistic head sampler, but rather than applying the decision to the trace ID we apply it to the session ID. This ensures we make the same sampling decision for the whole user session.
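
For illustration, here is a minimal Go sketch of a session-based probabilistic decision like the one described above. It assumes an FNV-1a hash and 10,000 buckets purely as stand-ins; `sampleSession` is a hypothetical helper, not the actual client code and not the collector's probabilistic sampler.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// sampleSession reports whether a session should be kept at the given
// sampling percentage (0-100). The decision depends only on the session ID,
// so every span in the same session gets the same answer.
// Hypothetical helper: the hash and bucket count are assumptions.
func sampleSession(sessionID string, percent float64) bool {
	if percent <= 0 {
		return false
	}
	if percent >= 100 {
		return true
	}
	h := fnv.New32a()
	h.Write([]byte(sessionID))
	// Map the hash onto 10,000 buckets and keep the lowest `percent` share.
	bucket := h.Sum32() % 10000
	return float64(bucket) < percent*100
}

func main() {
	for _, id := range []string{"session-a", "session-b", "session-c"} {
		fmt.Printf("%s sampled=%v\n", id, sampleSession(id, 25))
	}
}
```

Because the function is deterministic, any process that runs it with the same percentage reaches the same decision for a given session ID.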

What we would like to do is move these sampling decisions off the client and into the collector so that we have more flexibility. Our client is an Android application and making these decisions client side is not a long-term solution because we have to deal with application updates etc.

Following the recommended practice, we would like to have a two-layer collector setup, with the first layer load balancing across the second. The issue is that the load balancer only supports routing decisions based on trace ID or service name.

Given that we want to sample based on session ID (an attribute), making load-balancing decisions on trace ID alone is not enough. We need to route all telemetry with the same session ID to the same collector instance so that consistent sampling decisions can be made.
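
To make the requirement concrete, here is a rough Go sketch of routing by session ID. It uses rendezvous (highest-random-weight) hashing as a simple stand-in, not the exporter's own routing code; `pickBackend` and the backend names are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickBackend chooses a backend for a routing key using rendezvous hashing:
// every backend is scored against the key and the highest score wins.
// It returns "" when no backends are given.
// Hypothetical helper, not part of the loadbalancing exporter.
func pickBackend(key string, backends []string) string {
	best := ""
	var bestScore uint64
	for _, b := range backends {
		h := fnv.New64a()
		h.Write([]byte(b))
		h.Write([]byte{0}) // separator so ("ab", "c") and ("a", "bc") hash differently
		h.Write([]byte(key))
		score := h.Sum64()
		if best == "" || score > bestScore {
			best, bestScore = b, score
		}
	}
	return best
}

func main() {
	backends := []string{"collector-0:4317", "collector-1:4317", "collector-2:4317"}
	// Two spans from different traces but the same session land on the same backend.
	fmt.Println(pickBackend("session-123", backends))
	fmt.Println(pickBackend("session-123", backends))
}
```

With the session ID as the key, every span and log record from one session reaches the same second-layer collector, regardless of which trace it belongs to.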

It doesn't look like the load balancer currently supports balancing based on an attribute. This is a friendly request to add it!

Describe the solution you'd like

The ability to route telemetry based on attributes in addition to service name and trace ID

Describe alternatives you've considered

No response

Additional context

No response

jpkrohling commented 4 months ago

I believe there are a couple of comments to this:

  1. balancing based on an arbitrary attribute is doable, and we are doing that already for the service name. It should be easy to extend this function here to do that: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/2aa0e6b717c1b9f228552cc91f2214beb72fcde2/exporter/loadbalancingexporter/trace_exporter.go#L135-L165
  2. I'm not quite sure you need two layers: if you are doing probabilistic sampling based on the session ID, it's pretty much the same idea we have for the probabilistic sampling at the collector, which means that it can be consistent across collector instances without the need to centralize all session IDs on the same decision instances. So, you might not need the balancer to know about session IDs at all
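
As a rough idea of what the first point could look like, here is a standalone Go sketch that derives a routing key from a span attribute and falls back to the trace ID. The `span` type, `routingKey` function, and the `session.id` attribute name are assumptions for illustration and are not taken from the exporter's code.

```go
package main

import "fmt"

// span is a simplified, hypothetical stand-in for a real span type.
type span struct {
	TraceID    string
	Attributes map[string]string
}

// routingKey returns the value of attrKey when the span carries a non-empty
// value for it, and otherwise falls back to the trace ID so that spans
// without the attribute are still routed consistently per trace.
func routingKey(s span, attrKey string) string {
	if v, ok := s.Attributes[attrKey]; ok && v != "" {
		return v
	}
	return s.TraceID
}

func main() {
	withSession := span{
		TraceID:    "trace-1",
		Attributes: map[string]string{"session.id": "session-123"},
	}
	withoutSession := span{TraceID: "trace-2"}

	fmt.Println(routingKey(withSession, "session.id"))
	fmt.Println(routingKey(withoutSession, "session.id"))
}
```

The fallback to trace ID is a design choice in this sketch; an implementation could equally drop or separately route spans that lack the attribute.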

danielbanks commented 4 months ago

Thanks for the reply @jpkrohling. That's useful insight.

I'd like to move our probabilistic sampling of sessions into the collector rather than doing it client-side. But the sampler configuration can only specify custom attributes for logs, not traces. Our target solution is load-balanced telemetry across logs and traces, sampled by complete session. We want to observe users' sessions so that we can understand the full journey.

Do you have any recommendations for how this can be achieved with the current tooling?

jpkrohling commented 4 months ago

Take a look at the code for the probabilistic sampling processor in contrib. It could be changed to use specific attributes instead of the trace ID, which would be sufficient for your use case, if I'm understanding it correctly.

github-actions[bot] commented 2 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
