open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Trace-preserving mode for processor/tailsampling #25122

Open garry-cairns opened 1 year ago

garry-cairns commented 1 year ago

Component(s)

processor/tailsampling

Is your feature request related to a problem? Please describe.

We would like to use tail-based sampling because we believe it will give better insights into our running processes than head-based, and we have far too much data volume to store 100% of traces. We would, however, like to retain connections between our aggregated metrics, which we produce using the spanmetrics connector, and our stored traces. This is not currently possible.

Describe the solution you'd like

We would like there to be a configurable option to separate the concerns of sampling from that of filtering. In this model, the tail-based sampling processor could be configured in a "soft" mode (the name isn't important if you prefer another) that would simply update sampling.priority on all spans for a trace it has decided to sample and do no filtering. This would let subsequent processors including, but not limited to, spanmetrics use this information. The user would then be responsible for filtering unsampled traces/spans using the filter processor in their trace pipeline(s).
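To make the proposed "soft" mode concrete, here is a minimal, language-neutral sketch in Python (the collector itself is written in Go, so none of this is its API). The `Span` type, the `keep_trace` policy callback, and the `has_error` policy are all hypothetical names made up for illustration. The first function only tags spans with the trace-level decision; the second stands in for a later filter step. Treating spans without the attribute as unsampled in that filter is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Span:
    # Hypothetical stand-in for a span; not the collector's data model.
    trace_id: str
    name: str
    attributes: dict = field(default_factory=dict)


def mark_sampling_priority(
    spans: list[Span], keep_trace: Callable[[list[Span]], bool]
) -> list[Span]:
    """Soft mode: tag every span with its trace's decision and drop nothing."""
    by_trace: dict[str, list[Span]] = {}
    for span in spans:
        by_trace.setdefault(span.trace_id, []).append(span)
    for trace_spans in by_trace.values():
        priority = 1 if keep_trace(trace_spans) else 0
        for span in trace_spans:
            span.attributes["sampling.priority"] = priority
    return spans


def drop_unsampled(spans: list[Span]) -> list[Span]:
    """Separate filter step; a missing attribute counts as unsampled (assumption)."""
    return [s for s in spans if s.attributes.get("sampling.priority", 0) > 0]


def has_error(trace_spans: list[Span]) -> bool:
    # Example policy: keep any trace that contains an errored span.
    return any(s.attributes.get("error") for s in trace_spans)


spans = [
    Span("t1", "GET /cart"),
    Span("t1", "db.query", {"error": True}),
    Span("t2", "GET /health"),
]
marked = mark_sampling_priority(spans, has_error)  # all three spans survive
stored = drop_unsampled(marked)                    # only trace t1 remains
```

Anything placed between the two steps still sees every span, which is the separation of concerns the request is after.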

To expand on the connector/spanmetrics example, this would involve a separate feature request to make its exemplar behavior smarter: when a sampling.priority attribute is present, it would only include trace IDs where sampling.priority > 0 as exemplars of aggregated metrics. spanmetrics could then produce accurate metrics from 100% of traces, which it needs to see for its aggregates to be correct, without incurring the cost of storing all of those traces.
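A rough sketch of that exemplar rule, again in Python and not based on the real spanmetrics code: the `span_metrics` function and the dictionary shapes are hypothetical. Every span is counted, but a trace ID is only offered as an exemplar when `sampling.priority` is greater than 0. Spans with no such attribute stay eligible, which mirrors "in the presence of such an attribute" above.

```python
from collections import defaultdict


def span_metrics(spans):
    """Count calls per span name over all spans; pick exemplars only from sampled traces."""
    metrics = defaultdict(lambda: {"calls": 0, "exemplar_trace_ids": []})
    for span in spans:
        point = metrics[span["name"]]
        point["calls"] += 1  # metrics are computed from 100% of spans
        attrs = span.get("attributes", {})
        # A missing attribute defaults to 1, so existing behavior is unchanged.
        sampled = attrs.get("sampling.priority", 1) > 0
        if sampled and span["trace_id"] not in point["exemplar_trace_ids"]:
            point["exemplar_trace_ids"].append(span["trace_id"])
    return dict(metrics)


spans = [
    {"trace_id": "t1", "name": "GET /cart", "attributes": {"sampling.priority": 1}},
    {"trace_id": "t2", "name": "GET /cart", "attributes": {"sampling.priority": 0}},
    {"trace_id": "t3", "name": "GET /cart", "attributes": {}},
]
metrics = span_metrics(spans)
```

Here the call count covers all three spans, while trace t2 is never referenced as an exemplar because it will not be stored.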


Describe alternatives you've considered

One alternative we considered was changing spanmetrics so that it would mutate any trace it used as an exemplar, making the connection between its metrics and the traces from which they were derived simpler. But this would mean further changes to spanmetrics, which currently stores references to 100% of the traces it uses to produce its output as "exemplars", and it would couple the solution too tightly to spanmetrics. Our preferred solution leaves current behavior in place for those relying on it, while also offering a clean separation of concerns that gives other users much more flexibility to innovate with their pipelines.

Additional context

We are working in an environment with many thousands of hosts running hundreds of thousands of services, each of which may pass context belonging to the same logical traces between them.

github-actions[bot] commented 1 year ago

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

jpkrohling commented 1 year ago

I see the problem, and instead of using the filter processor, it would probably make sense to use a second-stage sampler as a connector:

# Sketch only: firststagesampling and secondstagesampling are proposed
# components that do not exist yet; component settings are omitted.
receivers:
  otlp:

processors:
  firststagesampling: # (our current tail-sampling processor?)
  spanmetrics:
  batch:

exporters:
  otlp:

connectors:
  secondstagesampling:

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [firststagesampling, spanmetrics]
      exporters: [secondstagesampling]
    traces/export:
      receivers: [secondstagesampling]
      processors: [batch]
      exporters: [otlp]

I'm not sure I would use the current tail-sampling for that.

garry-cairns commented 1 year ago

I like the pipeline design, and would likely use it, but couldn't we just use the existing routing connector with the first stage sampling decision as the criterion on which it's routing? (this may have been your intent but it wasn't clear to me so let me know)

jpkrohling commented 1 year ago

The idea is that the first stage sampling will appropriately mark the root spans with the sampling decision and the second stage sampling will effectively sample out the traces that were not marked as selected. While the routing connector has some of the same features (filter out data that is not relevant for the pipeline's specific exporter), I think having sampling in two stages will have a better user experience.
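For illustration, the second stage described here could look roughly like the following Python sketch. It is not an existing component. It assumes the first stage records its decision as `sampling.priority` on the root span (the attribute name comes from the original request, not from this comment) and that a root span is one with no parent. The function name and dictionary shapes are hypothetical.

```python
def second_stage_sample(spans):
    """Keep only the spans of traces whose root span was marked as selected."""
    selected = {
        s["trace_id"]
        for s in spans
        # Root span: no parent. Decision: sampling.priority > 0 (assumed name).
        if s.get("parent_span_id") is None
        and s.get("attributes", {}).get("sampling.priority", 0) > 0
    }
    return [s for s in spans if s["trace_id"] in selected]


spans = [
    {"trace_id": "t1", "span_id": "a", "parent_span_id": None,
     "name": "GET /cart", "attributes": {"sampling.priority": 1}},
    {"trace_id": "t1", "span_id": "b", "parent_span_id": "a",
     "name": "db.query", "attributes": {}},
    {"trace_id": "t2", "span_id": "c", "parent_span_id": None,
     "name": "GET /health", "attributes": {"sampling.priority": 0}},
    {"trace_id": "t2", "span_id": "d", "parent_span_id": "c",
     "name": "cache.get", "attributes": {}},
]
exported = second_stage_sample(spans)
```

Because only the root span carries the decision in this variant, the second stage needs the whole trace in hand before it can drop or forward child spans.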

github-actions[bot] commented 11 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

garry-cairns commented 9 months ago

I've got some capacity just now so I'm going to have a go at implementing this.

github-actions[bot] commented 5 months ago

This issue has been closed as inactive because it has been stale for 120 days with no activity.
