open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0
3.12k stars 2.39k forks

[processor/tailsampling] decision_wait time and the lifespan of a trace #36291

Open AliArfan opened 2 weeks ago

AliArfan commented 2 weeks ago

Component(s)

processor/tailsampling

Describe the issue you're reporting

Hi,

I am fairly new to the tail sampling processor, but I would like to ask if there is a solution to my use case. After reading the documentation and looking at examples online, my only viable option seems to be increasing the decision_wait time.

Problem Statement

We run a gRPC collector that processes every message from our Cisco devices, and we use OpenTelemetry to gain insight into the application's health. However, we noticed that we produce 20 GB of data per month. Therefore, we would like to use tail sampling to reduce the volume, keeping only a probabilistic sample plus traces with status_code: ERROR.

The problem arises when errors trigger our application's retry and backoff mechanism and the trace outlives decision_wait. For example, if publishing a message to RabbitMQ fails, we retry with an increasing backoff interval. The decision_wait set in the tail sampling processor is then too short to include all the retry spans.
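To make the timing concrete, here is a rough sketch in Python. The backoff parameters (5 s base delay, doubling, four retries) are made up for illustration and are not our real settings; the 30 s decision_wait is the processor's documented default. It only computes when each retry starts relative to the first span and which retries fall after the decision.

```python
from itertools import accumulate


def retry_offsets(base_delay_s, factor, attempts):
    """Seconds after the first span at which each retry starts.

    Assumes a plain exponential backoff with no jitter (illustrative only).
    """
    delays = [base_delay_s * factor ** i for i in range(attempts)]
    return list(accumulate(delays))


def late_retries(offsets, decision_wait_s):
    """Retries that start after the sampling decision has been made."""
    return [t for t in offsets if t > decision_wait_s]


# Hypothetical backoff: 5 s, 10 s, 20 s, 40 s between attempts.
offsets = retry_offsets(base_delay_s=5, factor=2, attempts=4)
late = late_retries(offsets, decision_wait_s=30)

print("retry offsets (s):", offsets)
print("retries after decision_wait (s):", late)
```

With these assumed numbers, the third and fourth retries start after the decision, so their error spans are not part of the trace that the policies evaluated.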

Is there a way to sample all the error spans on retry, even after the decision_wait time has been reached?

It would be nice if there were a trace_start and trace_end we could use to only process traces that are complete.

Thank you!

github-actions[bot] commented 2 weeks ago

Pinging code owners:

bacherfl commented 2 weeks ago

Hi @AliArfan - Looking at the docs, there is a decision_cache option that remembers sampling decisions for a given trace ID beyond the decision_wait duration - is that something you could use for this purpose?

AliArfan commented 2 weeks ago

Hi @bacherfl

Thank you for your quick response!

I just took a look at the docs, and this is what I found about the decision_cache:

decision_cache (default = sampled_cache_size: 0): Configures amount of trace IDs to be kept in an LRU cache, persisting the "keep" decisions for traces that may have already been released from memory. By default, the size is 0 and the cache is inactive. If using, configure this as much higher than num_traces so decisions for trace IDs are kept longer than the span data for the trace.

As I understand it, this keeps a trace's sampling decision after the trace has been released from memory. My problem is that for some traces the decision is made too early for the application's edge cases (before we receive an error). I would rather not increase decision_wait, as it would slow down processing overall. So I was looking for something that lets us process a trace once its lifetime has finished on the application side, for example a policy that lets us sample on trace_complete.

If I could use the decision_cache to alter the decision after receiving more spans with the same trace_id, that would be great! For example, say the collector has decided not to sample a trace, and afterwards we receive an error span with the same trace_id. If we could then get the trace from the cache, alter the decision, and export it, we would reach our desired behavior.

bacherfl commented 2 weeks ago

Thank you for clarifying, @AliArfan! I see. In that case the decision_cache would only help if the trace had already reached an error state and been sampled before the late spans arrive. If the error state is only reached after the decision_wait time, then as far as I understand, increasing decision_wait is the workaround for now. Regarding the introduction of a trace_complete policy, I will have to defer to the code owners of this processor for their opinion on whether this could be done. FYI @jpkrohling
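To illustrate the distinction, here is a toy model in Python. It is not the processor's code: the class ToySampler, its methods, and the rule that late spans of an unsampled trace are simply dropped are all simplifying assumptions based on the behavior discussed in this thread. It only shows that a cache of "keep" decisions helps late spans of traces that were already sampled, and does not revisit a "drop" decision.

```python
from collections import OrderedDict


class ToySampler:
    """Toy model only; NOT the tail sampling processor's implementation."""

    def __init__(self, decision_wait_s, sampled_cache_size):
        self.decision_wait_s = decision_wait_s
        self.sampled_cache_size = sampled_cache_size
        self.pending = {}          # trace_id -> (first_seen_s, [is_error, ...])
        self.kept = OrderedDict()  # LRU of trace IDs with a "keep" decision
        self.dropped = set()       # assumption: drop decisions are remembered

    def on_span(self, trace_id, is_error, now_s):
        """Return what happens to a span: buffered, exported, or dropped."""
        if trace_id in self.kept:
            self.kept.move_to_end(trace_id)  # refresh LRU position
            return "exported"
        if trace_id in self.dropped:
            return "dropped"
        first_seen, flags = self.pending.setdefault(trace_id, (now_s, []))
        flags.append(is_error)
        return "buffered"

    def decide(self, now_s):
        """Decide every trace whose decision_wait has elapsed."""
        due = [t for t, (seen, _) in self.pending.items()
               if now_s - seen >= self.decision_wait_s]
        decisions = {}
        for trace_id in due:
            _, flags = self.pending.pop(trace_id)
            if any(flags):  # stand-in for a status_code: ERROR policy
                self.kept[trace_id] = True
                while len(self.kept) > self.sampled_cache_size:
                    self.kept.popitem(last=False)  # evict least recently used
                decisions[trace_id] = "kept"
            else:
                self.dropped.add(trace_id)
                decisions[trace_id] = "dropped"
        return decisions


sampler = ToySampler(decision_wait_s=30, sampled_cache_size=10)
sampler.on_span("a", is_error=False, now_s=0)  # no error yet
sampler.on_span("b", is_error=True, now_s=0)   # error before the decision
print(sampler.decide(now_s=30))
print("late error span on a:", sampler.on_span("a", True, now_s=62))
print("late error span on b:", sampler.on_span("b", True, now_s=62))
```

In this model, trace "a" is the case from the issue: its first error arrives after the decision, so the cached "keep" decisions do not help it.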

AliArfan commented 2 weeks ago

Thank you @bacherfl! Now I know that increasing decision_wait is our only option for now. Looking forward to the response from the devs :)