Open AliArfan opened 2 weeks ago
Pinging code owners:
processor/tailsampling: @jpkrohling
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Hi @AliArfan - Looking at the docs, there is a decision_cache option that remembers sampling decisions for a given trace ID beyond the decision_wait duration - is that something you could use for this purpose?
Hi @bacherfl
Thank you for your quick response!
I just took a look at the docs, and this is what I found about the decision_cache:
decision_cache (default = sampled_cache_size: 0): Configures amount of trace IDs to be kept in an LRU cache, persisting the "keep" decisions for traces that may have already been released from memory. By default, the size is 0 and the cache is inactive. If using, configure this as much higher than num_traces so decisions for trace IDs are kept longer than the span data for the trace.
Per my understanding, this is to save the trace decision after the trace has been released from memory. My problem is that for some traces the decision is made too early for the application's edge cases (before we receive an error). I would rather not increase decision_wait, as it would result in slower processing overall. Thus, I was looking for something that would let us process traces after the lifetime of the trace has finished on the application side. For example, a policy that lets us sample on trace_complete.
If I could use the decision_cache to alter the decision after receiving more spans with the same trace_id, that would be great! For example, we have made a decision not to sample a trace, but we receive an error span with the same trace_id after the collector has made the decision. If we could then get the trace from the cache, alter the decision, and export it, we would reach our desired behavior.
Thank you for clarifying @AliArfan! I see. In that case the decision_cache would only work if the trace previously had an error state and was already sampled. If the error state is only reached after the decision wait time, then according to my understanding increasing the decision_wait would be the workaround for now.
Regarding the introduction of trace_complete, I will have to refer to the code owners of this processor to get their opinion on whether this could be done. FYI @jpkrohling
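For readers following along, the workaround discussed here might be sketched as the config fragment below. The concrete values (120s, 200000, 500000) are illustrative assumptions, not recommendations from this thread. They would need to be sized to the longest retry/backoff sequence and to the trace volume. Note that, per the docs quoted above, decision_cache only persists "keep" decisions, so it helps late spans of an already-sampled trace but does not revisit a trace that was dropped.

```yaml
processors:
  tail_sampling:
    # Assumed value: long enough to cover the slowest retry/backoff sequence.
    # (The processor default is 30s.)
    decision_wait: 120s
    # Assumed value: a longer wait means more traces are held in memory at once.
    num_traces: 200000
    # Remembers "keep" decisions after the trace itself has been released.
    # The quoted docs advise setting this much higher than num_traces.
    decision_cache:
      sampled_cache_size: 500000
    policies:
      - name: keep-errors        # hypothetical policy name
        type: status_code
        status_code:
          status_codes: [ERROR]
```

The trade-off is the one raised earlier in the thread: every trace, not only the retried ones, waits for the full decision_wait before a decision is made.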
Thank you @bacherfl! Now I know that increasing decision_wait is our only option for now. Looking forward to the response from the devs :)
Component(s)
processor/tailsampling
Describe the issue you're reporting
Hi,
I am fairly new to the tail sampling processor, but I would like to ask if there is a solution to my use case. After reading the documentation and looking at examples online, my only viable option seems to be increasing the decision_wait time.
Problem Statement
We have a gRPC collector that processes each message from Cisco devices. We have leveraged OpenTelemetry to gain insights into the application's health. However, we noticed that during a month, we produce 20 GB of data. Therefore, we would like to use tail sampling to minimize the sampled data and only sample on probabilistic and status_code: ERROR.
The problem arises when the decision_wait time is reached due to errors in our application where we have a retry and backoff mechanism. For example, we try to re-publish the message to RabbitMQ if it fails, with an increasing backoff interval. The decision_wait set in the tail sampling processor would be too short to include all the retry spans.
Is there a way to sample all the error spans on retry, even after the decision_wait time has been reached?
It would be nice if there were a trace_start and trace_end we could use to only process traces that are complete.
Thank you!
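For context, the setup described in this issue (sample a fraction of traces probabilistically, plus every trace containing an ERROR status) would presumably look something like the fragment below. The policy names and the 10 percent sampling rate are assumptions for illustration, since the issue does not include the actual configuration.

```yaml
processors:
  tail_sampling:
    # Processor default. Spans arriving after this window (for example from
    # late retries) are not part of the original sampling decision.
    decision_wait: 30s
    policies:
      - name: keep-errors          # hypothetical policy name
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-some          # hypothetical policy name
        type: probabilistic
        probabilistic:
          sampling_percentage: 10  # assumed rate, not stated in the issue
```

A trace is kept if any policy decides to sample it, so an error span only helps if it arrives before decision_wait elapses for that trace.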