How to achieve consistent sampling across linked traces?

kalyanaj commented 1 year ago

Filing this issue per our discussion in the Sampling SIG today.

What are you trying to achieve? OpenTelemetry supports Span Links that can be used to model asynchronous scenarios or batched operations (fan-out/fan-in). I am looking to achieve some level of consistent (head-based) sampling of all the linked traces. If the sampling decision happens at an individual trace level, customers cannot understand the whole story of what happened to a request.

Example of links usage: One use-case is in a producer - consumer scenario where a producer span (say Trace T1 / Span S1) enqueues a job to a queue; let's say such jobs are processed by a consuming service asynchronously. Since the lifetimes of the producer and consumer are different, the consuming operation is modelled as a separate trace (T2 / S2) that links to T1 / S1 using span-links. If there's a way to do consistent sampling across links, then if T1 was sampled then T2 also should be sampled.

What did you expect to see? Guidance / samples / out-of-the-box sampler to help achieve the above. For example, something like:

if you are using parent-based sampling & want to get consistent sampling across links, this is what you need to do.
if you are using consistent-probability sampling & want to get consistent sampling across links, this is what you need to do.

Additional context.

One way the above scenario could be achieved is with a custom sampler that checks if any of the linked spans (of the span for which the sampling decision is being made) is sampled, and if so decide to sample this as well. This can work when the source of this link is the root span of a new trace.
On the other hand, if the source of the link is not the root span, it may need to consider its parent's decision or its links' decision to arrive at its decision. Yes, it will be a partial trace, but having a partial trace here might be better than no trace.
Need to understand the implications for the adjusted count etc.
There would be other trade-offs to consider as well: e.g., if a span is sampled because one of its 20 links is sampled, this span has a much higher probability of being sampled than any single link. With independently sampled links, its probability of being sampled is P(at least one link is sampled) = 1 - (1 - P(link1 being sampled)) × ... × (1 - P(link20 being sampled)), which is bounded above by P(link1 being sampled) + P(link2 being sampled) + ... + P(link20 being sampled). So we need to consider whether additional probabilistic measures are needed for the link sampling (credit: @pyohannes).
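
A minimal sketch of the custom sampler idea described in the list above, written in plain Python rather than against any SDK. `LinkedContext`, `Decision`, and the `root_decision` argument are hypothetical stand-ins (not OpenTelemetry API names); `root_decision` represents whatever the configured root sampler would have decided on its own.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class Decision(Enum):
    DROP = "drop"
    RECORD_AND_SAMPLE = "record_and_sample"


@dataclass(frozen=True)
class LinkedContext:
    # Hypothetical stand-in for a span context (parent or link target).
    trace_id: str
    span_id: str
    sampled: bool


def should_sample(parent: Optional[LinkedContext],
                  links: Sequence[LinkedContext],
                  root_decision: bool) -> Decision:
    """Sample if any linked context is sampled; otherwise fall back to the
    parent's decision (non-root span) or the root sampler (root span)."""
    if any(link.sampled for link in links):
        return Decision.RECORD_AND_SAMPLE
    if parent is not None:
        return Decision.RECORD_AND_SAMPLE if parent.sampled else Decision.DROP
    return Decision.RECORD_AND_SAMPLE if root_decision else Decision.DROP


# Producer span T1/S1 was sampled, so the consumer root span T2/S2 is sampled too.
t1_s1 = LinkedContext(trace_id="T1", span_id="S1", sampled=True)
consumer_decision = should_sample(parent=None, links=[t1_s1], root_decision=False)
```

For a non-root span whose parent is unsampled but whose link is sampled, this sketch yields a partial trace, which matches the trade-off noted above.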

carlosalberto commented 1 year ago

cc @jmacd

cijothomas commented 1 year ago

> One way the above scenario could be achieved is with a custom sampler that checks if any of the linked spans (of the span for which the sampling decision is being made) is sampled, and if so decide to sample this as well

This was originally implemented in .NET (https://github.com/open-telemetry/opentelemetry-dotnet/pull/1851/files), but it was removed because the spec did not cover it at that time.

jmacd commented 1 year ago

@kalyanaj Thank you for posing these questions.

I would like to separate questions about span Links being created after the start of a span into a separate topic which may interest @pyohannes, below.

About `ShouldSample()` and Span Links

Note that the present OTel-specified mechanism for probability sampling uses tracestate and that OTLP Span Links include the tracestate of the linked context. This means each context independently encodes its adjusted count.

We can describe a non-probability Sampler that decides to sample if any of its links are sampled. The new span's tracestate will have no r-value or p-value, but each linked-to context may define its own r-value/p-value. Taken from the perspective of any one of those contexts, the new span may be considered representative to a degree that depends on the p-value of that linked-to context.

We can also give the new span an independent probability of being sampled on its own (root or non-root) using the consistent scheme already specified.

We can combine the non-probability and the probability samplers as specified in the composition rules, which state, in particular, that if the Span would not be sampled probabilistically but is recorded for any other reason, it should use p-value 63, which signifies zero adjusted count.
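
A rough sketch of that composition, under stated assumptions: the consistent probability decision is taken to be "sample when p-value ≤ r-value", and a p-value p below 63 is taken to imply an adjusted count of 2^p. Both follow my reading of the tracestate probability sampling specification and are not stated in this thread. The function names are hypothetical.

```python
from typing import Optional, Tuple

ZERO_ADJUSTED_COUNT_P = 63  # p-value that signifies zero adjusted count


def adjusted_count(p_value: Optional[int]) -> Optional[int]:
    """Adjusted count implied by a p-value (assumed: 2**p, and 0 for p == 63).
    Returns None when no p-value is present."""
    if p_value is None:
        return None
    if not 0 <= p_value <= ZERO_ADJUSTED_COUNT_P:
        raise ValueError("p-value must be in [0, 63]")
    return 0 if p_value == ZERO_ADJUSTED_COUNT_P else 2 ** p_value


def compose(p_value: int, r_value: int,
            any_link_sampled: bool) -> Tuple[bool, Optional[int]]:
    """Combine a consistent probability sampler with a link-based sampler.
    Returns (sampled, p-value to record on the span)."""
    if p_value <= r_value:      # assumed consistent-sampling decision rule
        return True, p_value    # sampled probabilistically, counts as 2**p spans
    if any_link_sampled:        # recorded for a non-probability reason
        return True, ZERO_ADJUSTED_COUNT_P
    return False, None
```

With this shape, a span kept only because of a sampled link is exported but contributes nothing to probability-based span counts.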

I understand these statements do not quite answer your question, but that is by design. If you want to inherit a probability sampling decision from a parent context, then you may continue as its child; otherwise, new contexts require new probability sampling decisions. In a scenario where you sample each link at 1/2 and you have 10 links when starting a new span, the probability of all of the links being sampled is 2^-10, for example.
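
The arithmetic for that example can be checked with exact fractions. This assumes the 10 linked contexts are sampled independently, which the example implies but does not state; the helper names are illustrative only.

```python
from fractions import Fraction


def p_all_links_sampled(p: Fraction, n: int) -> Fraction:
    """Probability that all n independently sampled links are sampled."""
    return p ** n


def p_any_link_sampled(p: Fraction, n: int) -> Fraction:
    """Probability that at least one of n independently sampled links is sampled."""
    return 1 - (1 - p) ** n


half = Fraction(1, 2)
print(p_all_links_sampled(half, 10))  # 1/1024, i.e. 2^-10
print(p_any_link_sampled(half, 10))   # 1023/1024
```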

About Span Links outside of `Start()`

The current OTel trace API allows span links to be given only when a span starts. We believe this rule exists so that sampling decisions can be made based on all the links, and we suspect it has to do with establishing trace "completeness".

The probability sampling specification makes a recommendation to use non-descending probabilities from root to leaf in a trace, to ensure completeness, because of asymmetry. We know when there is a missing parent but not when there is a missing child, so we recommend children not to use a lower sampling rate than their parents, so that (because of sampling "consistency") traces are either complete or recognizably incomplete.

We have a similar situation with span links -- we know when the span link for a sampled span was unsampled ("missing"), but if the linked-to context is sampled and the new span is not sampled, we have an analogous problem -- there is nothing to inform the sampled span that it is missing a link from the unsampled new span. In this scenario, where span links are used, we have no way to ensure completeness.

The fact that we cannot ensure completeness is by design, but the fact that we cannot recognize incomplete traces is a defect. Moreover, we have this defect with or without the ability to add links after a span starts, because we have no way to inform a linked-to context that a link was established.

The problem, I believe, is that we are treating span links as having a single direction. If we had a field to represent the direction of the link, then when a new span starts it records links directed TO a number of other spans while each context linked-to would have a new link directed FROM the new span. In this case, if either side is sampled we will be able to detect a link to the other possibly-sampled context. Having a span link direction field would also allow us to support span link creation after span start, because when a linkage is potentially recorded due to sampling on either side, we will be able to at least establish that an unrecorded connection exists.

pyohannes commented 1 year ago

> If we had a field to represent the direction of the link, then when a new span starts it records links directed TO a number of other spans while each context linked-to would have a new link directed FROM the new span. In this case, if either side is sampled we will be able to detect a link to the other possibly-sampled context.

If there is a producer publishing a message to a topic, it cannot know how many consumers are subscribed to the topic and are processing the message. In case some of the consumer traces aren't sampled, I don't see how directions on links would help.

lmolkova commented 1 year ago

I agree with @jmacd here - assuming we deal with a relatively high number of links, the question is what approach would maximize the number of complete groups of traces, but it'd be impossible to achieve full consistency.

This perspective also helps with the links-after-start discussion. It's up to the sampler to maximize consistency, but since full consistency is impossible to achieve anyway, we should allow adding links after start (with or without a direction).

jmacd commented 1 year ago

@pyohannes I apologize for the confusion. The idea didn't fully address the problem, as I realized from a discussion we had about this issue in today's Sampling SIG.

I was trying to establish that Sampling as we know it, where new spans make a sampling decision somehow dependent on their parent context and the span contexts they are linked with at creation, is the reason why we do not support creating Span links after span start. The idea is that because a Sampler has access to the sampled flag of its parent context and other preceding (linked-to) contexts, then we have these capabilities:

We can ensure completeness by making the right Sampler decision
We can verify completeness when reviewing Span data.

We prohibit creating span links after span start because doing so breaks one or both of these. What we have is a situation where a link between spans must be recorded by the later-in-time span; the only way we have to control recording a span is in the sampling decision, therefore span links must be present at the time of sampling.

The creation of a span link after span start breaks the two requirements as follows. If the linked-to context is sampled, then the only way to make it complete is to record the linked-from span. If the linked-from span is already not being recorded because the sampling decision has passed, it becomes impossible to record the link. We have unverifiable incompleteness because the linked-to span has no awareness of the linked-from span, which was not recorded. The problem scenario, to be concrete, is a call to add a span link when the linked-to context is sampled and the linked-from span is a no-op span. We have nowhere to record the link.

The OTel Sampler API currently returns one of four states, described here: https://opentelemetry.io/docs/reference/specification/trace/sdk/#recording-sampled-reaction-table. To address @kalyanaj's original question and to support span links after creation, we need a new Span reaction: a "conditionally recorded" span. A conditionally recorded span is one that is not itself sampled but is held in memory, recording events and potential after-creation span links. When an after-creation span link is added that links to a sampled span context, the conditionally recorded span changes state, entering a new state "exported-unsampled" in which the span is passed to the exporter despite being unsampled. (If the span was also being probability sampled, exported-unsampled spans MUST be assigned zero adjusted count.)
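
To make the proposed behavior concrete, here is a small sketch of a conditionally recorded span being promoted by a late link. Nothing here exists in any OpenTelemetry SDK: `HeldSpan`, `Link`, `State`, and the list used as an exporter are all hypothetical, and the zero adjusted count is recorded as a plain integer for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class State(Enum):
    CONDITIONALLY_RECORDED = "conditionally-recorded"
    EXPORTED_UNSAMPLED = "exported-unsampled"


@dataclass(frozen=True)
class Link:
    trace_id: str
    span_id: str
    sampled: bool  # sampled flag of the linked-to context


@dataclass
class HeldSpan:
    """An unsampled span held in memory in case a late link requires export."""
    name: str
    state: State = State.CONDITIONALLY_RECORDED
    links: List[Link] = field(default_factory=list)

    def add_link(self, link: Link) -> None:
        # Links are always recorded; a link to a sampled context promotes the span.
        self.links.append(link)
        if link.sampled:
            self.state = State.EXPORTED_UNSAMPLED

    def end(self, exporter: List[Tuple[str, int]]) -> None:
        # Only promoted spans reach the exporter, with zero adjusted count.
        if self.state is State.EXPORTED_UNSAMPLED:
            exporter.append((self.name, 0))


exported: List[Tuple[str, int]] = []
span = HeldSpan("consume-batch")
span.add_link(Link("T1", "S1", sampled=True))  # late link to a sampled context
span.end(exported)
```

A span that never receives a link to a sampled context stays conditionally recorded and is simply discarded at `end()`.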

Then, to configure a Sampler that would ensure consistent, complete spans including their span links:

1. If the prevailing Sampler (root or parent-based) decides to sample, sample as usual.
2. Otherwise, ShouldSample would return either conditionally-recorded or exported-unsampled, so that the span can be recorded to complete other contexts. Conditionally-recorded means that none of the at-creation-time span links were sampled, but future span links may still trigger export. Exported-unsampled means that at least one of the at-creation-time span links was already sampled.
3. The probability sampling composition rules explain how to combine 1 and 2.
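
The decision procedure in the steps above could be sketched as follows. The `Decision` values mirror the proposed (not yet specified) decision codes, and `prevailing_sampled` stands in for the outcome of whichever root or parent-based Sampler is configured; both names are assumptions made for this illustration.

```python
from enum import Enum
from typing import Sequence


class Decision(Enum):
    SAMPLED = "sampled"
    EXPORTED_UNSAMPLED = "exported-unsampled"
    CONDITIONALLY_RECORDED = "conditionally-recorded"


def should_sample(prevailing_sampled: bool,
                  link_sampled_flags: Sequence[bool]) -> Decision:
    """Pick a decision from the prevailing sampler result and the sampled
    flags of the links known at span creation time."""
    if prevailing_sampled:
        return Decision.SAMPLED                 # sample as usual
    if any(link_sampled_flags):
        return Decision.EXPORTED_UNSAMPLED      # a creation-time link is sampled
    return Decision.CONDITIONALLY_RECORDED      # wait for possible late links
```

Note that this sketch never returns a plain "drop": every unsampled span is at least held, which is the memory cost of supporting late links.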

I hope this sketch is more complete! I didn't actually add a direction attribute to Links; I just require them to be recorded when either side is sampled, for completeness. A new "exported-unsampled" Sampler decision is required even without support for adding span links after creation (to @kalyanaj's point). A new "conditionally-recorded" Sampler decision would be required to support span links after creation (to @pyohannes's feature request).

yurishkuro commented 1 year ago

@jmacd btw, some Jaeger SDKs utilized a state similar to "conditionally-recorded" (we called it deferred sampling), to support sampling based on span attributes that become available only after span start. It's a bit of a kludge, because the state only makes sense until a child span is created, at which point the sampling decision needs to be finalized.

I am, however, not convinced that sampling considerations are the deciding factor for allowing adding links post creation. The exact same arguments could be made for disallowing span attributes after span creation, yet we allow that. Just because sampling questions become more difficult with post-creation links, it does not negate the fact that there are use cases that can benefit from late links, especially in scenarios that sample everything (e.g. CI or other devexp workflows).

pyohannes commented 1 year ago

> The idea is that because a Sampler has access to the sampled flag of its parent context and other preceding (linked-to) contexts, then we have these capabilities:
>
> We can ensure completeness by making the right Sampler decision
>
> We can verify completeness when reviewing Span data.

This is true. However, I think ensuring completeness across linked traces makes you lose another crucial capability: effectively enforcing a fixed sampling rate.

If you make a sampling decision based on links to two upstream spans, both upstream spans sampled with a probability of 10%, you're sampling the span with the probability that at least one of the two upstream spans was sampled. This probability is higher than 10%, and, the more links you have, it approaches 100%.

In cases where there is heavy batching and where there are several layers of links, the actual sampling volume could end up being much higher than what one might expect based on the probability decision at the root.
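
A small calculation illustrates the effect. It assumes that upstream sampling decisions are independent and that each span in a downstream layer is sampled exactly when at least one of its links is sampled; the function names and the layered model are illustrative only.

```python
def p_any_link_sampled(p: float, n: int) -> float:
    """Probability that at least one of n independently sampled links is sampled."""
    return 1.0 - (1.0 - p) ** n


def layered_rate(root_rate: float, links_per_span: int, layers: int) -> float:
    """Effective sampling rate after several layers of link-based sampling,
    where each span links to `links_per_span` spans of the previous layer."""
    rate = root_rate
    for _ in range(layers):
        rate = p_any_link_sampled(rate, links_per_span)
    return rate


# Two upstream spans, each sampled at 10%.
print(round(p_any_link_sampled(0.10, 2), 2))  # 0.19

# Batches of 10 messages across two layers of links, starting from a 10% root rate.
effective = layered_rate(0.10, links_per_span=10, layers=2)
```

Under these assumptions the effective rate climbs quickly toward 100% as batch size or the number of layers grows.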

While this is not to be seen as an argument for adding span links after span creation, I think it illustrates that probably not all capabilities we intend to provide can be fully utilized at the same time, but there might be trade-offs based on usage scenarios.

jmacd commented 1 year ago

> I think it illustrates that probably not all capabilities we intend to provide can be fully utilized at the same time, but there might be trade-offs based on usage scenarios.

I agree. We can't avoid the fundamentals of sampling.

What we can do is provide new Sampler implementations that give users a choice. If users would like to record a span that is linked-to by others, they should be able to do so without causing entire other traces to be collected. If that capability will co-exist with what we have today, it means two new Sampler decision codes as I outlined above, one to say "maybe record this span, depending" and one to say "record an untraced span".

jmacd commented 1 year ago

@yurishkuro Thanks for explaining "deferred sampling". Comparing the two span states that I described with the one from Jaeger: Jaeger's "deferred sampling" state is similar to, but different from, the one I called "conditionally recorded". A span could remain in the conditionally recorded state after its first child is created, up until span end, because at any moment a new span link could appear and cause the span to become "exported-unsampled".

Using the Jaeger term "deferred" instead of "conditionally" would give us a complete list of span states:

1. sampled (implies recorded and exported, an existing spec)
2. recorded, deferred sampling (implies no children yet, can still enter state 1, 3)
3. recorded, deferred exporting (implies not sampled, can still enter state 4, 5, or 6)
4. recorded, exported (implies the span will export when it ends, a new state to support late span links)
5. recorded, not exported (implies no desire to export the unsampled span, an existing feature to support e.g., z-pages, an existing spec)
6. not recorded (an existing spec)
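
The list above can be read as a small state machine. The sketch below numbers the states in list order and encodes only the transitions the list mentions; treating every other state as terminal is an assumption of this sketch, and none of the names come from an existing SDK.

```python
from enum import IntEnum


class SpanState(IntEnum):
    SAMPLED = 1
    DEFERRED_SAMPLING = 2       # recorded, no children yet
    DEFERRED_EXPORTING = 3      # recorded, not sampled, may still be exported
    RECORDED_EXPORTED = 4       # will export when the span ends
    RECORDED_NOT_EXPORTED = 5   # e.g. for z-pages
    NOT_RECORDED = 6


# Transitions named in the list; states without an entry are assumed terminal.
TRANSITIONS = {
    SpanState.DEFERRED_SAMPLING: {
        SpanState.SAMPLED,
        SpanState.DEFERRED_EXPORTING,
    },
    SpanState.DEFERRED_EXPORTING: {
        SpanState.RECORDED_EXPORTED,
        SpanState.RECORDED_NOT_EXPORTED,
        SpanState.NOT_RECORDED,
    },
}


def can_transition(src: SpanState, dst: SpanState) -> bool:
    """True if the list above allows moving from src to dst."""
    return dst in TRANSITIONS.get(src, set())


# States that can still change, i.e. the two "deferred" ones.
deferred_states = sorted(TRANSITIONS)
```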

So, it looks like three new states if you combine Jaeger's deferred sampling decision with the deferred exporting decision requested to support span links after start.

In case anyone is wondering about next steps: adopting this kind of support in OpenTelemetry will require prototypes. Interested parties should look at https://github.com/open-telemetry/opentelemetry-specification/issues/2179, too.
