Closed jmacd closed 3 years ago
@jmacd
This makes sense. It makes me wonder, though, if we have a way to generate the correct randomness, what's really keeping us from using random bits in TraceID? If we have random bits in the TraceID, the number-of-leading-zeros test can be applied directly, right?
Yes, if the trace ID is specified to be truly random, it would be possible to use it without hashing.
...and I gather that the answer is -- it requires less randomness with your proposal. In the version where we add a single byte to convey randomness, the expected number of random bits required is 2 in this case (for the reader, I believe we're talking about the mean of a https://en.wikipedia.org/wiki/Geometric_distribution with p=0.5). Can you confirm?
Yes, the given pseudocode generates a geometrically distributed random number with p=0.5 and takes just 2 random bits on average.
What I was trying to say is that all samplers (AlwaysOn
, AlwaysOff
, Parent
, and TraceIDRatio
) can be implemented using the same logic as the TraceIDRatio
sampler, provided that the sampling rate of the parent is propagated. They only differ how the sampling rate is chosen for each span. For AlwaysOn
it is 1, for AlwaysOff
it is 0, for Parent
it is the parent sampling rate, and for TraceIDRatio
it is the fixed sampling rate. One could therefore say that a sampler should only be responsible for the selection of the sampling rate. The sampling decision is then always made according to the trace ID ratio logic.
To better reflect that in the interface, I would even redefine the sampler interface which is currently a function mapping to a boolean indicating the sampling decision. If it is defined as a function that maps each individual span to the corresponding sampling rate, the sampler implementations would be very simple. For example, TraceIDRatio
would just return its fixed sampling rate and Parent
would return the sampling rate of its parent. The final sampling decision would be made outside of the sampler following a specified and fixed logic based on the shared random number (or trace ID) and the sampling rate only.
The proposed redefined sampler interface would have following benefits:
TraceIDRatio
based on span attributes, are easy to implement and would not break estimation algorithms.@oertl Thank you. I definitely understand your proposal. You've described a simplification in the Sampler interface, which is an SDK specification topic that I was hoping to avoid. I agree with your summary, the overall Sampler outcome would be easier to reason about with the change you suggest, but I am not sure that legacy users of the API will agree. The amount of energy it will take to make this happen may exceed the benefit produced.
I am trying to keep these OTEPs separate:
Changing the Sampler API would be another issue entirely.
I will revise this PR based on the discussion in https://github.com/open-telemetry/oteps/pull/168#discussion_r677859626, which means combining the two pieces of additional information that go into the traceparent
to both include the head probability and the randomly distributed value, as they are both apparently needed to complete a functional span-to-metrics pipeline.
Personally I would prefer to separate the propagation proposal from the W3C format discussion. Whatever happens with this OTEP is not going to have any (formal) impact on changing the W3C spec, the discussion would need to happen there. On the other hand, the actual functionality is not dependent on changing the W3C format because it can be achieved by using tracestate
, which would be completely in the hands of OTEL & if adopted & successful would serve as a strong motivation for upgrading the traceparent
spec.
@oertl and @yurishkuro
This makes recording the sampler name (which would introduce additional overhead and which was mentioned here -- #148) obsolete.
In my proposal, in the text of #170 (now), the sampler.name
attribute is only a MUST when the adjusted count is unknown, which would only result from legacy trace contexts (if we succeed). I see no reason to record the sampler.name
otherwise, especially if as you suggest the TraceIDRatio
decision is used consistently.
@yurishkuro Thank you for your guidance. I have revised this proposal only to use tracestate
.
@oertl I have revised this to include your proposal with a syntax like:
tracestate: otelprob=PPRR
where PP is the base16 probability value and RR is the base16 randomness value. Please take another look.
I have prototyped changes described in this OTEP here. I was able to make this change without modifying the OTel-Go Sampler
API.
The prototype raises several issues around specifying how to handle errors, and I think we can specify the behavior of errors in such a way that consumers get an unambiguous signal about sampling.
The greatest question the prototype raises for me is how to handle zero probability. In the prototype, I've placed a limit on the range of the geometric random variable with (as in @oertl's paper) the value 0 representing a span that meets every sampling threshold, 1 representing a span that is sampled 50% of the time, and so on. I believe we should limit this value to a reasonable number--I used 64 as the maximum value of this number and I think it best if we let this maximum value represent zero probability. This makes 63 the greatest value, which corresponds with an adjusted count of 2^63. The question this raises for me is how to handle the boundary condition--in practice it probably does not matter, but how should a Sampler behave if it sees a string of 64 leading zeros? (Should it reject the trial and start again? I think this is correct.)
To summarize: random value 64 represents a probability of zero, not of 2^-64. @oertl what do you think?
@jmacd I would limit the smallest sampling with 1/2^62 as 2^62 is the largest power of two that can be represented with a signed 64-bit integer (important to represent the adjusted count in languages like Java which only have signed data types). As a consequence, the exponent of the sampling rate X could be encoded with just 6 bits. The value X=63 is still free and can be used to represent a sampling rate of zero.
The number of leading zeros of a 64bit random number gives values from 0,1,2,...,64. I would trim to values 0,1,2,...,62 such that 6 bits are also sufficient to encode the propagated random number R. Here are the possible values with corresponding probabilities
Hence, a span is sampled if X <= R. If X=63 sampling will never occur, because the maximum value of R is 62.
I've updated this OTEP after all the feedback, mentioning the prototype and the discussion in this week's W3C meeting.
While reviewing the Span
protocol for OTEP 170 I realized that Span
encodes its tracestate
For the proposed otel
vendor key within tracestate
here, I would specify to DROP that entry (or at least the r:
and p:
sub-fields) before serializing a Span
as those values are not directly useful without knowing which Sampler was used.
I have updated this PR to support a minimum sampling probability of 2^-61. This allows the Span field described in #170 to use six bits of information, not 7, to encode the additional Unknown probability value.
Discussed in the Sampling SIG:
traceparent
version 1 supports this with ~3 bytes per context.I revised this OTEP with more content and more examples, following discussion that took place in https://github.com/open-telemetry/opentelemetry-specification/pull/1899.
No significant changes have been made, only correcting off-by-one errors now that the p
value of 0 indicates Unknown probability.
I added specification for several corner cases, suggesting that p
can be omitted when not sampled and that p
should not be respected when r
is not set, for example.
The final version of this text uses:
s
where sampling probability is 2**s
p
where 0 is unknown, 63 is zero, and 1-62 are known non-zeros
r
in the range [0, 61]
and the sampled flag equivalent to p-1 < r+1 && p < 63
. I think this definition makes the formulas most readable, since s
has a natural form, p
is the same as for the Span's log_head_adjusted_count
(from OTEP 170), and r
equals the number of leading zeros in a random bit string.
Please take a careful look @oertl, thanks!
@open-telemetry/specs-trace-approvers please take a look. The feedback has been addressed and the corner cases are explicitly spelled out. Thank you @oertl and @reyang.
@open-telemetry/specs-trace-approvers
This OTEP has received a lot of review and we are down to a number of questions that I feel should be resolved as we move this work into the specification--this PR and #170 could merge. The remaining questions:
If you agree that these are the only remaining questions, please review and approve both OTEPs and we can move on to debating these topics.
@open-telemetry/specs-trace-approvers as announced in the Monday maintainer's meeting, this will proceed next week unless objections are raised here. There was one lengthy review comment thread (https://github.com/open-telemetry/oteps/pull/168#discussion_r704955698) that I resolved by opening a PR to revise OTEP 170 with more detail on how to compose Samplers, please see #175.
@jmacd please fix the check-errors.
One big question that I don't get the answer from this OTEP is if this is a required feature.
Discussed in "Default behavior". I'd prefer this were on-by-default, but with the present tracestate
solution it's expensive. Maybe after there's a W3C traceparent, where this costs 1-3 bytes per context, this can be on by default. I'll accept off-by-default for now-- Lightstep's OTel launchers would turn this on by default. I imagine "yet another" environment variable to say whether the TraceIDRatioBased
sampler generates r
or not.
@jmacd where should we discuss/agree if this is or not the default behavior?
@bogdandrutu Just FYI there's a SIG call on 8 minutes, although don't know who package are you today ;)
default behaviors will be discussed in specs
Yes. FWIW, I've already changed this OTEP to state that the default will be opt-in. We're just going to debate the name and form of the option.
This OTEP specifies how to propagate head trace sampling probability, so that child spans can be counted on the fly. This specification would allow
ParentBased
samplers to output sampling probabilities in a way that supports span-to-metrics pipelines, for example.This OTEP is paired with #170 which discusses additions to the span data model.