open-telemetry / oteps

OpenTelemetry Enhancement Proposals
https://opentelemetry.io
Apache License 2.0
340 stars 164 forks source link

Non-power-of-two consistent tail probability sampling in TraceState #226

Closed jmacd closed 1 year ago

jmacd commented 1 year ago

This is meant to address https://github.com/open-telemetry/opentelemetry-specification/issues/1413. Follows https://github.com/open-telemetry/oteps/blob/main/text/trace/0170-sampling-probability.md and https://github.com/open-telemetry/oteps/blob/main/text/trace/0168-sampling-propagation.md.

Cc: @oertl @kalyanaj @PeterF778

jmacd commented 1 year ago

@oertl Thank you. I accept your suggestions and wonder, would you be interested in drafting blocks of replacement text for your scheme? I'll be glad to work out the pseudocode snippets from there, but your words for "hex/binary threshold equals number of spans dropped out of 2^56", for explaining the 52-bits vs 56 bits issue, and the use of greater-or-equal, I take it, for filtering. I would be glad to then work on writing a spec to deprecate p-value in favor of t-value, where as you say the threshold "0" equals "00000000000000" indicating all spans with 7-bytes >= "00000000000000".

oertl commented 1 year ago

As alternative to the t-value, the p-value definition could be extended. For example, p:11.a3d could be interpreted as 11 leading zeros followed by a 1 and the bits defined by the extra hex code "a3d". Remaining bits are filled by ones. In this case, the value would be

00000000000    1  1010      0011      1101       111111111111111111111111111111111111     = 0x001a3dffffffff = 28853590294527 

11 leading        hex "a"   hex "3"   hex "d"    32 one bits  to fill up to 56 bits    
zeros

The resulting value corresponds be the number of kept spans out of 2^56 subtracted by 1.

The example above would correspond to a sampling probability of (28853590294527 + 1)/2^56 = 0.000400424

With this definition p:0, p:1, p:2, etc. would correspond to power of two sampling probabilities as already defined.

jmacd commented 1 year ago

p:11.a3d

By the way, in my prototype I used the syntax a3d-11 where you've written 11.a3d (b/c I read this as "a3d shifted 11"). In this draft, to simplify, I proposed to encode the 11 zeros (i.e., 00000000000a3d), but I agree that encoding the zero count separately is more compact for most probabilities.

As for whether we overload how to parse p of not, that's a separate question-- it would break existing uses of p-value. That's why I prefer to create a new variable, but since p is marked experimental I'm open to this change.

kalyanaj commented 1 year ago

fer to create a new variable, but since p is marked experimental I'm open to this change

If p-value is not used currently in any implementations (or if they are okay with a breaking change since it is still experimental as you call out), yes it looks like extending p-value to also encode non-powers-of-two sampling probabilities is a good idea by @oertl.

Reasoning: Since p-value and t-value are conceptually for the same purpose (to encode sampling probabilities), ideally it will be good to just extend the p-value concept - it will make it a bit easier to understand for folks who are already familiar with the purpose of p-value, rather than trying to think of t-value as a new concept.

PeterF778 commented 1 year ago

With respect to obsoleting the r-value, I reported a new issue 3307 today. If we agree that the issue is real (I'm not entirely sure), the solution I proposed would make consistent probability samplers work differently if there's no r-value.

jmacd commented 1 year ago

This was discussed in yesterday's Sampling SIG. Since the initial feedback, I had come to the following rough idea to use all the bits of the TraceID, to consistently decide how to interpolate between powers of two.

Considering an 8-bit example with 75% sampling:

cc @oertl @PeterF778

oertl commented 1 year ago

@jmacd

This was discussed in yesterday's Sampling SIG. Since the initial feedback, I had come to the following rough idea to use all the bits of the TraceID, to consistently decide how to interpolate between powers of two.

Unfortunately, after rethinking this proposal, I have concluded that it would not allow correct estimation of trace quantities (e.g. estimating the number of traces touching one service A and another service B) as described in my paper. It would only work for span estimates (e.g. estimating the number of spans of service A). The proposal violates the basic assumption that the choice of the sampling probability is independent of the shared randomness (trace ID or p-value).

PeterF778 commented 1 year ago

Let's take one step back.

The Consistent Probability Sampling schema already allows consistent head and tail sampling if the sampling probability is a power of 2. Now we want to extend it so that consistency (head and tail) is preserved even for non-power-of-2 probabilities.

It is obvious, I think, that we need to use the random bits of trace-id somehow, if we want to get consistent sampling with any given probability, meaning multiple instances of (head or tail) samplers making the same sampling decisions wrt spans belonging to the same trace. From the past discussions, it looks like we have two categories of possible extensions:

A. Allow (approximation of) any non-power-of-2 sampling probabilities in TraceState.

B. Continue to restrict the sampling probabilities in TraceState to power-of-2 values.

and we can always have

C. Drop Consistent Probability Sampling as it has been proposed and replace it with something else.

Now we could try to summarize the benefits and disadvantages of these approaches. The way I see it, the benefits of A are:

while the potential issues are:

Respectively, approach B is the reverse of that, with the drawbacks being:

and the advantages are:

With respect to C, it is a very wide open field, but let's keep in mind that for expressing sampling probabilities we do not need high precision, we need wide range of values. That's why the logarithmic scale of r-values and p-values is so powerful and efficient.

If we try to replace that with a linear model based on 56 random bits in trace-id, we quickly run out of bits.

kalyanaj commented 1 year ago

Thanks @PeterF778 for consolidating the tradeoffs - overall the summary of the tradeoffs makes sense to me.

I didn't quite understand the below two points - can you please clarify/elaborate?

sets of spans/traces can be re-sampled multiple times using the same mechanism and still encode the resulting adjusted count accurately

and this final point about running out of bits:

If we try to replace that with a linear model based on 56 random bits in trace-id, we quickly run out of bits.

jmacd commented 1 year ago

03f693c Contains an update based on recent discussions.

jmacd commented 1 year ago

@kentquirk new draft: df6b1d0

PeterF778 commented 1 year ago

A very nice and clean proposal! A few questions though: it says it is extending the r-value and p-value proposal - the p-value will be used to calculate the final adjusted count, but is the r-value used anywhere? And an old question remains: why tail sampling requires explicit non-power-of-two sampling probabilities as opposed to (deterministic) interpolation between two adjacent power-of-twos? What do we gain or lose with either approach?

PeterF778 commented 1 year ago

I would like to see how to provide algorithms for tail sampling based on the t-value, for the following use cases:

  1. (static sampling) Given a set of N1 spans, I want to re-sample it to arrive at approximately N2 spans.
  2. (dynamic sampling) Given a stream of incoming spans, I want to re-sample it so I get at most X spans/s (approximately).

The potential challenge I see is that the algorithms may require multiplying t-values and/or sorting the spans according to their t-value.

oertl commented 1 year ago

@PeterF778

I would like to see how to provide algorithms for tail sampling based on the t-value, for the following use cases:

  1. (static sampling) Given a set of N1 spans, I want to re-sample it to arrive at approximately N2 spans.

I believe collecting N2 spans with smallest random value can be done in O(N1) time and O(N2) space without sorting. The idea is to have a threshold T (initial value 1) and a buffer of size N2*(1 + eps) with some constant eps > 0 that stores all spans with random values (interpreted as values from [0,1)) smaller than T. If the buffer is full, the span with the (N2+1)-th largest random value is chosen using Quickselect . This random value defines the new threshold T which has an expected value of E(T) = 1/(1 + eps). All spans with random values >= T get dropped. The remaining N2 spans are kept and put into the first part of the buffer. This compaction step takes O(N2*(1+eps)) time. After that, the buffer has again space for N2*eps new spans. We fill that with the next spans in the data stream having a random value smaller than the threshold T. Since the expected value of the threshold is 1/(1 + eps), it takes N2*eps*(1+eps) spans on average to fill the buffer again, when the next compaction takes place. The new threshold will have an expected value of E(T) = 1/(1+eps)^2. And so on.

Therefore we need

O(N2*(1+eps)) time to process the first N2*(1 + eps) spans O(N2*(1+eps)) time to process the next N2*eps*(1+eps) spans O(N2*(1+eps)) time to process the fnext N2*eps*(1+eps)^2 spans ...

In summary, it takes O(K*N2*(1+eps)) time to process

N2*(1 + eps) + N2*eps*(1+eps) + N2*eps*(1+eps)^2 + ... + N2*eps*(1+eps)^(K-1) = N2*(1+eps)^K

spans. The average processing costs per span are therefore

O(K*N2*(1+eps) / N2*(1+eps)^K) = O(K / (1+eps)^(K-1)) = O(1)

which is constant. I am not 100% sure yet what t-value needs to be assigned to the surviving spans to get unbiased counts. Probably it is sufficent to replace the current t-value, if it is larger, by the threshold T. This Adaptive Threshold Sampling paper might be helpful in this context.

  1. (dynamic sampling) Given a stream of incoming spans, I want to re-sample it so I get at most X spans/s (approximately).

If there was no previous sampling stage, we need to estimate the incoming rate from spans in the past (e.g., with exponential smoothing). Using this estimate and the desired rate, we can calculate a sampling probability that is used as the threshold. If there were previous sampling stages, the random values are no longer uniformly distributed and the actual distribution must be considered when choosing the threshold. For example, to achieve a 50% reduction, the threshold would need to be set to the median of all random values. To obtain a sampling probability of p, one would have to set the threshold to the p-quantile. It might be possible to use techniques like Frugal Streaming to incrementally adjust the threshold.

PeterF778 commented 1 year ago

I'm not sure I understood your algorithm for static sampling correctly, @oertl. Correct me, if I'm wrong, but it looks like the selection of spans to survive is affected heavily by their t-value, rather than the source of randomness (trace-id), which introduces a bias. Suppose that the original set of N1 spans contains a subset S of spans that have been already heavily sampled, so their t-values are very low - much lower than the spans not belonging to set S. In that case, if the size of set S is smaller than N2, all elements of set S will survive the re-sampling process. IMHO, an unbiased algorithm should sample spans from set S roughly with the same chances of survival as those from outside set S.

oertl commented 1 year ago

IMHO, an unbiased algorithm should sample spans from set S roughly with the same chances of survival as those from outside set S.

It depends on what is meant by bias. My understanding is the following: The expected adjusted count is equal to 1. This is ensured by setting the adjusted count to the inverse of the sampling probability. This should be the case with the algorithm described above.

@PeterF778, I think your concern is that the algorithm balances the sampling probabilities of the previous sampling stages. Depending on what you want to estimate, this sampling strategy may or may not be beneficial. I haven't checked, but I think the algorithm is similar to VarOpt sampling (see here), which minimizes variance when estimating arbitrary subset sums. This algorithm makes sense if you have a sampling stage that combines samples collected in earlier sampling stages with different probabilities for no good reason, e.g., due to unbalanced load or short-term load fluctuations.

The situation is different if you have already identified certain classes of spans (e.g., spans with errors) that should be sampled with a higher probability. In this case, you want to sample these spans more frequently than others by purpose. I think the correct term for this is stratified sampling, where weights are defined for different classes of spans and the sampling algorithm tries to sample them so that the ratios of the weights are reflected by the corresponding ratios of the t-values. To complicate matters further, different stages of sampling would define different weights for the same span. Early stages may not be able to assess the importance of a span while later sampling stages have a more holistic view of the trace and therefore might assign a different weight to a span. For example, if the child span has an error.

Anyway, discussion of these sampling strategies takes us a bit off topic. The same issues must be solved when using power-of-two sampling probabilities.

PeterF778 commented 1 year ago

@PeterF778, I think your concern is that the algorithm balances the sampling probabilities of the previous sampling stages. Depending on what you want to estimate, this sampling strategy may or may not be beneficial.

Yes, I imagine "static sampling" to be applied to aging data, after one or many stratified sampling steps were performed .

Anyway, discussion of these sampling strategies takes us a bit off topic. The same issues must be solved when using power-of-two sampling probabilities.

If we want to compare the two competing consistent probability sampling mechanisms, we have to understand what they will entail in all processing stages. I believe the re-sampling algorithms will be very similar in principle, but there could be differences in complexity or accuracy of the results.

oertl commented 1 year ago

If we want to compare the two competing consistent probability sampling mechanisms, we have to understand what they will entail in all processing stages. I believe the re-sampling algorithms will be very similar in principle, but there could be differences in complexity or accuracy of the results.

Yes, this is true. For example, for "dynamic sampling" it is probably much simpler to estimate quantiles if there is a power-of-two discretization as you could simply aggregate into a histogram as there are only a small number of relevant values. However, any sampling stage is free to use just power-of-two sampling probabilities, if it is more efficient. Even, if spans come with non-power-of-two t-values, they could be easily downsampled to the next power of two in a first step.

jmacd commented 1 year ago

New developments discussed in the Sampling SIG today.

We proposed "s-value" as a mechanism to encode the accumulation of independent non-consistent sampling stage adjusted counts. t-value and s-value would be separate fields, both included in tracestate for consistency. Existing vendor-specific sampling probabilities with unknown-and-presumed-independent sampling mechanisms will encode probability or adjusted count (as with t-value encoding) using tracestate s-value.

In the coming week I will resolve the conversations above. The plan is to draft a proposed change to https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/tracestate-probability-sampling.md and document/justify the changes in this OTEP. Most of the existing specification will be re-used.

jmacd commented 1 year ago

I've updated the document making "r-value" optional (and 56-bits) and making "s-value" for independent non-consistent sampling. I've incorporated some helpful feedback from @kalyanaj. Thanks.

kentquirk commented 1 year ago

@jmacd , @kalyanaj , @oertl , @PeterF778

As promised in the SIG meeting, I have implemented a proposal for how to calculate threshold, sampling rate, and sampling probability in several different languages (Go, JavaScript, and Python).

The proposal is simple -- here's the Go version:

func threshold(tValue float64) int64 {
    const k = 0x1p+56
    if tValue < 1.0 {
        return int64(k*tValue + 0.5)
    }
    return int64(k/tValue + 0.5)
}

func samplingRate(tValue float64) int64 {
    if tValue < 1.0 {
        return int64(1.0/tValue + 0.5)
    }
    return int64(tValue + 0.5)
}

func samplingProbability(tValue float64) float64 {
    if tValue < 1.0 {
        return tValue
    }
    return 1.0 / tValue
}

All 3 languages seem to deliver identical results in the first 2 functions.

I was unable to convince all 3 languages to format floating point values identically (I probably could have by installing various libraries but I wanted to use the standard libraries). But they seemed to be identical through the first 14 digits.

Thoughts?

oertl commented 1 year ago

I wonder if we should prefer a canonical lossless (e.g. hex-encoded and only values less than or equal to 1) representation of the t-value that does not depend on platform- or language-dependent behavior (like the r-value). This would make encoding and parsing much simpler and less costly. For scenarios with multiple sampling stages, the t-value needs to be parsed and encoded frequently, and therefore this should be done as efficient as possible.

kentquirk commented 1 year ago

I understand your point, but I'm personally much more concerned about human usability. What I see from users in real sampling configurations are sampling rates like 1 in 10, 1 in 6, and 1 in 1000.

rate hex value < 1
10 0x1.999999999999ap-4
6 0x1.5555555555555p-03
1000 0x1.0624dd2f1a9fcp-10

All of those representations lose the user's intent, while 10, 6, and 1000 do not.

There is also the problem that as far as I have found, many languages do not support hex representation in floating point. Or if they do, they do so differently.

I strongly feel that we should bias toward comprehensible and easily implementable human representations.

oertl commented 1 year ago

I understand your point, but I'm personally much more concerned about human usability. What I see from users in real sampling configurations are sampling rates like 1 in 10, 1 in 6, and 1 in 1000.

I also understand your point. But the limited number of bits simply does not allow to sample exactly 1 out of 3 consistently. In my opinion, it is more important to forward the actual sampling probability applied, not the user's intent. If you know that all your sampling configurations are of kind "1 out of X", it is relatively easy to reconstruct X from the applied sampling probability to make the user believe that it really was sampled as originally intended. Maybe an additional flag indicating that the reported t-value comes from a "1 out of X" rule could be a compromise.

In large systems, sampling probabilities are typically automatically chosen (e.g. based on rate limits), and it is more important that the parsing/encoding overhead is small.

There is also the problem that as far as I have found, many languages do not support hex representation in floating point. Or if they do, they do so differently.

It is easy to specify a lossless hex representation for the t-value. It could be defined as the integer threshold value used when comparing with the 56 random bits of the trace ID (or the optional r-value). If the t-value is an integer, it is straightforward to find a platform/language-independent hex representation. This definition would also reduce floating point operations as the sampling decision is simply the result of comparing the t-value with the random bits.

jmacd commented 1 year ago

I have drafted the introductory text of the changes I would propose in the tracestate-probability-sampling.md spec.

jmacd commented 1 year ago

This OTEP has been well-replaced by #235. Thanks @kentquirk!