jaegertracing / jaeger-client-go

🛑 This library is DEPRECATED!
https://jaegertracing.io/
Apache License 2.0
1.38k stars 287 forks source link

Probabilistic sampler causes trace collisions #582

Closed vprithvi closed 2 years ago

vprithvi commented 3 years ago

Problem

The probabilistic sampler non-uniformly selects traces to be sampled, increasing the probability of trace collisions. The probability of trace collisions greatly increases when using a very small sampling rate.

The probabilistic sampler relies on the fact that traceIDs are 63bit random number, and make a sampling decision without independently generating a random number, and instead checking whether the traceID < (samplingRate * 2^63). Note that only half the traceID, traceID.Low is used for this computation.

The probabilistic sampler computes a SamplingBoundary as follows:

https://github.com/jaegertracing/jaeger-client-go/blob/2cf72ee48d160dfa450b8699742daf0e9b3ecea4/sampler.go#L107 https://github.com/jaegertracing/jaeger-client-go/blob/2cf72ee48d160dfa450b8699742daf0e9b3ecea4/sampler.go#L130

For the purposes of illustration, let's assume the smallest expressible samplingBoundary, 0x1

The probabilistic sampler computes whether a trace is sampled as follows:

https://github.com/jaegertracing/jaeger-client-go/blob/2cf72ee48d160dfa450b8699742daf0e9b3ecea4/sampler.go#L145

As samplingBoundary is 0x1, this selection logic will only ever select traces where the traceID is 0x1.

While the probabilistic sampler select traces probabilistically, the range of traceIDs selected is coupled to the probability.

Because the traceIDs of traces sampled by the probabilistic sampler can only lie in a particular range, we can compute the number of traceIDs required to generate a collision by using the analysis of the birthday problem.

image

where n(p;b) denotes the number of random integers drawn from [1, 2^b] to optain a probability p that at least two numbers are the same.
and b is the number of bits used
and p is the probability of a collision

We see that if the number of bits is 32 (e.g. setting samplingProbability to 0.5), there only needs to be approx 30k traceIDs produced to satisfy a 0.1 probability of collision.

Proposed Solution

The probabilistic sampler could generate the sampling decision independently of the traceID, and thus not bias the range of selected traceIDs.

This will probably cause an increase in cpu utilization from random number generation and we may want to provide a way to revert to previous behavior. As other samplers like PerOperationSampler and GuaranteedThroughputProbabilisticSampler depend on this, the change could be pretty involved.

yurishkuro commented 3 years ago

Similar discussion https://github.com/open-telemetry/oteps/pull/135#issuecomment-842209594