jaegertracing / jaeger

CNCF Jaeger, a Distributed Tracing Platform
https://www.jaegertracing.io/
Apache License 2.0
20.47k stars 2.44k forks source link

Cannot use rate limiting lower than 1 trace per second #2543

Open linjmeyer opened 4 years ago

linjmeyer commented 4 years ago

Describe the bug Rate limiting sampler strategy param is an integer. Setting a decimal param value does not work, forcing a minimum rate of 1 trace per second.

To Reproduce Steps to reproduce the behavior:

  1. Set a sampling strategy of "ratelimiting" with a param of "0.5" for a service
  2. Make a GET call to the sampling endpoint (e.g. using curl), it will return {"strategyType":"RATE_LIMITING","rateLimitingSampling":{"maxTracesPerSecond":0}}

Expected behavior I would expect to be able to specify a rate using a decimal, for example 0.5 traces per second

Screenshots If applicable, add screenshots to help explain your problem.

Version (please complete the following information):

Additional context The problematic code seems to be here, where Param is cast to an int16.

I have filed as a bug report because I have checked several clients and they seem to support decimal values:

yurishkuro commented 4 years ago

Unfortunately, the schema dictates an int:

https://github.com/jaegertracing/jaeger-idl/blob/52fb4c944067f7661e6a5fa23ba4c44c6f9c2923/thrift/sampling.thrift#L30

linjmeyer commented 4 years ago

I see, I didn't know that Thrift repo existed. Could we convert this to a feature request or should I open a new one?

jpkrohling commented 4 years ago

Sorry if this is obvious, but why do you need to sample less than 1 trace per second? Do you have a huge number of endpoints?

yurishkuro commented 4 years ago

We used <<1.0 rate quite a lot, because some API services have 1000s of instances, especially with event loop based languages where it's common to run one process per core. However, we used it for defaultLowerBoundTracesPerSecond, which is already a double.

https://github.com/jaegertracing/jaeger-idl/blob/52fb4c944067f7661e6a5fa23ba4c44c6f9c2923/thrift/sampling.thrift#L45

linjmeyer commented 4 years ago

We use Jaeger in almost all of our deployments, something like 500-800 containers running Jaeger. The higher being on the weekends typically when we see more traffic. I think (up to) 800 traces per second is too high. Even ignoring the scaling/storage costs, we just don't need that many as we don't have the head count to look at all that data.

yurishkuro commented 4 years ago

why don't you use probabilistic sampling?

linjmeyer commented 4 years ago

Sorry for the confusion, we are using probabilistic. These are REST APIs (ASP.NET), so 0.6 ends up with ~60% of incoming HTTP requests being sampled for example.

The per operation sampler you linked sounds perfect! Being able to set upper/lower bounds would be great, 60% during peak times is a lot, while theres room to spare during lower traffic periods.

yurishkuro commented 4 years ago

Upper-bound is not supported in the per-operation sampler because it breaks the ability to extrapolations full traffic measurements from the samples. But since Jaeger doesn't make use of those at the moment anyway, and if you don't care, then implementing the rate limiting upper bound sounds like a useful feature. For people who care about extrapolations, the upper bound can be an optional feature.

jpkrohling commented 4 years ago

Got it, that's pretty much the only use-case I could think of.

In any case, is networking an issue for you? You can always down-sample (--downsampling.ratio) at the collector if your concern is only storage.

linjmeyer commented 4 years ago

I'd rather avoid dropping spans on the collector if possible. We log the Trace ID on each service if it is sampled. Our devs know if they see an error or log message in Loki with a TraceID, they can go to Jaeger and look at the exact request. If we are downsampling we lose the ability to accurately do that as the client doesn't know the collector dropped the trace.