honeycombio / refinery

Refinery is a trace-aware tail-based sampling proxy. It examines whole traces and intelligently applies sampling decisions (whether to keep or discard) to each trace.
Apache License 2.0

Find a way to combine rules sampler with throughput sampler #525

Open kentquirk opened 1 year ago

kentquirk commented 1 year ago

Is your feature request related to a problem? Please describe.

A customer is using a rules-based sampler, but is running into bursty situations where the sampled output exceeds their desired event volume, and Honeycomb then rate-limits them. Raising rate limits helps, but there's value in allowing the sample rates in the rules to move in response to actual throughput.

A feature like this could be part of allowing the throughput sampler to operate based on cluster throughput rather than individual instances.

Describe the solution you'd like

An idea is to allow the rules-based sampler to have a multiplier value (we'll call it throttle) that is normally 1, but could be increased to a larger value by a throughput sampler. If throughput exceeds the defined maximum, the throttle would be set to the ratio between the actual throughput and the desired throughput. The throttle is applied as a multiplier to the sample rates in the rules -- provided those rates are already greater than 1. (A sample rate of 1 means "keep every one of these", so any trace matching such a rule should still be kept regardless of the throttle.)

Example: suppose you're sampling http.status 200s at a sample rate of 1000, 400s at 10, and 500s at 1 -- and then this knob gets turned up by a third (throttle of about 1.33) -- you'd be sampling at 1333, 13, and 1 after rounding.
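To make the arithmetic concrete, here is a minimal Go sketch of the proposal above. It is not Refinery code: the function names `computeThrottle` and `applyThrottle`, the use of rounding to the nearest integer, and the throughput figures are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// computeThrottle (hypothetical name) returns 1 while throughput is at or
// below the desired maximum, and the ratio actual/desired once it exceeds it.
func computeThrottle(actual, desired float64) float64 {
	if desired <= 0 || actual <= desired {
		return 1
	}
	return actual / desired
}

// applyThrottle (hypothetical name) scales a rule's sample rate by the
// throttle. Rates of 1 mean "keep everything" and are left untouched.
// Rounding to the nearest integer is an assumption; the issue does not
// specify a rounding mode.
func applyThrottle(rate int, throttle float64) int {
	if rate <= 1 || throttle <= 1 {
		return rate
	}
	return int(math.Round(float64(rate) * throttle))
}

func main() {
	// Illustrative numbers: throughput is a third over the desired maximum.
	throttle := computeThrottle(4000, 3000)
	for _, rate := range []int{1000, 10, 1} {
		fmt.Println(rate, "->", applyThrottle(rate, throttle))
	}
}
```

With throughput a third over the goal, the three rates from the example come out as 1333, 13, and 1.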

The system should have some hysteresis to avoid fiddling with the throttle all the time.

Also, if Honeycomb returns a 429 (rate limit), the throttle should immediately be increased.
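One possible shape for the hysteresis and the 429 reaction, sketched in Go. Everything here is an assumption for discussion: the `Throttle` type, the relative deadband used as the hysteresis mechanism, and the fixed `BumpOn429` factor (the issue only says the throttle should increase immediately, not by how much).

```go
package main

import (
	"fmt"
	"math"
	"net/http"
)

// Throttle is a hypothetical holder for the current multiplier.
type Throttle struct {
	Value     float64 // current multiplier, starts at 1
	Deadband  float64 // relative change required before Value moves (0.1 = 10%)
	BumpOn429 float64 // factor applied immediately on a rate-limit response
}

// Update recomputes the target throttle from observed throughput, but only
// adopts it when it differs from the current value by more than the
// deadband. Small fluctuations therefore leave the throttle alone.
func (t *Throttle) Update(actual, desired float64) {
	target := 1.0
	if desired > 0 && actual > desired {
		target = actual / desired
	}
	if math.Abs(target-t.Value) > t.Deadband*t.Value {
		t.Value = target
	}
}

// OnResponse raises the throttle right away when the upstream API answers
// with 429, bypassing the deadband.
func (t *Throttle) OnResponse(status int) {
	if status == http.StatusTooManyRequests {
		t.Value *= t.BumpOn429
	}
}

func main() {
	t := &Throttle{Value: 1, Deadband: 0.1, BumpOn429: 1.5}

	t.Update(1050, 1000) // 5% over: inside the deadband, no change
	fmt.Println(t.Value)

	t.Update(1500, 1000) // 50% over: throttle follows the ratio
	fmt.Println(t.Value)

	t.OnResponse(http.StatusTooManyRequests) // immediate bump
	fmt.Println(t.Value)
}
```

A deadband is only one way to get hysteresis; a minimum hold time between adjustments or separate raise and lower thresholds would serve the same goal.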

Describe alternatives you've considered

It's possible to increase the rate limit on a per-customer basis, but this solution would allow customers to more easily stay within their existing rate limits. Done right, this could be a recommended feature of rules-based sampling.

Additional context

kentquirk commented 8 months ago

This has been punted into the next release too many times, and it just came up in Slack again (internal link). Let's try and make this one happen.