tokio-rs / tracing

Application level tracing for Rust.
https://tracing.rs
MIT License

Rate limit layer #291

Open LucioFranco opened 5 years ago

LucioFranco commented 5 years ago

A use case we have for https://github.com/timberio/vector is to rate limit certain noisy logs. The way we would like to define them is per callsite and to set a static window where only one event will be enabled per that window.

We currently already have an implementation https://github.com/timberio/vector/tree/master/lib/tracing-limit.

The issue with this is how to mark a callsite as rate limited and how to define the window. Currently, the implementation above uses a rate_limit_secs field to identify a callsite as rate limited, and the field's u64 value specifies the window duration in seconds. The benefit of this method is that we can move the callsite around and it will still be rate limited.

The downside to this approach is that instrumenting rate-limited logs now requires knowledge of the specific subscriber setup. In our case this is fine, but for a more general tracing-rate-limit crate it will not work.
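The per-callsite, static-window behavior described above (one event allowed per window, keyed by callsite) can be sketched in plain Rust. This is a hypothetical illustration, not the vector implementation: callsites are modeled as opaque ids, where in tracing they would be callsite::Identifiers taken from an event's Metadata, and the PerCallsiteLimiter type and its enabled method are invented names for this sketch.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical per-callsite limiter: at most one event per static window.
struct PerCallsiteLimiter {
    window: Duration,
    // Last time each callsite was allowed to emit an event.
    last_seen: Mutex<HashMap<usize, Instant>>,
}

impl PerCallsiteLimiter {
    fn new(window: Duration) -> Self {
        Self { window, last_seen: Mutex::new(HashMap::new()) }
    }

    /// Returns true if an event at `callsite` should be emitted:
    /// the first event in each window passes, the rest are suppressed.
    fn enabled(&self, callsite: usize) -> bool {
        let mut map = self.last_seen.lock().unwrap();
        let now = Instant::now();
        match map.get(&callsite) {
            // Still inside the window: suppress.
            Some(&last) if now.duration_since(last) < self.window => false,
            // New window (or first event ever): record and allow.
            _ => {
                map.insert(callsite, now);
                true
            }
        }
    }
}

fn main() {
    let limiter = PerCallsiteLimiter::new(Duration::from_secs(60));
    // First event at callsite 1 passes; a second within the window does not.
    println!("{} {}", limiter.enabled(1), limiter.enabled(1));
    // A different callsite has its own independent window.
    println!("{}", limiter.enabled(2));
}
```

Note that each callsite's window is independent, which matches the "per callsite" requirement; a single global limiter would instead let one noisy callsite starve all the others.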

@hawkw has suggested using const metadata values or updating the Interest enum to support other types of interest.

hawkw commented 5 years ago

@LucioFranco in your use-case, what are the motivations for rate-limiting? Is it intended to improve performance by eliding certain events, or just to reduce the noise of the logs for a human reader?

LucioFranco commented 5 years ago

@hawkw both: in previous Elixir-based applications we found that printing every log line could actually bring the application down, and the logs were also noisy in the UX sense.

hawkw commented 5 years ago

@LucioFranco have you done any benchmarking of your current rate-limiting implementation in vector vs not limiting events?

The reason I bring this up is that if rate-limiting is intended for performance reasons as well as UX, that places an additional set of constraints on the design. It seems pretty easy to accidentally design a rate-limiting layer that introduces more overhead than just logging all events, depending on what features we decide to support. In particular, if we come up with an implementation that needs to look at the field values of events from a callsite which it has already decided is currently rate-limited, my guess is that there probably won't be a significant perf improvement over not rate limiting, if rate-limiting the event involves synchronization and a syscall to get a timestamp.

LucioFranco commented 5 years ago

@hawkw I do have some basic benchmarks, generally, they show that a very naive implementation doesn't add enough overhead that it will affect application performance. How I see it, rate-limiting for performance reasons should only be used if you expect a large burst of logs where the overhead gets amortized. This should be a conscious decision by the instrumentor.
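The "large burst of logs where the overhead gets amortized" case can be made concrete with a sketch whose hot path is a single atomic operation, avoiding the per-event synchronization and timestamp syscall that the previous comment warns about. This is a hypothetical illustration (the BurstLimiter type and its budget-refill policy are invented for this sketch, and refilling tokens on a timer outside the hot path is assumed, not shown):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical burst limiter: a fixed token budget, decremented once per
/// event. The hot path is a single `fetch_update` on one atomic, so the
/// per-event cost is small whether the event is allowed or suppressed.
struct BurstLimiter {
    remaining: AtomicU32,
}

impl BurstLimiter {
    fn new(budget: u32) -> Self {
        Self { remaining: AtomicU32::new(budget) }
    }

    /// Returns true while tokens remain; false once the budget is spent.
    fn check(&self) -> bool {
        // Decrement if nonzero; `checked_sub` returns None at zero,
        // which makes `fetch_update` leave the value alone and fail.
        self.remaining
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |n| n.checked_sub(1))
            .is_ok()
    }
}

fn main() {
    let limiter = BurstLimiter::new(2);
    // First two events pass, the rest of the burst is suppressed.
    println!("{} {} {}", limiter.check(), limiter.check(), limiter.check());
}
```

During a burst, the fixed cost of the limiter is paid once per event but saves the much larger cost of formatting and writing each suppressed line, which is where the amortization argument comes from.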

hawkw commented 5 years ago

Some other thoughts: we've currently reserved the ability to add more complexity to tracing_core::Interest (note that Interest::always(), Interest::sometimes(), and Interest::never() are functions rather than associated constants like Level's "variants"), so we could also consider implementing sampling and rate-limiting by adding some support to tracing-core (potentially still using a subscriber::Layer to actually determine the rate-limited Interests). However, this kind of approach would require special care to keep the core crate as minimal as possible, and to ensure that the sampling/rate-limiting support covers all possible use-cases.

hawkw commented 5 years ago

> @hawkw I do have some basic benchmarks, generally, they show that a very naive implementation doesn't add enough overhead that it will affect application performance. How I see it, rate-limiting for performance reasons should only be used if you expect a large burst of logs where the overhead gets amortized. This should be a conscious decision by the instrumentor.

Sure, I'm just thinking that the design decisions (i.e. what kinds of rate-limiting do we want to support? what APIs would we provide?) ought to be guided by performance goals. Do you happen to have numbers for those benchmarks? It would be interesting to know what the targeted performance improvements would look like...

LucioFranco commented 5 years ago

@hawkw yeah, I am gonna update my benchmarks with the new Layer based implementation. So I will get back to you on that.

dgrr commented 1 year ago

Friendly ping :)

loriopatrick commented 4 months ago

*Edit: I used Layer incorrectly at first. Updated the code to use Filter (tested)

I don't think it should be part of the library, but I did find myself adding a custom log rate-limiting filter. Sadly, my application is a frequent DDoS attack target, and appending to disk often eats up a lot of CPU when attackers are able to hit an application path that logs a little too much.

Here's the code if others want to copy-paste:

use std::num::NonZero;

use governor::{Quota, RateLimiter};
use tracing::{Metadata, Subscriber};
use tracing_subscriber::layer::{Context, Filter, SubscriberExt};
use tracing_subscriber::Layer;

fn setup_logging() /* -> ... */ {
    // Sustained 10 events/sec, with bursts of up to 100 allowed.
    let quota = Quota::per_second(NonZero::new(10).unwrap())
        .allow_burst(NonZero::new(100).unwrap());
    let limiter = RateLimiter::direct(quota);

    // e.g. a fmt layer; any layer can be wrapped with the filter.
    let layer = tracing_subscriber::fmt::layer();

    tracing_subscriber::registry()
        .with(layer.with_filter(RateLimitLayer { limiter }))
        /* ... */
        ;
}

struct RateLimitLayer {
    limiter: RateLimiter<
        governor::state::NotKeyed,
        governor::state::InMemoryState,
        governor::clock::QuantaClock,
    >,
}

impl<S: Subscriber> Filter<S> for RateLimitLayer {
    fn enabled(&self, _meta: &Metadata<'_>, _cx: &Context<'_, S>) -> bool {
        // Drop the event entirely once the global rate limit is exceeded.
        self.limiter.check().is_ok()
    }
}