Open · LucioFranco opened this issue 5 years ago
@LucioFranco in your use-case, what are the motivations for rate-limiting? Is it intended to improve performance by eliding certain events, or just to reduce the noise of the logs for a human reader?
@hawkw both: we have found in previous Elixir-based applications that printing every log can actually bring the app down, and the logs also get noisy in the UX sense.
@LucioFranco have you done any benchmarking of your current rate-limiting implementation in vector vs not limiting events?
The reason I bring this up is that if rate-limiting is intended for performance reasons as well as UX, that places an additional set of constraints on the design. It seems pretty easy to accidentally design a rate-limiting layer that introduces more overhead than just logging all events, depending on what features we decide to support. In particular, if we come up with an implementation that needs to look at the field values of events from a callsite which it has already decided is currently rate-limited, my guess is that there probably won't be a significant perf improvement over not rate limiting, if rate-limiting the event involves synchronization and a syscall to get a timestamp.
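For concreteness, here is a minimal sketch (my own, not vector's implementation; the name `SampleEveryN` is made up) of the cheap end of that trade-off: a `Layer` that elides events by count in `enabled()`, before any field values are recorded and without touching a clock, so a suppressed event costs roughly one atomic increment.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

use tracing::{Metadata, Subscriber};
use tracing_subscriber::layer::{Context, Layer};

/// Keeps only every `n`th event. The point of the sketch is *where* the check
/// happens: in `enabled()`, before field values are recorded, with no
/// timestamp needed, so a suppressed event costs roughly one atomic increment.
struct SampleEveryN {
    n: usize,
    counter: AtomicUsize,
}

impl<S: Subscriber> Layer<S> for SampleEveryN {
    fn enabled(&self, meta: &Metadata<'_>, _cx: Context<'_, S>) -> bool {
        // Only sample events; let spans through untouched.
        if !meta.is_event() {
            return true;
        }
        self.counter.fetch_add(1, Ordering::Relaxed) % self.n == 0
    }
}
```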
@hawkw I do have some basic benchmarks; generally, they show that a very naive implementation doesn't add enough overhead to affect application performance. The way I see it, rate-limiting for performance reasons should only be used if you expect a large burst of logs where the overhead gets amortized. This should be a conscious decision by the instrumentor.
Some other thoughts: we've currently reserved the ability to add more complexity to `tracing_core::Interest` (note that `Interest::always()`, `Interest::sometimes()`, and `Interest::never()` are functions rather than associated constants like `Level`'s "variants"), so we could also consider implementing sampling and rate-limiting by adding some support to `tracing-core` (potentially still using a `subscriber::Layer` to actually determine the rate-limited `Interest`s). However, this kind of approach would require special care to ensure that we keep the core crate as minimal as possible, and that the sampling/rate-limiting support covers all possible use-cases.
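As a rough illustration of how a `Layer` can already hook into the `Interest` mechanism today (a sketch of the idea, not a finished design; the `rate_limit_secs` marker field is the convention from vector discussed elsewhere in this thread, and `RateLimitedInterest` is a made-up name):

```rust
use tracing::subscriber::Interest;
use tracing::{Metadata, Subscriber};
use tracing_subscriber::layer::{Context, Layer};

/// Marks callsites that carry a `rate_limit_secs` field as "sometimes"
/// interesting, so `enabled()` is consulted for every event they emit, while
/// all other callsites stay always-enabled and are never re-checked.
struct RateLimitedInterest;

impl<S: Subscriber> Layer<S> for RateLimitedInterest {
    fn register_callsite(&self, meta: &'static Metadata<'static>) -> Interest {
        if meta.fields().field("rate_limit_secs").is_some() {
            Interest::sometimes()
        } else {
            Interest::always()
        }
    }

    fn enabled(&self, _meta: &Metadata<'_>, _cx: Context<'_, S>) -> bool {
        // A real implementation would look up per-callsite rate-limit state
        // here and return false while the callsite is throttled.
        true
    }
}
```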
> @hawkw I do have some basic benchmarks; generally, they show that a very naive implementation doesn't add enough overhead to affect application performance. The way I see it, rate-limiting for performance reasons should only be used if you expect a large burst of logs where the overhead gets amortized. This should be a conscious decision by the instrumentor.
Sure, I'm just thinking that the design decisions (i.e. what kinds of rate-limiting do we want to support? what APIs would we provide?) ought to be guided by performance goals. Do you happen to have numbers for those benchmarks? It would be interesting to know what the targeted performance improvements would look like...
@hawkw yeah, I am gonna update my benchmarks with the new `Layer`-based implementation, so I will get back to you on that.
Friendly ping :)
*Edit: I originally used `Layer` incorrectly. Updated the code to use `Filter` (tested).
I don't think it should be part of the library, but I did find myself adding a custom log rate-limiting layer. Sadly my application is a frequent DDoS attack target, and appending to disk often eats up a lot of CPU when attackers are able to hit an application path that logs a little too much.
Here's the code if others want to copy-paste it:
```rust
use std::num::NonZero;

use governor::{Quota, RateLimiter};
use tracing::{Metadata, Subscriber};
use tracing_subscriber::layer::{Context, Filter};
use tracing_subscriber::prelude::*;

fn setup_logging() /* -> ... */ {
    // Sustained rate of 10 events per second, with bursts of up to 100 allowed.
    let quota = Quota::per_second(NonZero::new(10).unwrap()).allow_burst(NonZero::new(100).unwrap());
    let limiter = RateLimiter::direct(quota);

    // `layer` was left undefined in the original snippet; a fmt layer is one option.
    let layer = tracing_subscriber::fmt::layer();

    tracing_subscriber::registry()
        .with(layer.with_filter(RateLimitLayer { limiter }))
        /* ... */
        ;
}

/// Per-layer filter that drops events once the rate limit is exceeded.
struct RateLimitLayer {
    limiter: RateLimiter<governor::state::NotKeyed, governor::state::InMemoryState, governor::clock::QuantaClock>,
}

impl<S: Subscriber> Filter<S> for RateLimitLayer {
    fn enabled(&self, _meta: &Metadata<'_>, _cx: &Context<'_, S>) -> bool {
        // Allow the event only while the limiter still has capacity.
        self.limiter.check().is_ok()
    }
}
```
A use case we have for https://github.com/timberio/vector is to rate limit certain noisy logs. We would like to define this per callsite, with a static window in which only one event is enabled. We already have an implementation at https://github.com/timberio/vector/tree/master/lib/tracing-limit.
The issue with this is how we define that a callsite needs to be rate limited, and how we define that window. Currently, the implementation above uses a `rate_limit_secs` field that marks a callsite as rate limited and also lets us specify the window duration as a `u64` (see the sketch after this paragraph). The benefit of this method is that we can move the callsite around and it will continue to be rate limited. The downside is that instrumenting rate-limited logs now requires knowledge of the specific subscriber setup. In our case that is fine, but for a more general `tracing-rate-limit` crate this will not work. @hawkw has suggested using const metadata values or updating the `Interest` type to support other types of interest.
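For concreteness, a hedged sketch of what that calling convention looks like at a callsite (the field name is the one described above; the message text, value, and function name are made up):

```rust
use tracing::warn;

fn forward_event() {
    // The presence of `rate_limit_secs` marks this callsite as rate limited,
    // and its value is the window: the subscriber should emit at most one
    // event from this callsite per 10-second window. (Illustrative only.)
    warn!(rate_limit_secs = 10u64, "failed to forward event to sink");
}
```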