metrics-rs / metrics

A metrics ecosystem for Rust.
MIT License
1.13k stars 158 forks

How to use a new handle-based API? #271

Open loyd opened 2 years ago

loyd commented 2 years ago

I'm writing an actor system (elfo) and have run into inconvenient usage of the handle-based API that landed in the last metrics release (0.18).

Some prelude before questions.

All metrics exposed by actors are recorded with special labels (actor_group and, if configured, actor_key).

Let's consider one of the built-in metrics, e.g. the elfo_handling_message_time_seconds histogram, which is measured for every incoming message:

elfo_handling_message_time_seconds_sum{actor_group="some_actor_group", message="SomeMessage", ...} ..
...
elfo_handling_message_time_seconds_sum{actor_group="another_group", message="AnotherMessage", ...} ..
...

This metric is built on the fly from a static name and a static label set (generated for each Message), without registering any metrics beforehand. These labels are also used elsewhere in production code, without explicitly storing any handles. Moreover, every metric can be produced twice (per actor group and per actor), if so configured: if a group has few actors, it's useful to collect metrics about each of them individually, while metrics for the whole group are always collected.

All emitted metrics are handled by a dedicated actor whose implementation can be replaced by the user. A default implementation (elfo-telemeter) is also provided; it isn't based on metrics-exporter-prometheus because I want more control: compacting atomic buckets by interval, providing a sliding window, graceful termination, handling zero counters in a special way, providing an API for other actors to get metrics in a backend-agnostic structure, and so on. By the way, I don't understand why such a specific implementation as metrics-exporter-prometheus is stored in this repo at all, but that's another question; I only need the frontend and the useful utils (thanks for them).

Currently, using 0.17, the metric storage is implemented entirely in that dedicated actor.

In 0.18, with the handle-based API, I don't fully understand how to handle these cases. I need to register all metrics (both built-in and application-specific production metrics), which can have different actor keys (solvable with the layer system), as well as per-message metrics, which require efficient storage in different places.

And, more importantly, any metric can be recorded twice (per group and per actor), if so configured for a specific actor group.

Thus, I have two problems:

  1. Macro calls can lead to exposing one or two metrics (depending on config) with different labels. I only see a way to add labels in layers.
  2. Even in places where the key and labels are static, I cannot emit a metric directly and must register all of them first, which leads to placing storages here and there (per message, per log level, per dumping class, and so on). Should I use Registry everywhere?
tobz commented 2 years ago

Sorry that you've been having trouble using the new handle-based API in 0.18.

At a high-level, there's not much that changes from the perspective of the recorder:

This means that while in 0.17 updating a metric typically meant using the Registry::op pattern, you must now provide a handle to the actual metric storage so it can be updated in accordance with the metric type.

Let's pretend that the Registry is just HashMap<(MetricKind, Key), AtomicU64>. If you wanted to update a metric in 0.17, you might pass a closure that gets a reference to the atomic, i.e. |atomic| atomic.fetch_add(1, ...), in order to increment.

In 0.18, Registry looks more like HashMap<(MetricKind, Key), Arc<AtomicU64>>, and when a metric is registered, you are given back a clone of the Arc<AtomicU64> wrapped in the appropriate handle type (Counter/Gauge/Histogram), which is then used directly or by the emission macros (increment_counter!, etc).

If you have a specific example of trying to update one area of your code from metrics 0.17 to 0.18 that you can share, I might be able to provide more help. Looking at the code you linked, it's just not clear to me why it couldn't be adapted to 0.18.

loyd commented 2 years ago

Thanks for the fast response!

Let's focus on the problem of exposing several metrics from one place. For every counter!("some_metric") call I want to produce from 0 to 2 different metrics, depending on the configuration for a specific actor group:

some_metric{actor_group="a", ...}
some_metric{actor_group="a", actor_key="1", ...}

Currently, this is implemented by reading the rarely changing atomic configuration and calling Registry::op several times: click. It works because the previous version of Registry allowed me to have a custom key.

Let's try to use the new API:

struct Storage(Registry);

impl Recorder for Storage {
    fn register_counter(&self, key: &Key) -> Counter {
        // I cannot just get one metric by calling
        // `get_or_create_counter`, because:
        // 1. Additional labels should be added, which means
        //    recreating the `Key`, and that's expensive.
        // 2. If configured, I need to create two counters instead of one,
        //    and this can change at runtime.
        // 3. At runtime, if the configuration changes, registration should
        //    happen again, which is impossible for users who don't use macros.
        todo!()
    }
}

(3) could be resolved with some analogue of https://docs.rs/tracing-core/latest/tracing_core/callsite/fn.rebuild_interest_cache.html

Ok, let's create a special CounterFn instance that will do all the work at call time instead of at registration time:

struct MyCounter {
    storage: Arc<Storage>,
    key: Key,
}

impl CounterFn for MyCounter {
    fn increment(&self, value: u64) {
        if telemeter_per_actor() {
            self.storage.counter_per_actor(&self.key, value);
        }
        if telemeter_per_group() {
            self.storage.counter_per_group(&self.key, value);
        }
    }
}

The signature of register_counter forces the implementation to return a Counter, which means cloning Arc<Storage>. Registration occurs on every call of the macro, which means a lot of false sharing on the reference count inside Arc. Ok, I can use TLS and have a different Arc in each thread to avoid false sharing. Inconvenient, but possible. But in this case, it's impossible to use several storages (for tests or other purposes), and it's almost equivalent to having a global static Storage.

Also, the Key has to be cloned, which can be expensive too in the case of Owned labels/name inside Key.

Another, related problem is reusing Registry with a custom Key. Now Registry's methods accept metrics::Key instead of a custom key implementation. So, instead of just passing some kind of struct { actor_group: ..., key: Key }, I need to create a new Key with merged labels. The implementation of Registry is nice; I want to reuse it.

tobz commented 2 years ago

Answering out of order:

Another, related problem is reusing Registry with a custom Key. Now Registry's methods accept metrics::Key instead of a custom key implementation. So, instead of just passing some kind of struct { actor_group: ..., key: Key }, I need to create a new Key with merged labels. The implementation of Registry is nice; I want to reuse it.

That's a good point about Registry now requiring Key instead of a generic K. I had done some refactoring to memoize the hash of a key to speed up lookups, so I constrained everything to Key to make that easier to enforce. There's probably a better approach that keeps keys generic while still supporting memoized hashing; I opened #272 to explore such a design.

For every counter!("some_metric") call I want to produce from 0 to 2 different metrics, depending on the configuration for a specific actor group:

Thanks for this example. It's very clear now what you're trying to achieve and why the new handle-based design makes it harder/inefficient.

One thought in my mind: do actors ever change their labels once active? For example, do actor_group or actor_key change once an actor is running?

The reason I ask this question is that I wonder if you might potentially benefit from the new handle-based design by changing how you create metrics for a given actor. If we assume that the metric key/labels are fixed from when an actor starts until when it completes/dies, then it might be worth trying to actually store those metrics as fields on a struct rather than using the macros.

By doing so, you could avoid all the current logic that figures out if a metric needs to be updated or not. For example:

// All metrics for an actor.  This could also be implemented multiple times if
// you wanted to break down the metric groups so that only certain metrics are
// declared near where they are used, closer to the model of only using the macros.
struct ActorMetrics {
    messages_received: Counter,
    ...
}

impl ActorMetrics {
    pub fn create() -> Self {
        Self {
            messages_received: register_counter!("elfo_actor_messages_received"),
        }
    }
}

// Now you would make these changes in your `Recorder` to support returning a handle
// that can affect multiple metrics at the same time:
impl Recorder {
    fn with_params<R>(&self, f: impl Fn(&Storage, &Scope, bool) -> R) -> Option<Vec<R>> {
        scope::try_with(|scope| {
            let mut result = Vec::new();
            let perm = scope.permissions();
            if perm.is_telemetry_per_actor_group_enabled() {
                result.push(f(&self.storage, scope, false));
            }
            if perm.is_telemetry_per_actor_key_enabled() && !scope.meta().key.is_empty() {
                result.push(f(&self.storage, scope, true));
            }
            result
        })
    }
}

impl metrics::Recorder for Recorder {
    fn register_counter(&self, key: &Key) -> Counter {
        // `Storage` still controls and owns all data for all registered metrics, but now
        // you return individual `Counter` handles for each metric, and then wrap them in
        // a newtype that allows for returning a single `Counter` here in `register_counter`.
        let maybe_counters = self.with_params(|storage, scope, with_actor_key| {
            storage.register_counter(scope, key, with_actor_key)
        });
        maybe_counters.map(CounterGroup::wrap).unwrap_or_else(|| Counter::noop())
    }

    ...
}

struct CounterGroup {
    // Possible optimization is that you just return a direct `Arc<Atomic...>` clone
    // from `Storage::register_*` methods, since `Recorder::with_params` is generic
    // over return type, which would be slightly more efficient since there would be
    // no optional check like there is in `Counter` itself: the `Vec` having items, or
    // not, is what now dictates whether we update any metrics.
    counters: Vec<Counter>,
}

impl CounterGroup {
    fn wrap(counters: Vec<Counter>) -> Counter {
        let grouped = CounterGroup { counters };
        Counter::from_arc(Arc::new(grouped))
    }
}

impl CounterFn for CounterGroup {
    fn increment(&self, value: u64) {
        for counter in &self.counters {
            counter.increment(value);
        }
    }

    ...
}

While this code is not perfect -- there might be a restriction I'm not thinking of -- I think it could represent a workable design for your use case. Some benefits:

To your point about runtime changes, that is definitely trickier to solve... but it might be solvable in a generic way by using something like a configuration epoch? For example:

As I was typing that, I realized it wouldn't 100% work anyway, because you would need mutability to replace the existing counters in the proposed CounterGroup. However, if you used the proposed ActorMetrics approach, you would own it and could thus implement such functionality at that level instead.

An atomic load for every metric emit isn't great, but I believe my proposed design would still be faster overall, as there would be less centralized access to the Registry, where the internal hashmap is more of a bottleneck than an atomic load.

loyd commented 2 years ago

The main disadvantage of the suggested approach is that the metric set is unbounded: all metrics produced by user-written code would have to use the same approach, not only the ones provided by elfo. Moreover, the code producing metrics may not even know that it is called inside an actor system. metrics is becoming the de-facto default ecosystem for collecting metrics in Rust, so it's only a matter of time before some HTTP/WS/DB/etc client includes custom metrics using metrics. It's undesirable to force all of them to cache metrics instead of just using increment_counter!("ws_messages_received_total") in place, which immediately becomes expensive because of the costly registration step. Nobody expects that this line can be expensive. For instance, tracing caches all registrations after calling Subscriber::register_callsite and provides a method to rebuild that cache. This doesn't solve my problem, but I want to show that calling register() on hot paths is unexpected anyway. Also, in this case, the previous implementation would be expressible by storing Key in CounterGroup.

Maybe Recorder should have increment_counter/etc methods with a default implementation of self.register_counter().increment(). Although this could prevent further optimizations of the macros (like the static caching mentioned above).

you would own that and thus could implement such functionality at that level instead.

As I said above, metrics produced by users should behave similarly.

I had the thought that it already wouldn't 100% work because you would need mutability to replace the existing counters in the proposed

Yep, it's problematic. It's the reason why arc-swap exists, and even that is ~20x slower than plain atomic accesses (but still ~20x faster than parking_lot; I need to check seqlock as well).

direct access to the atomic storage: you don't even need to query the registry hashmap anymore if metric name/labels are static and don't change ... where the internal hashmap is more of a bottleneck than an atomic load.

As I see it, it's a sharded map internally. Of course, even with a precalculated hash, it's still slower than atomic loads. I'm sure that the handle API + ArcSwap in CounterGroup would work better. In general, I welcome the change; my concerns are about the macros and the unexpected overhead for users due to re-registering on each call.

tobz commented 2 years ago

The main disadvantage of the suggested approach is that the metric set is unbound: all metrics produced by user-written code should use the same approach, not only provided by elfo ones.

There's no reason you can't limit it to metrics that use a prefix of elfo_, etc.

Moreover, the code producing metrics may not even know that it is called inside an actor system. metrics is becoming the de-facto default ecosystem for collecting metrics in Rust, so it's only a matter of time before some HTTP/WS/DB/etc client includes custom metrics using metrics.

This is an explicit goal of metrics: just like tracing, an application can get metrics from its dependencies for "free", just by installing a recorder.

It's undesirable to force all of them to cache metrics instead of just using increment_counter!("ws_messages_received_total") in place, which immediately becomes expensive because of the costly registration step. Nobody expects that this line can be expensive. ... This doesn't solve my problem, but I want to show that calling register() on hot paths is unexpected anyway.

I think an important distinction is that "registration" in 0.18 is no different from actually "doing" the specific metric operation in 0.17. You never know if a metric exists yet, so the registry always has to look up the key, see if it exists, clone the key and create the storage if it didn't already exist, etc.

The primary difference in 0.18 is that we've separated looking up/registering a metric and actually performing the operation that updates it.

Yep, it's problematic. It's a reason, why arc-swap exists, and even it is 20x slower than atomic accesses (but still x20 faster parking_lot. Need to check seqlock also).

This is a good point, and it actually made me realize that Counter/Gauge/Histogram probably should have had mutable interfaces, since that would have enabled more interesting designs such as aggregating internally and then emitting a single time at Drop, etc.

In general, I welcome the change, my concerns are about macros and unexpected overhead for users because of re-registering on each call.

I had already mentioned above the subtlety around the changed naming, but I want to dig in a little more on the actual overhead/performance.

So, in general, the "registering" step is not that dissimilar to the code that existed before. For a call of counter!:

In 0.17:

In 0.18:

In both cases, having to look up the key in the registry's sharded map is the most expensive part by far. It outweighs the cost of cloning the Arc, although cloning the Arc does involve atomics, so in some cases, there's the potential for the compiler to make sub-optimal choices that reduce instruction-level parallelism, etc.

The explicit trade-off I made in 0.18 was that the performance for "non-optimal" usage -- calling the macros only, maybe with dynamic labels, etc -- becoming slightly worse was worth it for the improvement in the scenarios where applications can pre-register metrics.

In practical terms, the overhead is still very small. If we look at metrics-benchmark, a very stupid program that emits metrics as fast as possible (a counter, a gauge, then a histogram; all static keys), we can see that for a single producer, both 0.17 and 0.18 can easily push 30-35M calls/second on a modern CPU. If we increase the parallelism of the "producer" threads to 16, 0.17 becomes slightly faster than 0.18, managing to push ~14.5M calls/second vs ~12M calls/second, respectively. While 12 vs 14.5 is an 18% difference... you still have to fully saturate 16 producer threads to see that much of a difference.

If you're pushing that many metrics per second, or even 10% of that amount, I would argue that you likely understand the performance well enough to tweak how/when you emit metrics to avoid incurring undesired overhead.

The flipside of all of this is what 0.18 unlocks: the ability to hyperoptimize metrics overhead by using handles directly. While the ~30-35M calls/second for a single producer was the best you could do in 0.17, even using entirely static keys, using metric handles directly in 0.18 blows it away with ~110M calls/second. Even with a producer parallelism of 16, it's still at ~35M calls/second, over 2x what can be done when only using the macros.

loyd commented 2 years ago

In 0.18

You're considering the usual usage of metrics, not the one I described. In my case, CounterGroup has to include more information for re-registering if the config's epoch changes, including a clone of Key and, maybe, another Arc<Storage>.

If registration only happened once (like in tracing), it would be possible to clone the Key and any additional information only once.

However, implementing https://github.com/metrics-rs/metrics/issues/273 also can help in this case.

probably should have had mutable interfaces since it would have enabled more interesting designs such as aggregating internally and then emitting a single time at Drop, etc.

A good direction; it could solve my problem. At the least, Registry should be generic over the handle types; right now it uses a private Primitives. I'm glad that you agree about the need to generalize Registry.

in some cases, there's the potential for the compiler to make sub-optimal choices that reduce instruction-level parallelism

This is a much smaller problem than false sharing.

a very stupid program that emits metrics as fast as possible

Without any complex keys, that is. In practice, most metrics contain labels, often dynamic ones, and in 0.18 I'm forced to clone them on every call.

The flipside of all of this is what 0.18 unlocks

I don't argue with you about the pros of a separate registration step. I'm just trying to show that the current API is not complete enough. Either we need to somehow cache a handle per macro call, or we should provide Recorder::increment_counter (with a default register_counter().increment() implementation) and make it possible to override the behavior in cases like mine.

tobz commented 2 years ago

Just to take a step back...

My assertion here is that between 0.17 and 0.18, there wasn't a meaningful change to performance/efficiency for callers who just used counter!("my_metric_name", 1, "label" => "value", ...):

Assuming I make the changes to Registry to allow keys to be user-specified again, I believe that lets you switch to 0.18 with no meaningful change in performance. Do you agree with that? If not, what do you think is missing from my understanding or proposal?

tobz commented 2 years ago

Also, separately:

A good direction, it can solve my problem. At least, Registry should be generic over the handles type, now it uses a private Primitives.

You're right, and this was an oversight on my part. I opened #274 to fix that.

tobz commented 2 years ago

I just cut a new release -- metrics-util@v0.12.0 -- which covers a few changes pursuant to this issue:

I think these two things address the bulk of your original issues with the new handle-based design, so I'd be interested in any feedback if you have a chance to try them out.

loyd commented 2 years ago

I'm going to update metrics in elfo next week; it now seems to cover most of my questions, maybe all. I'll be back with feedback. Thanks a lot.

tobz commented 2 years ago

Sounds good. 👍🏻

I'm also working on refactoring metrics::Cow to additionally allow for wrapping an Arc<T> (in addition to the normal support for &'a T and T) and exploring a design for inlining strings directly, per your comment in #273.

loyd commented 3 months ago

Half a year ago, I made the storage implementation TLS-based to avoid contention when multiple actors share the same telemetry key. Yep, contention on a simple atomic counter, which I observed in my production. So now I'm even more stuck on the 0.17 version =(

It seems complicated to support TLS-based storage adequately with the current approach, because an Arc<dyn SomeFn> cannot point to a specific memory location: the location changes over time. It's the same problem as above, so I'm writing here instead of creating a new issue.

Which interface would cover such usage? We have two usage patterns:

  1. Acquiring the handle and storing it on the user side. This clutters up the product logic but avoids hashmaps (although, if a metric contains dynamic labels, some maps will be needed anyway).
  2. Log-like usage in place.

Both patterns are valid and can be used in different situations. metrics <=v0.17 was focused on the second pattern and >v0.17 on the first one.

So, I think the crate should allow redefining the in-place path (the second pattern) directly, by implementing Recorder::increment_counter() and so on. What do you think about this, @tobz? It seems backward-compatible.

The first pattern is also tricky with TLS-based storage: if the current thread changes, we should re-acquire the handle:

// Just an example to illustrate the idea.
impl Recorder for MyTlsRecorder {
    fn register_counter(&self, key: &Key, metadata: &Metadata<'_>) -> Counter {
        // The per-thread storage boxes each slot, so the address is stable.
        let entry: &Cell<u64> = self.threadwise.get_or_create(key, metadata);

        Counter::from_arc(Arc::new(GuardedCounter {
            thread_id: current_thread_id(),
            entry: entry as *const Cell<u64>, // stable address
        }))
    }
}

struct GuardedCounter {
    thread_id: ThreadId,
    entry: *const Cell<u64>,
}

impl CounterFn for GuardedCounter {
    fn increment(&self, value: u64) {
        if current_thread_id() == self.thread_id {
            let entry = unsafe { &*self.entry };
            entry.set(entry.get() + value);
        } else {
            // Re-register the counter in the current thread.
        }
    }
}

Metadata contains a lifetime, so we cannot simply clone key + metadata. However, this can be addressed by storing the needed information in the recorder impl, so it's tricky but not impossible.

tobz commented 3 months ago

Just wanted to say I saw this and I do plan on responding, but there's a lot here and life is a little busy for me at the moment so I need to find some time to digest everything you've laid out here so I can give a proper response. 😅

loyd commented 3 months ago

Yes, indeed, we're having a sluggish conversation here)

tobz commented 3 months ago

After reading your post, I think I understand now what you ended up having to do for your use case and why it's hard to achieve in metrics v0.18 and above.

I don't love the idea of adding new methods to Recorder, but for the sake of weighing the pros and cons:

Maybe I'm missing an obvious way to keep the handle types as-is (i.e. keep the existing pattern of counter!(...).increment(1), gauge!(...).set(42), etc), but this feels like it would be a big regression in simplicity of the current API.

Would it be helpful if we could somehow relax the requirements to allow for, say, Rc<T> or Arc<T>? I don't know if it's trivially possible, but I'm trying to think through what a smaller change could be that unlocks the functionality that you need.