sched-ext / scx

sched_ext schedulers and tools

rusty: Integrate stats with the metrics framework #377

Closed · jfernandez closed this 2 weeks ago

jfernandez commented 2 weeks ago

We need a layer of indirection between the stats collection and their output destinations. Currently, stats are only printed to stdout. Our goal is to integrate with various telemetry systems such as Prometheus, StatsD, and custom metric backends like those used by Meta and Netflix. Importantly, adding a new backend should not require changes to the existing stats code.

This patch introduces the metrics [1] crate, which provides a framework for defining metrics and publishing them to different backends.

The initial implementation includes the dispatched_tasks_count metric, tagged with type. This metric increments every time a task is dispatched, emitting the raw count instead of a percentage. A monotonic counter is the most suitable metric type for this use case, as percentages can be calculated at query time if needed. Existing logged metrics continue to print percentages and remain unchanged.
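For illustration, here is a minimal sketch of that pattern, assuming the handle-returning counter! API of recent metrics versions; the struct, field, and label value are illustrative, not the exact code in this patch:

```rust
use metrics::{counter, Counter};

// Illustrative only: register the counter handle once and reuse it on
// the hot path. The "type" label distinguishes dispatch types.
struct Metrics {
    dispatched_tasks_count: Counter,
}

impl Metrics {
    fn new() -> Self {
        Self {
            dispatched_tasks_count: counter!("dispatched_tasks_count", "type" => "direct"),
        }
    }
}

fn on_dispatch(metrics: &Metrics) {
    // Emit the raw monotonic count; percentages and rates are computed
    // at query time by the backend.
    metrics.dispatched_tasks_count.increment(1);
}
```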

A new flag, --enable-prometheus, has been added. When enabled, it starts a Prometheus endpoint on port 9000 (default is false). This endpoint allows metrics to be charted in Prometheus or Grafana dashboards.
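A sketch of what wiring that flag to an exporter might look like with the metrics-exporter-prometheus crate; the helper function and flag plumbing are assumptions, not the PR's actual code:

```rust
use metrics_exporter_prometheus::PrometheusBuilder;

// Hypothetical helper: install the Prometheus exporter only when the
// --enable-prometheus flag is set.
fn maybe_start_prometheus(enable_prometheus: bool) {
    if enable_prometheus {
        PrometheusBuilder::new()
            // Serve a scrape endpoint on 0.0.0.0:9000.
            .with_http_listener(([0, 0, 0, 0], 9000))
            .install()
            .expect("failed to install Prometheus exporter");
    }
}
```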

Example of charting the 1s rate of dispatched tasks by type: (chart screenshot omitted)

Future changes will migrate additional stats to this framework and add support for other backends.

[1] https://metrics.rs/

jfernandez commented 2 weeks ago

> This looks very similar to what I added to scx_layered. I'm mostly happy with the approach, but there's still a fair bit of boilerplate. One idea I kicked around was making a procedural macro to define stats from a struct, which would generate a lot of this boilerplate.

@dschatzberg can you give an idea of what that macro would look like? In general, I prefer not to add additional abstractions to metric clients and to be as close to the API as possible. But I'd be open to adding a macro if you give me some ideas of what it would look like.

Byte-Lab commented 2 weeks ago

This looks excellent, thanks again for working on it. Once we've addressed the few things that @dschatzberg pointed out, I think we can land this and iterate in tree. Something we'll want to consider addressing longer-term (which we already discussed offline, but I'm recording here for others' benefit): I think we could benefit from a metrics crate that abstracts some of this and avoids boilerplate, as I expect a lot of the schedulers will export stats in nearly the same way. In general, schedulers do the following:

- Define various stats as a big enum -> record those stats in a per-cpu array map in BPF -> read those stats in user space and record them elsewhere (i.e. to stdout right now). It'd be pretty slick if we could integrate with the build system and have some of this boilerplate be auto-generated.

If we could somehow make that declarative (sort of similar to https://github.com/sched-ext/scx/blob/main/scheds/c/scx_nest_stats_table.h and https://github.com/sched-ext/scx/blob/main/scheds/c/scx_nest.c#L210-L219 in scx_nest), I think it'd let us get rid of a good amount of code, and would make it a lot easier both to add stats and to understand them.

In any case, this LG for now! Once Dan's changes are addressed, I'll stamp and merge.

htejun commented 2 weeks ago

This looks better than OM and no objection from me. Some things to consider for the future:

jfernandez commented 2 weeks ago

@Byte-Lab @htejun agreed, and I'm aligned on the long-term vision. My plan is to use scx_rusty to learn the common observability patterns and then iterate to generalize this solution for all schedulers. I'll start working on removing the boilerplate soon.

@dschatzberg I applied your feedback, and I also created a struct to hold the metrics. If you are not opposed, I'd like to first create a terminal logging backend so that we can clean up the report fn, and then circle back to removing boilerplate with macros as I mentioned above.

dschatzberg commented 2 weeks ago

@jfernandez Yeah, that's totally fine by me. The idea I had regarding a macro was to make it more like the clap crate works with CLI options, where decorators on each field in the Metrics struct would allow a macro to generate the new() impl and even the fetch + increment bits.
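For reference, the clap pattern being alluded to looks roughly like this; the field mirrors this PR's flag, and the rest is illustrative:

```rust
use clap::Parser;

// An attribute on each field lets clap's derive macro generate all of
// the argument-parsing boilerplate; the macro idea above would do the
// same for metric registration and increment glue.
#[derive(Parser)]
struct Opts {
    /// Start a Prometheus endpoint on port 9000.
    #[clap(long)]
    enable_prometheus: bool,
}

fn main() {
    let opts = Opts::parse();
    println!("prometheus enabled: {}", opts.enable_prometheus);
}
```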

BTW, do you need .absolute() and not .increment() now that the counters are pre-registered?

jfernandez commented 2 weeks ago

> The idea I had regarding a macro was to make it more like the clap crate works with CLI options, where decorators on each field in the Metrics struct would allow a macro to generate the new() impl and even the fetch + increment bits.

Ah yeah, I'm familiar with this pattern and I like it as well. I will explore it when I focus on removing boilerplate.

> BTW, do you need .absolute() and not .increment() now that the counters are pre-registered?

@dschatzberg no, .increment() is still the correct function to call. Internally, counters should be a monotonically incrementing value; a gauge! is what we'd use to instrument an absolute value. In V1 of this PR, calling counter! was correctly registering the metric only once, and subsequent calls would return the cached metric. But initializing once and storing the handle in the struct is a bit more efficient and sets us up for removing boilerplate down the line.
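To make the distinction concrete, a small sketch of the three calls being discussed (metric names are illustrative):

```rust
use metrics::{counter, gauge};

fn main() {
    // A counter is monotonic: increment() adds a delta.
    counter!("dispatched_tasks_count", "type" => "direct").increment(1);

    // absolute() records a precomputed running total instead.
    counter!("dispatched_tasks_count", "type" => "direct").absolute(42);

    // A gauge holds a value that can go up and down.
    gauge!("nr_queued").set(0.75);
}
```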

How these monotonic counters are exported is up to the target backend. For Prometheus, since it's a poll-based system, the counter accumulates the value in memory until a scrape event occurs, after which it is reset back to 0. The Prometheus backend simply adds to the value, and you chart it with rate(dispatched_tasks_count[1m]) to see the per-second rate.

For something like Netflix's spectatord, state is handled by the daemon, and we'd emit the metric on every increment! call without needing to store any state.

jfernandez commented 2 weeks ago

I pushed one more change: I forgot to append _count to the end of the metric names after my refactor. That is a convention required by Prometheus, and I think we should follow it to be more explicit about metric types.

jfernandez commented 2 weeks ago

One more change: I got the convention wrong. It's _total, not _count. Updated.

Byte-Lab commented 2 weeks ago

> > The idea I had regarding a macro was to make it more like the clap crate works with CLI options, where decorators on each field in the Metrics struct would allow a macro to generate the new() impl and even the fetch + increment bits.

> Ah yeah, I'm familiar with this pattern and I like it as well. I will explore it when I focus on removing boilerplate.

> > BTW, do you need .absolute() and not .increment() now that the counters are pre-registered?

> @dschatzberg no, .increment() is still the correct function to call. Internally, counters should be a monotonically incrementing value; a gauge! is what we'd use to instrument an absolute value. In V1 of this PR, calling counter! was correctly registering the metric only once, and subsequent calls would return the cached metric. But initializing once and storing the handle in the struct is a bit more efficient and sets us up for removing boilerplate down the line.

@jfernandez I'm a bit confused here. These counter!s are monotonic, but on the BPF side they're incremented and never reset. So if we incremented by those values, we'd be incrementing with absolute values. Can we not use the absolute() function? In my mind a gauge is meant for something that's non-monotonic.

Edit: I see why we'd want to increment here instead of setting an absolute value if the counter is scraped and reset by the target backend, but if we do that, I think we'd need to track the actual increment relative to the last time we collected the stats.
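If the BPF counters really were never reset, that delta tracking might look like this sketch (struct and names hypothetical):

```rust
use metrics::Counter;

// Hypothetical: remember the previous BPF reading so that only the
// delta since the last collection is fed to increment().
struct DispatchStat {
    prev: u64,
    counter: Counter,
}

impl DispatchStat {
    fn report(&mut self, bpf_total: u64) {
        self.counter.increment(bpf_total.saturating_sub(self.prev));
        self.prev = bpf_total;
    }
}
```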

> How these monotonic counters are exported is up to the target backend. For Prometheus, since it's a poll-based system, the counter accumulates the value in memory until a scrape event occurs, after which it is reset back to 0. The Prometheus backend simply adds to the value, and you chart it with rate(dispatched_tasks_count[1m]) to see the per-second rate.

Hmm, I'm still not quite following how we won't need state to track this (meaning, we record what the value was last time, and then increment based on that). Even if we emit the metric on every call to increment!(), we're still emitting an absolute value.

> For something like Netflix's spectatord, state is handled by the daemon, and we'd emit the metric on every increment! call without needing to store any state.

jfernandez commented 2 weeks ago

@Byte-Lab ah! I assumed that the BPF stats were being reset on each loop. If they are always incrementing, then I need to rework this. Let me get back to you on this.

Byte-Lab commented 2 weeks ago

> @Byte-Lab ah! I assumed that the BPF stats were being reset on each loop. If they are always incrementing, then I need to rework this. Let me get back to you on this.

Sorry @jfernandez, as we discussed on Slack, you were totally correct. The values are reset on each read here: https://github.com/sched-ext/scx/blob/main/scheds/rust/scx_rusty/src/main.rs#L433-L435. What you have now LG.
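For context, the reset-on-read pattern referenced above amounts to something like this simplified sketch (not the actual main.rs code):

```rust
// Simplified: sum the per-CPU values and zero them, so each reporting
// interval only observes new activity.
fn read_and_reset(per_cpu: &mut [u64]) -> u64 {
    let total: u64 = per_cpu.iter().sum();
    per_cpu.iter_mut().for_each(|v| *v = 0);
    total
}

fn main() {
    let mut stats = vec![3u64, 5, 1];
    assert_eq!(read_and_reset(&mut stats), 9);
    assert_eq!(stats, vec![0, 0, 0]);
}
```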