Open sargun opened 5 years ago
Hi @sargun - what kind of metrics would you like to see on a per-cgroup basis?
Some of the perf metrics would be very valuable to see per cgroup. For example, it'd be valuable to say that all cgroups under dir X should have the perf events cycles and instructions monitored.
Per-cgroup performance counters are quite expensive in practice. While it is possible to collect at this level, some careful thought would need to go into strategies that mitigate the performance overhead. I do feel this would fit into the project, but I don't have any plans to work on it soon.
Do you have an opinion on how they should be exported, or how the API / config should look?
I think for regular exposition, it'd be best to gather the metrics under paths like cgroup/[name]/[sampler]/[metric]. For instance, if we had a cgroup named 'foo', the instructions performance counter would be cgroup/foo/perf/cpu/instructions.
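The naming scheme above can be sketched in a few lines of Rust. This is only an illustration: the function name cgroup_metric_path and its signature are hypothetical and not part of Rezolus, and it assumes the sampler is "perf" and the metric is "cpu/instructions" in the example from the comment.

```rust
/// Builds a per-cgroup metric path of the form
/// cgroup/[name]/[sampler]/[metric].
///
/// Hypothetical helper for illustration only; not an existing Rezolus API.
/// Nested cgroup names (which contain '/') are passed through unchanged.
fn cgroup_metric_path(cgroup: &str, sampler: &str, metric: &str) -> String {
    format!("cgroup/{}/{}/{}", cgroup, sampler, metric)
}
```

Calling cgroup_metric_path("foo", "perf", "cpu/instructions") yields the path from the example, cgroup/foo/perf/cpu/instructions.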
Ideally, for Prometheus exposition, we would probably use cgroup/perf/cpu/instructions with a label containing the cgroup name. Today, this isn't directly supported in the metrics library. I don't believe this is necessarily a blocker for this feature though.
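As a rough sketch of what the labeled form could look like in the Prometheus text exposition format: the helper below is hypothetical and not something the metrics library provides. It assumes that '/' and other characters not allowed in Prometheus metric names are replaced with '_', which is an assumption about the mapping rather than how Rezolus names its metrics.

```rust
/// Renders one sample line in the Prometheus text exposition format,
/// carrying the cgroup name as a label instead of embedding it in the path.
///
/// Hypothetical helper for illustration only. It does not handle a metric
/// name that starts with a digit.
fn prometheus_line(metric: &str, cgroup: &str, value: u64) -> String {
    // Prometheus metric names may only contain [a-zA-Z0-9_:], so every
    // other character (such as '/') is replaced with '_'.
    let name: String = metric
        .chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() || c == '_' || c == ':' {
                c
            } else {
                '_'
            }
        })
        .collect();

    // Label values must escape backslash, double quote, and newline.
    let label = cgroup
        .replace('\\', "\\\\")
        .replace('"', "\\\"")
        .replace('\n', "\\n");

    format!("{}{{cgroup=\"{}\"}} {}", name, label, value)
}
```

With this sketch, a single metric name is shared by every cgroup, and the cgroup name becomes a dimension that can be filtered or aggregated on the Prometheus side.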
Configuration gets a little tricky: we need an optional list of cgroups to instrument, and we need to indicate which samplers should collect per-cgroup telemetry. For instance, both perf and cpu would be things we could sample per-cgroup. Users may wish to collect per-cgroup for one, both, or none. Additionally, users may wish to collect for named cgroups or all cgroups. The real trick is going to be making this feel clean and natural. The per-cgroup collection should be opt-in.
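One possible shape for such a configuration, sketched in Rust. Every type and field name here (CgroupSelection, SamplerConfig, Config, per_cgroup) is hypothetical and does not reflect Rezolus's actual config. The sketch only captures the requirements stated above: a per-sampler setting, a choice between named cgroups and all cgroups, and opt-in by default.

```rust
/// Which cgroups a sampler should instrument. Hypothetical type.
#[derive(Debug, Clone, PartialEq)]
enum CgroupSelection {
    /// No per-cgroup collection (the opt-in default).
    Disabled,
    /// Only the listed cgroups.
    Named(Vec<String>),
    /// Every cgroup on the system.
    All,
}

impl Default for CgroupSelection {
    fn default() -> Self {
        CgroupSelection::Disabled
    }
}

impl CgroupSelection {
    /// Returns true if the given cgroup should be instrumented.
    fn includes(&self, cgroup: &str) -> bool {
        match self {
            CgroupSelection::Disabled => false,
            CgroupSelection::Named(names) => names.iter().any(|n| n == cgroup),
            CgroupSelection::All => true,
        }
    }
}

/// Per-sampler settings. Hypothetical type.
#[derive(Debug, Default)]
struct SamplerConfig {
    per_cgroup: CgroupSelection,
}

/// Top-level config with one entry per sampler that could collect
/// per-cgroup telemetry. Hypothetical type.
#[derive(Debug, Default)]
struct Config {
    perf: SamplerConfig,
    cpu: SamplerConfig,
}
```

Keeping the selection on each sampler lets a user enable per-cgroup collection for perf, cpu, both, or neither, and a default-constructed Config collects nothing per-cgroup.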
More importantly, we need to consider the performance impact. Even a single hardware event, such as CPU instructions, instrumented with perf on a per-cgroup basis can impose a significant performance penalty on the running services. This becomes apparent when multiple cgroups are used for isolation. Ideally, Rezolus should not cause measurable impact to running services. We want to be sure we're keeping the overhead low both within the project and in terms of impact across the system. A strategy needs to be developed to help mitigate the performance penalty. Based on prior experiments with this, I don't believe this work is trivial. This might not make a good first issue.
How do you feel about supporting this for non-hardware events? For example, context-switches, and syscalls?
Same. There's a measurable performance impact on sensitive workloads when gathering even SW events per-cgroup when there are several cgroups on the system. I'll give some thought to this and see if there's some way to enable this work to happen, or whether it'll be possible for me to take this on myself.
Is there any intent to support gathering metrics per cgroup?