
Expose pressure stall information in `host_metrics` source #15663

Open runiq opened 1 year ago

runiq commented 1 year ago

Use Cases

In short, PSI is like load average, but it is independent of the number of CPUs and offers subsecond temporal resolution, per-resource resolution (CPU, memory, IO), and, on a cgroups v2 system, per-cgroup resolution. There is an introductory article by Facebook and an LWN article from when the PSI patchset was introduced to the Linux kernel (it eventually landed in kernel 4.20).

For the use cases, I'll let the original kernel patch submission speak for itself:

> PSI aggregates and reports the overall wallclock time in which the tasks in a system (or cgroup) wait for contended hardware resources.
>
> This helps users understand the resource pressure their workloads are under, which allows them to rootcause and fix throughput and latency problems caused by overcommitting, underprovisioning, suboptimal job placement in a grid, as well as anticipate major disruptions like OOM.
>
> […]
>
> We also use psi memory pressure for loadshedding. Our batch job infrastructure used to use heuristics based on various VM stats to anticipate OOM situations, with lackluster success. We switched it to psi and managed to anticipate and avoid OOM kills and hangs fairly reliably. The reduction of OOM outages in the worker pool raised the pool's aggregate productivity, and we were able to switch that service to smaller machines.
>
> Lastly, we use cgroups to isolate a machine's main workload from maintenance crap like package upgrades, logging, configuration, as well as to prevent multiple workloads on a machine from stepping on each others' toes. We were not able to configure this properly without the pressure metrics; we would see latency or bandwidth drops, but it would often be hard to impossible to rootcause it post-mortem.
>
> We now log and graph pressure for the containers in our fleet and can trivially link latency spikes and throughput drops to shortages of specific resources after the fact, and fix the job config/scheduling.

Attempted Solutions

A tool for monitoring PSI locally is below. As for Vector, I'm currently using the `exec` source to scrape the metrics myself. I've named the host metrics `{cpu,memory,io}_pressure_{some,total}_seconds` and the per-cgroup metrics `cgroup_{cpu,memory,io}_pressure_{some,total}_seconds`. Both kinds of metrics are labeled with the hostname, and the per-cgroup metrics are labeled with the cgroup name as well, matching how the `host_metrics` source labels its other metrics.

The host-total metrics live in `/proc/pressure/{cpu,memory,io}`, while the per-cgroup metrics live in `/sys/fs/cgroup/<cgroup>/{cpu,memory,io}.pressure`. These files all share the exact same format:

```
some avg10=0.53 avg60=1.07 avg300=1.08 total=4315113341
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
```

For me, the sole interesting bit is the `total` counter, which is the number of microseconds this resource has spent under pressure since boot. I divide that by 1,000,000 to get seconds, shove it into Prometheus, and calculate the rate over time to get per-cgroup resource pressure.
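To make that arithmetic concrete, here is a minimal sketch in Rust (Vector's implementation language) of parsing one such file. `parse_psi` and the printed metric names are my own illustration, not Vector internals; the metric name is keyed off the file's own `some`/`full` line labels:

```rust
use std::fs;

/// Parses the `total=` field (microseconds under pressure since boot)
/// from each line of a PSI file and converts it to seconds.
/// Returns (line label, seconds) pairs, e.g. ("some", 4315.113341).
fn parse_psi(path: &str) -> std::io::Result<Vec<(String, f64)>> {
    let contents = fs::read_to_string(path)?;
    let mut out = Vec::new();
    for line in contents.lines() {
        // Lines look like:
        // some avg10=0.53 avg60=1.07 avg300=1.08 total=4315113341
        let mut fields = line.split_whitespace();
        let kind = match fields.next() {
            Some(k) => k.to_string(),
            None => continue,
        };
        if let Some(total) = fields.find_map(|f| f.strip_prefix("total=")) {
            if let Ok(micros) = total.parse::<u64>() {
                // `total` is in microseconds; divide to get seconds.
                out.push((kind, micros as f64 / 1_000_000.0));
            }
        }
    }
    Ok(out)
}

fn main() -> std::io::Result<()> {
    for (kind, seconds) in parse_psi("/proc/pressure/cpu")? {
        // Roughly the naming scheme described above, using the
        // file's own labels, e.g. cpu_pressure_some_seconds.
        println!("cpu_pressure_{kind}_seconds {seconds}");
    }
    Ok(())
}
```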

Proposal

The `host_metrics` source would be the most obvious place for PSI info, seeing how it already exposes other metrics about the host and cgroups.

As for metric and label naming, see above: I'd name the host metrics `{cpu,memory,io}_pressure_{some,total}_seconds` and the per-cgroup metrics `cgroup_{cpu,memory,io}_pressure_{some,total}_seconds` (with some light arithmetic involved to convert microseconds to seconds, of course). Both kinds of metrics would be labeled with the hostname, and the per-cgroup metrics with the cgroup name as well, matching how the `host_metrics` source already does it. All in all it's fairly obvious, I think. :)

Something to look out for would be that PSI has only been exposed since kernel 4.20; that excludes systems still running older kernels.

I'm not entirely sure whether that is a problem or not. I've noticed that Vector can deal with differences between cgroup hierarchies, so I'd assume that if the information is not available, Vector would just not expose it.
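For what it's worth, detecting PSI support is cheap. A minimal sketch (again Rust, my own illustration, not Vector code) that assumes the standard `/proc/pressure` location:

```rust
use std::path::Path;

/// PSI files only exist on kernels >= 4.20; on kernels built with
/// CONFIG_PSI_DEFAULT_DISABLED they additionally require the `psi=1`
/// boot parameter. If the files are absent, a collector can simply
/// skip the pressure metrics.
fn psi_available() -> bool {
    Path::new("/proc/pressure/cpu").exists()
}

fn main() {
    if psi_available() {
        println!("PSI is available; pressure metrics can be collected.");
    } else {
        println!("No PSI on this kernel; skipping pressure metrics.");
    }
}
```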

References

`{cpu,memory,io}.pressure` have been mentioned in https://github.com/vectordotdev/vector/issues/11251#issuecomment-1044274066, but that issue was about something else.

Version

vector 0.26.0 (x86_64-unknown-linux-gnu c6b5bc2 2022-12-05)

shomilj commented 1 year ago

@runiq can you share your configuration for the `exec` source? We are looking to collect the same data :)