prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0

Qdisc collector does not expose queues with a parent #3088

Open bh-tt opened 1 month ago

bh-tt commented 1 month ago

Host operating system: output of uname -a

Linux k8s-secnet-node6 6.1.0-23-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.99-1 (2024-07-15) x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 1.8.2 (branch: HEAD, revision: f1e0e8360aa60b6cb5e5cc1560bed348fc2c1895)
  build user:       root@e8029641b208
  build date:       20240806-20:45:43
  go version:       go1.21.13
  platform:         linux/amd64
  tags:             unknown

node_exporter command line flags

--path.procfs=/host/proc --path.sysfs=/host/sys --web.listen-address=0.0.0.0:9100 --collector.qdisc     

node_exporter log output

not relevant

Are you running node_exporter in Docker?

yes, we have correctly exposed the host /proc and /sys.

What did you do that produced an error?

We enabled the qdisc collector to correlate a networking issue with packet drops in an eBPF program, but it turns out that node_exporter only exposes metrics for qdiscs that have no parent. With tc -s qdisc show we see a lot of packet drops on eBPF qdiscs (type clsact) that have a parent qdisc defined. Because node_exporter does not expose these, it is very hard to correlate our networking issues with the packet drops there. Looking at the implementation, this is expected: node_exporter skips all qdiscs that have a parent.

See https://github.com/prometheus/node_exporter/blob/b9d0932179a0c5b3a8863f3d6cdafe8584cedc8e/collector/qdisc_linux.go#L151
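
For readers who do not want to open the link, the effect of that check can be sketched roughly as follows. This is a minimal illustration, not the collector's actual code: the qdiscInfo struct, the rootOnly helper and the sample values are hypothetical stand-ins, and only the "skip anything with a non-zero parent" rule is taken from the description above.

```go
package main

import "fmt"

// qdiscInfo is a hypothetical, trimmed-down stand-in for the per-qdisc
// record the collector works with; only the fields needed here are kept.
type qdiscInfo struct {
	IfaceName string
	Kind      string
	Parent    uint32 // 0 means the qdisc has no parent
	Drops     uint32
}

// rootOnly mimics the behaviour described in this issue: any qdisc with a
// non-zero parent is skipped, so its counters are never exported.
func rootOnly(all []qdiscInfo) []qdiscInfo {
	var kept []qdiscInfo
	for _, q := range all {
		if q.Parent != 0 {
			continue // child qdisc: dropped from the output
		}
		kept = append(kept, q)
	}
	return kept
}

func main() {
	// Made-up sample data for illustration only.
	all := []qdiscInfo{
		{IfaceName: "eth0", Kind: "mq", Parent: 0, Drops: 0},
		{IfaceName: "eth0", Kind: "clsact", Parent: 0xfffffff1, Drops: 1234},
	}
	for _, q := range rootOnly(all) {
		fmt.Printf("%s %s drops=%d\n", q.IfaceName, q.Kind, q.Drops)
	}
}
```

In this sketch the clsact entry, which carries all the drops, never reaches the output, which matches what we observe on our hosts.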

What did you expect to see?

I expected to see metrics for all qdiscs on the host, and to let users worry about possible cardinality issues. This collector is disabled by default anyway.

What did you see instead?

I saw only metrics for the root qdisc, which in this case is not that relevant.

Possible fixes

discordianfish commented 2 weeks ago

Yeah, dunno why we only expose the root level, probably to ensure the metrics can be summed up without separating root and child queues. Any suggestions on how to handle this?
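
One way to picture the summing concern is to key each sample by its parent, so that root and child queues can still be totalled separately. The sketch below is only an illustration under assumptions: the sample struct, the parentLabel and sumDropsByParent helpers, the "none" label and the numbers are all hypothetical, and nothing here is an existing node_exporter label or flag.

```go
package main

import "fmt"

// sample is a hypothetical per-qdisc reading used only for this sketch.
type sample struct {
	Device string
	Kind   string
	Parent uint32 // 0 means no parent
	Drops  uint64
}

// parentLabel renders a parent handle as hex "major:minor", similar to how
// tc prints handles, or "none" when the qdisc has no parent.
func parentLabel(parent uint32) string {
	if parent == 0 {
		return "none"
	}
	return fmt.Sprintf("%x:%x", parent>>16, parent&0xffff)
}

// sumDropsByParent totals drops per parent label, so parentless and child
// queues stay in separate buckets instead of being added together.
func sumDropsByParent(samples []sample) map[string]uint64 {
	totals := make(map[string]uint64)
	for _, s := range samples {
		totals[parentLabel(s.Parent)] += s.Drops
	}
	return totals
}

func main() {
	// Made-up numbers for illustration only.
	samples := []sample{
		{Device: "eth0", Kind: "mq", Parent: 0, Drops: 10},
		{Device: "eth1", Kind: "mq", Parent: 0, Drops: 5},
		{Device: "eth0", Kind: "clsact", Parent: 0xfffffff1, Drops: 1234},
	}
	totals := sumDropsByParent(samples)
	fmt.Println("no parent:", totals["none"])
	fmt.Println("parent ffff:fff1:", totals["ffff:fff1"])
}
```

With a distinguishing key like this, a sum restricted to parentless qdiscs would give the same totals as today, while child queues remain available for anyone who needs them.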