prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0

Collector netclass/bonding leads to scrape timeouts #1841

Open sepich opened 4 years ago

sepich commented 4 years ago

Host operating system: output of uname -a

Linux 4.4.207-1.el7.elrepo.x86_64 #1 SMP Sat Dec 21 08:00:19 EST 2019 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

prom/node-exporter:v1.0.1

node_exporter command line flags

            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
            - --path.rootfs=/rootfs
            - --collector.netclass.ignored-devices=^(lo|docker[0-9]|kube-ipvs0|dummy0|kube-dummy-if|veth.+|br\-.+|cali\w{11}|tunl0|tun\-.+)$
            - --collector.netdev.device-blacklist=^(lo|docker[0-9]|kube-ipvs0|dummy0|kube-dummy-if|veth.+|br\-.+|cali\w{11}|tunl0|tun\-.+)$
            - --collector.filesystem.ignored-mount-points=^/(dev|sys|proc|host|etc|var/lib/kubelet|var/lib/docker/.+|home/.+|data/local-pv/.+)($|/)
            - --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|efivarfs|tmpfs|nsfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rootfs|rpc_pipefs|securityfs|sysfs|tracefs)$
            - --collector.diskstats.ignored-devices=^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p|dm-|sr|nbd)\d+$
            - --collector.netstat.fields=^(.*_(InErrors|InErrs)|Ip_Forwarding|Ip(6|Ext)_(InOctets|OutOctets)|Icmp6?_(InMsgs|OutMsgs)|TcpExt_(Listen.*|Syncookies.*|TCPSynRetrans|TCPRcvCollapsed|PruneCalled|RcvPruned)|Tcp_(ActiveOpens|InSegs|OutSegs|PassiveOpens|RetransSegs|CurrEstab)|Udp6?_(InDatagrams|OutDatagrams|NoPorts|RcvbufErrors|SndbufErrors))$
            - --no-collector.systemd
            - --no-collector.bcache
            - --no-collector.infiniband
            - --no-collector.wifi
            - --no-collector.ipvs

Are you running node_exporter in Docker?

Yes, in k8s as a DaemonSet

What did you do that produced an error?

We're using scrape_interval: 15s and scrape_timeout: 15s on the Prometheus side and noticed that some nodes have holes in their graphs (screenshot attached). This turns out to be due to the large scrape time of the bonding and netclass collectors, as shown by node_scrape_collector_duration_seconds (screenshot attached). Sometimes it is even like this:

# time curl -s localhost:9100/metrics >/dev/null

real    0m42.589s
user    0m0.003s
sys 0m0.005s

If we disable these collectors:

            - --no-collector.bonding
            - --no-collector.netclass

then the holes disappear (on the graphs above, after 17:30).
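
(Side note: since node_scrape_collector_duration_seconds is exported per collector, a query along the lines of topk(5, node_scrape_collector_duration_seconds{instance="<node>:9100"}) should show which collectors dominate the scrape time; adjust the selector to your own setup.)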

What did you expect to see?

Bonding collector metrics are very valuable for us; currently we have to produce the same metrics via the textfile collector and a custom script. Would it be possible to add a configurable timeout to node_exporter, so that at least the metrics which are ready are returned instead of the whole scrape failing? In that case the affected collectors should probably also set node_scrape_collector_success=0, so the issue is not hidden. Thank you.

discordianfish commented 4 years ago

Are there any errors in the log? We considered timeouts before in https://github.com/prometheus/node_exporter/issues/244 but decided against them for the kind of issues described there. I'm wondering what is going on here. This is the first report I've seen about this, so I'd like to understand why it takes so long first.

carlpett commented 4 years ago

FWIW I see the same long durations for netclass on some of our Kubernetes clusters (GKE). No errors in the node_exporter logs. Affected clusters are running 1.0.0 on Google COS, kernel 4.19.112+. Unaffected clusters are running 0.18.0 on Google COS, kernel 4.14.174+. I tried bumping one cluster to 1.0.0, and while the average collection time increased from 10 ms to 15 ms, it isn't exactly timing out...

I had a look in /sys/class/net/ on one of the affected nodes, and there isn't an excessive number of interfaces, nor anything that seems strange when accessing files in there.

We don't have any interesting configuration of the node_exporter in either affected or unaffected clusters, no filtering or such.

carlpett commented 4 years ago

I tried building a minimal program that uses the same sysfs package to read netclass and time it. It takes a handful of milliseconds to run on one of the nodes that is affected, so it isn't anything obvious with the actual reading.
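
For reference, a minimal sketch of that kind of test program, assuming the github.com/prometheus/procfs/sysfs package that the netclass collector uses under the hood; the mount point and output format are just for illustration:

    package main

    import (
        "fmt"
        "log"
        "time"

        "github.com/prometheus/procfs/sysfs"
    )

    func main() {
        // Point this at /host/sys when running inside the node_exporter container.
        fs, err := sysfs.NewFS("/sys")
        if err != nil {
            log.Fatal(err)
        }

        start := time.Now()
        nc, err := fs.NetClass() // walks every interface under <mount>/class/net
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("read %d interfaces in %s\n", len(nc), time.Since(start))
    }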

carlpett commented 4 years ago

Noticed that our containers were getting CPU throttled on the affected clusters. Removed the CPU limit to tune for actual usage (which seems higher than in previous releases). @sepich, do you have CPU limits set on your DaemonSet?
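
(If cAdvisor metrics are available, a ratio along the lines of rate(container_cpu_cfs_throttled_periods_total[5m]) / rate(container_cpu_cfs_periods_total[5m]) for the node-exporter container is one way to check for throttling; the exact label names and values depend on the cAdvisor version and how the DaemonSet is named.)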

sepich commented 4 years ago

Are there any errors in the log?

Nothing unusual. I'll try to redeploy and check the --log.level=debug logs.

do you have cpu limits set on your daemonset?

We know about the CPU-limit-related issues and do not use limits. But the point of the original question is more about the generic case of dropping the whole scrape due to a single collector timing out.

but decided against it for the kind of issues described there.

Thank you for providing the information! I've read https://github.com/prometheus/node_exporter/issues/244 and unfortunately do not see any good reasons there against a context.WithTimeout for each individual collector. Yes, that would leak goroutines in the case of a stuck NFS mount. But that is not the only issue that can happen, as shown above. I'm not suggesting hiding a problematic NFS mount by still returning its metrics; in such a case I suggest just setting node_scrape_collector_success=0 for the collector in question (or something like that).
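
To make the suggestion concrete, here is a rough sketch; this is not node_exporter code, and the helper name, default duration and exact error handling are assumptions for discussion only:

    package collector

    import (
        "context"
        "errors"
        "time"

        "github.com/prometheus/client_golang/prometheus"
    )

    // Collector mirrors node_exporter's per-collector interface.
    type Collector interface {
        Update(ch chan<- prometheus.Metric) error
    }

    // ErrTimeout would be reported by the caller as
    // node_scrape_collector_success=0 (or a dedicated metric).
    var ErrTimeout = errors.New("collector timed out")

    // updateWithTimeout is a hypothetical helper: it runs Update in a
    // goroutine and gives up after d. The goroutine can leak if the collector
    // is stuck on e.g. an unresponsive NFS mount, which is the trade-off
    // discussed in #244.
    func updateWithTimeout(ctx context.Context, c Collector, ch chan<- prometheus.Metric, d time.Duration) error {
        ctx, cancel := context.WithTimeout(ctx, d)
        defer cancel()

        done := make(chan error, 1)
        go func() { done <- c.Update(ch) }()

        select {
        case err := <-done:
            return err
        case <-ctx.Done():
            return ErrTimeout
        }
    }

On timeout the caller would mark only that collector as failed and still serve whatever the other collectors produced.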

For example, when Prometheus hits a connection timeout for some target, only that target loses data, and an additional metric (up=0) is set to expose the issue. The connection timeout neither brings the whole Prometheus down nor affects scraping of its other targets.

Wouldn't this timeout improve overall reliability of node_exporter?

discordianfish commented 3 years ago

I'm not against adding timeouts to collectors in general. @SuperQ @pgier wdyt? Also, it would help if we had more detailed collector duration metrics in these cases.

zoeldjian commented 3 years ago

I am having a problem setting up my node exporter; it keeps logging (messages truncated as captured):

    Oct 07 09:05:28 ip-172-26-5-229 node_exporter[10227]: time="2020-10-07T09:05:28Z" level=error msg="ERROR: diskstats collector failed a
    Oct 07 09:05:33 ip-172-26-5-229 node_exporter[10227]: time="2020-10-07T09:05:33Z" level=error msg="ERROR: diskstats collector failed a
    Oct 07 09:05:38 ip-172-26-5-229 node_exporter[10227]: time="2020-10-07T09:05:38Z" level=error msg="ERROR: diskstats collector failed a
    Oct 07 09:05:43 ip-172-26-5-229 node_exporter[10227]: time="2020-10-07T09:05:43Z" level=error msg="ERROR: diskstats collector failed a
    Oct 07 09:05:48 ip-172-26-5-229 node_exporter[10227]: time="2020-10-07T09:05:48Z" level=error msg="ERROR: diskstats collector failed a
    Oct 07 09:05:53 ip-172-26-5-229 node_exporter[10227]: time="2020-10-07T09:05:53Z" level=error msg="ERROR: diskstats collector failed a
    Oct 07 09:05:58 ip-172-26-5-229 node_exporter[10227]: time="2020-10-07T09:05:58Z" level=error msg="ERROR: diskstats collector failed a
    Oct 07 09:06:03 ip-172-26-5-229 node_exporter[10227]: time="2020-10-07T09:06:03Z" level=error msg="ERROR: diskstats collector failed a
    Oct 07 09:06:08 ip-172-26-5-229 node_exporter[10227]: time="2020-10-07T09:06:08Z" level=error msg="ERROR: diskstats collector failed a
    Oct 07 09:06:13 ip-172-26-5-229 node_exporter[10227]: time="2020-10-07T09:06:13Z" level=error msg="ERROR: diskstats collector failed a

Did I do something wrong, or is it a problem with my OS? I use "Debian GNU/Linux 10 (buster)".

Please help me

discordianfish commented 3 years ago

@zoeldjian This looks like an unrelated issue; best to ask on the mailing list unless you are sure it's a bug in the node_exporter.

rexagod commented 6 months ago

@discordianfish WRT https://github.com/prometheus/node_exporter/issues/1841#issuecomment-697372257, do you think we should add timeouts for all collectors? The timeouts would target (and help us curb) any behaviors that would otherwise create latency for the entire metrics endpoint.

I'm +1 for it, and I can send out a PR for the same if this is a welcome change.

discordianfish commented 6 months ago

@rexagod Yeah I was just mentioning this in #2960 - sounds like a good idea.