sepich opened this issue 4 years ago (status: Open)
Are there any errors in the log? We considered timeouts before: https://github.com/prometheus/node_exporter/issues/244 but decided against it for the kind of issues described there. Wondering what is going on here. This is the first report I see about this, so would like to understand why it takes so long first.
FWIW, I see the same long durations for netclass on some of our Kubernetes clusters (GKE). No errors in the node_exporter logs. Affected clusters are running 1.0.0 on Google COS, kernel 4.19.112+. Unaffected clusters are running 0.18.0 on Google COS, kernel 4.14.174+. I tried bumping one cluster to 1.0.0, and while the average collection time increased from 10 ms to 15 ms, it isn't exactly timing out...
I had a look in /sys/class/net/ on one of the affected nodes, and there isn't an excessive number of interfaces, or anything that seems strange when accessing files in there.
We don't have any interesting configuration of the node_exporter in either affected or unaffected clusters, no filtering or such.
I tried building a minimal program that uses the same `sysfs` package to read netclass and time it. It takes a handful of milliseconds to run on one of the affected nodes, so it isn't anything obvious with the actual reading.
Noticed that our containers were getting CPU throttled on the affected clusters. Removed the CPU limit to tune for actual usage (which seems higher than in previous releases). @sepich, do you have CPU limits set on your daemonset?
Are there any errors in the log?
Nothing unusual; I'll try to redeploy and check with --log.level=debug
do you have cpu limits set on your daemonset?
We know about CPU-limit-related issues and do not use limits. But the point of the original question is more about the generic case of dropping the whole scrape due to a single slow collector.
but decided against it for the kind of issues described there.
Thank you for providing the information! I've read https://github.com/prometheus/node_exporter/issues/244 and unfortunately do not see any good reasons there against context.WithTimeout for each individual collector.
Yes, that would leak goroutines in the case of a stuck NFS mount. But that's not the only issue that could happen, as described above.
I do not suggest hiding the problematic NFS mount by returning the metrics; in such a case I suggest just setting node_scrape_collector_success=0 for the collector in question (or something like that).
For example, in the case of Prometheus and a connection timeout to some target, only that one target loses data, and an additional metric (up=0) is set to show the issue. This connection timeout does not bring the whole Prometheus down, nor does it affect the scraping of its other targets.
Wouldn't this timeout improve the overall reliability of node_exporter?
I'm not against adding timeouts to collectors in general. @SuperQ @pgier wdyt? Also, it would help if we had more detailed collector duration metrics in these cases.
I am having a problem setting up my node_exporter, it says:

```
Oct 07 09:05:28 ip-172-26-5-229 node_exporter[10227]: time="2020-10-07T09:05:28Z" level=error msg="ERROR: diskstats collector failed a
Oct 07 09:05:33 ip-172-26-5-229 node_exporter[10227]: time="2020-10-07T09:05:33Z" level=error msg="ERROR: diskstats collector failed a
Oct 07 09:05:38 ip-172-26-5-229 node_exporter[10227]: time="2020-10-07T09:05:38Z" level=error msg="ERROR: diskstats collector failed a
```

(the same truncated line repeats every five seconds through 09:06:13)
Did I do something wrong, or is it a problem with my OS? I use Debian GNU/Linux 10 (buster). Please help me.
@zoeldjian This looks like an unrelated issue; best to ask on the mailing list unless you are sure it's a bug in the node_exporter.
@discordianfish WRT https://github.com/prometheus/node_exporter/issues/1841#issuecomment-697372257, do you think we should add timeouts for all collectors? The timeouts would be specific to any behaviors (and would help us curb them) that would otherwise create latency for the entire metrics backend.
I'm +1 for it, and I can send out a PR for the same if this is a welcome change.
@rexagod Yeah I was just mentioning this in #2960 - sounds like a good idea.
Host operating system: output of `uname -a`
Linux 4.4.207-1.el7.elrepo.x86_64 #1 SMP Sat Dec 21 08:00:19 EST 2019 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of `node_exporter --version`
prom/node-exporter:v1.0.1
node_exporter command line flags
Are you running node_exporter in Docker?
Yes, in k8s as a DaemonSet
What did you do that produced an error?
We're using `scrape_interval: 15s` and `scrape_timeout: 15s` on the Prometheus side, and noticed that some nodes have holes in their graphs. This turns out to be due to large scrape times from the `bonding` and `netclass` collectors, as shown by `node_scrape_collector_duration_seconds`. Sometimes the durations are even larger (screenshots omitted). If we disable these collectors, the holes disappear (on the graphs above, after 17:30).
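For reference, per-collector durations like these can be inspected fleet-wide with a query along these lines (the 1-second threshold is an arbitrary example, not a recommended value):

```promql
topk(5, max by (instance, collector) (node_scrape_collector_duration_seconds)) > 1
```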
What did you expect to see?
`Bonding` collector metrics are very valuable for us. Currently we have to produce the same metrics via the textfile collector and a custom script. Would it be possible to add some configurable timeout to node_exporter, so that at least the metrics which are ready would be returned instead of failing the whole scrape? In that case the collector should probably also set `node_scrape_collector_success=0` so as not to hide the issue. Thank you.