prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0

too many open files #627

Closed: niravpat closed this issue 5 years ago

niravpat commented 7 years ago

Hello, I am seeing "too many open files" errors in the logs.

start_node_exporter.sh[59668]: 2017/07/26 11:47:37 http: Accept error: accept tcp [::]:9100: accept4: too many open files; retrying in 1s
start_node_exporter.sh[59668]: 2017/07/26 11:47:38 http: Accept error: accept tcp [::]:9100: accept4: too many open files; retrying in 1s
start_node_exporter.sh[59668]: 2017/07/26 11:47:39 http: Accept error: accept tcp [::]:9100: accept4: too many open files; retrying in 1s
start_node_exporter.sh[59668]: 2017/07/26 11:47:40 http: Accept error: accept tcp [::]:9100: accept4: too many open files; retrying in 1s
start_node_exporter.sh[59668]: 2017/07/26 11:47:41 http: Accept error: accept tcp [::]:9100: accept4: too many open files; retrying in 1s
start_node_exporter.sh[59668]: 2017/07/26 11:47:42 http: Accept error: accept tcp [::]:9100: accept4: too many open files; retrying in 1s

Using lsof showed that it's consuming 1030 file descriptors (lsof output attached). The soft limit is 1024. I know that's too low, but should node_exporter be using that many file descriptors? I have seen node_exporter on other nodes take up fewer than 30 file descriptors. Please see the attached lsof output; I see lots of open sockets in it. The netstat output showed lots of TCP connections in CLOSE_WAIT state, coming from different Prometheus servers.
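For reference, a minimal sketch of how the descriptor usage can be confirmed (assuming a standard Linux /proc layout and that pidof finds the exporter process):

# count open file descriptors held by the node_exporter process
ls /proc/$(pidof node_exporter)/fd | wc -l
# check the limits the running process actually has
grep 'open files' /proc/$(pidof node_exporter)/limits
# list connections stuck in CLOSE_WAIT on the metrics port
ss -tn state close-wait '( sport = :9100 )'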

I was wondering if this was similar to #244, but we are not using any NFS. I took a goroutine dump (attached), which shows lots of threads stuck at

and

mdlayher commented 7 years ago

The other thing that jumps out at me is lots of godbus connection goroutines in IO wait. What happens if you run node_exporter without the systemd collector?
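A sketch of how to do that; the exact flag depends on the node_exporter version (0.14.x takes a single -collectors.enabled list, while 0.15+ makes systemd an explicit opt-in), and the collector list shown is illustrative:

# 0.14.x: enumerate the collectors you want and leave systemd out
./node_exporter -collectors.enabled="diskstats,filesystem,loadavg,meminfo,netdev,stat,time"
# 0.15 and later: systemd stays disabled unless --collector.systemd is passed
./node_exporter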

niravpat commented 7 years ago

Yes, we are using the systemd collector. I will have to let it run for a few days without the systemd collector and see. But then again, I have node_exporter running on a bunch of other nodes WITH the systemd collector, and not all of them have this issue. So even if I take out systemd and the issue doesn't reproduce, we won't know for sure that taking out systemd made it work.

niravpat commented 7 years ago

I am not able to reproduce this either way. Any suggestions?

roadrunner commented 7 years ago

Same here, the server was burning like hell. I had to kill and disable the node_exporter service because the load was around ~1K.

Aug 15 20:02:05 fr5 smartd[25441]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 61 to 62
Aug 15 20:02:05 fr5 smartd[25441]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 61 to 62
Aug 15 20:04:18 fr5 systemd[1]: nodeexporter.service: Service hold-off time over, scheduling restart.
Aug 15 20:04:18 fr5 systemd[1]: Stopped Prometheus node exporter service.
Aug 15 20:04:18 fr5 systemd[1]: Started Prometheus node exporter service.
Aug 15 20:04:18 fr5 systemd[8840]: nodeexporter.service: Failed at step EXEC spawning /opt/prometheus/node_exporter-0.14.0.linux-amd64/node_exporter: No such file or directory
Aug 15 20:04:18 fr5 systemd[1]: nodeexporter.service: Main process exited, code=exited, status=203/EXEC
Aug 15 20:04:18 fr5 systemd[1]: nodeexporter.service: Unit entered failed state.
Aug 15 20:04:18 fr5 systemd[1]: nodeexporter.service: Failed with result 'exit-code'.
^C

root@fr5:~# w
 21:02:32 up 113 days,  4:21,  1 user,  load average: 1018.17, 1000.14, 1042.58
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
root@fr5:~# w
 21:04:19 up 113 days,  5:01,  1 user,  load average: 48.17, 75.14, 276.58
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
root@fr5:~# htop
root@fr5:~# ^C
root@fr5:~# w
 21:06:09 up 113 days,  5:03,  1 user,  load average: 12.05, 53.59, 246.25
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
roieyo commented 7 years ago

same here:

time="2017-08-16T09:26:30Z" level=error msg="ERROR: stat collector failed after 0.000006s: open /proc/stat: too many open files" source="node_exporter.go:91"
time="2017-08-16T09:26:30Z" level=error msg="ERROR: netstat collector failed after 0.000007s: couldn't get netstats: open /proc/net/netstat: too many open files" source="node_exporter.go:91"
time="2017-08-16T09:26:30Z" level=error msg="ERROR: sockstat collector failed after 0.000007s: couldn't get sockstats: open /proc/net/sockstat: too many open files" source="node_exporter.go:91"
time="2017-08-16T09:26:30Z" level=error msg="ERROR: filesystem collector failed after 0.000006s: open /proc/mounts: too many open files" source="node_exporter.go:91"
time="2017-08-16T09:26:30Z" level=error msg="ERROR: meminfo collector failed after 0.000008s: couldn't get meminfo: open /proc/meminfo: too many open files" source="node_exporter.go:91"
time="2017-08-16T09:26:30Z" level=error msg="ERROR: netdev collector failed after 0.000007s: couldn't get netstats: open /proc/net/dev: too many open files" source="node_exporter.go:91"
time="2017-08-16T09:26:30Z" level=error msg="ERROR: vmstat collector failed after 0.000006s: open /proc/vmstat: too many open files" source="node_exporter.go:91"
time="2017-08-16T09:26:30Z" level=error msg="Error reading textfile collector directory /tmp: open /tmp: too many open files" source="textfile.go:78"
2017/08/16 09:26:40 http: Accept error: accept tcp [::]:9100: accept4: too many open files; retrying in 5ms
time="2017-08-16T09:26:40Z" level=error msg="ERROR: stat collector failed after 0.000015s: open /proc/stat: too many open files" source="node_exporter.go:91"
time="2017-08-16T09:26:40Z" level=error msg="ERROR: meminfo collector failed after 0.000012s: couldn't get meminfo: open /proc/meminfo: too many open files" source="node_exporter.go:91"
time="2017-08-16T09:26:40Z" level=error msg="ERROR: netdev collector failed after 0.000008s: couldn't get netstats: open /proc/net/dev: too many open files" source="node_exporter.go:91"
time="2017-08-16T09:26:40Z" level=error msg="ERROR: vmstat collector failed after 0.000007s: open /proc/vmstat: too many open files" source="node_exporter.go:91"
time="2017-08-16T09:26:40Z" level=error msg="ERROR: filefd collector failed after 0.000008s: couldn't get file-nr: open /proc/sys/fs/file-nr: too many open files" source="node_exporter.go:91"
time="2017-08-16T09:26:40Z" level=error msg="ERROR: filesystem collector failed after 0.000006s: open /proc/mounts: too many open files" source="node_exporter.go:91"
time="2017-08-16T09:26:40Z" level=error msg="ERROR: loadavg collector failed after 0.000011s: couldn't get load: open /proc/loadavg: too many open files" source="node_exporter.go:91"
time="2017-08-16T09:26:40Z" level=error msg="ERROR: diskstats collector failed after 0.000007s: couldn't get diskstats: open /proc/diskstats: too many open files" source="node_exporter.go:91"
time="2017-08-16T09:26:40Z" level=error msg="ERROR: entropy collector failed after 0.000010s: couldn't get entropy_avail: open /proc/sys/kernel/random/entropy_avail: too many open files" source="node_exporter.go:91"
time="2017-08-16T09:26:40Z" level=error msg="ERROR: hwmon collector failed after 0.000008s: open /sys/class/hwmon: too many open files" source="node_exporter.go:91"
time="2017-08-16T09:26:40Z" level=error msg="ERROR: mdadm collector failed after 0.000023s: error parsing mdstatus: error parsing mdstat: open /proc/mdstat: too many open files" source="node_exporter.go:91"

mdlayher commented 7 years ago

Without knowing which collectors you are running, it's impossible to pin down any problems.

As the OP provided, goroutine stack traces and lsof output would be more useful than log entries.
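A sketch of how both can be captured, assuming pidof finds the exporter (note that SIGQUIT terminates the process after dumping the stacks):

# snapshot of open file descriptors
lsof -p "$(pidof node_exporter)" > node_exporter.lsof
# a Go binary dumps all goroutine stacks to stderr (and exits) on SIGQUIT
kill -QUIT "$(pidof node_exporter)"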

roadrunner commented 7 years ago

Yeah @mdlayher, with the help of lsof I found the issue causing an infinite loop for node_exporter.

There was a dead NFS mount point, and everything got back to normal after I removed it.

If you're having a similar issue, try to validate your mounts with cat /proc/mounts.
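For example, a quick way to probe for a dead mount (the path and the 5-second timeout are illustrative): stat on a stale NFS export hangs, so wrapping it in timeout flags the bad mount point.

# list NFS mounts currently known to the kernel
grep nfs /proc/mounts
# probe a mount point; a dead one hits the timeout instead of returning
timeout 5 stat -t /mnt/example || echo "mount looks dead: /mnt/example"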

niravpat commented 7 years ago

@roadrunner Interesting, so your issue seems to be similar to #244. What did you see in lsof that pointed you to the dead NFS mount? Can you share your lsof output?

roadrunner commented 7 years ago

@niravpat I would happily share if there were any, but when I run lsof, it hangs immediately without any output.

niravpat commented 7 years ago

@roadrunner My issue appears to be slightly different though. We are not using any NFS. I was able to get lsof output, but all I see is a high number of open sockets.

austin987 commented 6 years ago

I'm able to reproduce this on several machines when running vegeta with 1000 connections.
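A sketch of that kind of load test, assuming vegeta's attack flags and an illustrative target URL and rate:

# hammer the metrics endpoint with up to 1000 open connections
echo "GET http://target-host:9100/metrics" | vegeta attack -connections=1000 -rate=200 -duration=120s | vegeta report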

node_exporter ran for a week before crashing (node_exporter 0.16.0, using the provided amd64 Linux binaries), running on Gentoo without systemd.

node-too-many-files.txt.gz

discordianfish commented 6 years ago

Sorry for the late response!

@niravpat / @austin987 Can you provide/attach the output of lsof? It's hard to say what is causing this beyond the known issue in #244.

discordianfish commented 5 years ago

I assume it's been caused by stale collectors as described in #244, so closing this in favor of #244.
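For anyone landing here: the mitigation discussed in #244 is to keep the filesystem collector away from the problematic mount points. A sketch, assuming the pre-1.0 flag name (newer releases renamed it to --collector.filesystem.mount-points-exclude) and an illustrative regex:

# skip pseudo filesystems and the stale NFS mount so the collector cannot hang on them
./node_exporter --collector.filesystem.ignored-mount-points="^/(sys|proc|dev|mnt/deadnfs)($|/)"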