prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0

node-exporter retrying every second #1163

Open c0debreaker opened 5 years ago

c0debreaker commented 5 years ago

Today we had a Jenkins outage. During the outage I checked /var/log/messages and found it filling continuously with messages like these:

Nov 18 14:55:36 jenkinscid-127_0_0_1 node-exporter[998]: 2018/11/18 14:55:36 http: Accept error: accept tcp [::]:9100: accept4: too many open files; retrying in 1s
Nov 18 14:55:37 jenkinscid-127_0_0_1 node-exporter[998]: 2018/11/18 14:55:37 http: Accept error: accept tcp [::]:9100: accept4: too many open files; retrying in 1s
Nov 18 14:55:38 jenkinscid-127_0_0_1 node-exporter[998]: 2018/11/18 14:55:38 http: Accept error: accept tcp [::]:9100: accept4: too many open files; retrying in 1s
Nov 18 14:55:39 jenkinscid-127_0_0_1 node-exporter[998]: 2018/11/18 14:55:39 http: Accept error: accept tcp [::]:9100: accept4: too many open files; retrying in 1s
Nov 18 14:55:40 jenkinscid-127_0_0_1 node-exporter[998]: 2018/11/18 14:55:40 http: Accept error: accept tcp [::]:9100: accept4: too many open files; retrying in 1s
Nov 18 14:55:41 jenkinscid-127_0_0_1 node-exporter[998]: 2018/11/18 14:55:41 http: Accept error: accept tcp [::]:9100: accept4: too many open files; retrying in 1s
Nov 18 14:55:42 jenkinscid-127_0_0_1 node-exporter[998]: 2018/11/18 14:55:42 http: Accept error: accept tcp [::]:9100: accept4: too many open files; retrying in 1s
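
A quick way to see how many descriptors the exporter itself is holding is to count the entries under /proc/<pid>/fd (PID 998 comes from the syslog lines above):

sudo ls /proc/998/fd | wc -l    # number of file descriptors currently open in node-exporter (PID 998)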

Then I ran lsof and sorted the output to find which processes held the most open files:

COUNT PID PROCESS
14247 998 node_expo
 544 12005 nginx
 496 10381 python
 304 898 amazon-ss
 240 633 gssproxy
 209 1014 collectd
 144 23672 sshd
 144 21935 sshd
 140 23666 sshd
 140 21930 sshd
 102 511 sssd_be
  99 10417 python
  99 10410 python
  99 10408 python
  99 10402 python
  99 10395 python
  99 10393 python
  97 10390 python
  96 470 sssd
  95 603 sssd_nss
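
The per-process counts above can be produced with a pipeline along these lines (an approximate reconstruction, not necessarily the exact command used):

sudo lsof -n 2>/dev/null |
  awk 'NR>1 {print $2, $1}' |     # PID and (truncated) command name
  sort | uniq -c | sort -rn | head -20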

We're using AWS EFS, so I thought our NFS mount had gone away. However, cat /proc/mounts still showed it mounted, and I could still netcat to port 2049. Even so, I was unable to run ls /var/lib/jenkins; only after stopping the Jenkins service could I access /var/lib/jenkins again.
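
For reference, those checks looked roughly like this (the EFS hostname is a placeholder, and wrapping ls in timeout is an extra precaution so a hung NFS mount doesn't block the shell):

EFS_HOST=fs-xxxxxxxx.efs.us-east-1.amazonaws.com   # placeholder: your EFS mount target DNS name
grep nfs /proc/mounts                # is the EFS export still listed?
nc -zv "$EFS_HOST" 2049              # is the NFS port still reachable?
timeout 5 ls /var/lib/jenkins        # exits with status 124 if the mount is hung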

Kernel is 4.16.11-100.fc26.x86_64 #1 SMP Tue May 22 20:02:12 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Output of cat /proc/version is

Linux version 4.16.11-100.fc26.x86_64 (mockbuild@bkernel02.phx2.fedoraproject.org) (gcc version 7.3.1 20180130 (Red Hat 7.3.1-2) (GCC)) #1 SMP Tue May 22 20:02:12 UTC 2018

node_exporter is

node_exporter, version 0.16.0 (branch: HEAD, revision: d42bd70f4363dced6b77d8fc311ea57b63387e4f)
  build user:       root@a67a9bc13a69
  build date:       20180515-15:52:42
  go version:       go1.9.6

ulimit -a

core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 62211
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 62211
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
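
One caveat: ulimit -a above reflects the interactive shell. If node-exporter was started as a service, the limit that actually applies is the one on the process itself, which can be read from /proc:

grep 'open files' /proc/998/limits   # soft/hard fd limits applied to node-exporter (PID 998)
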
discordianfish commented 5 years ago

This is probably because some collectors didn't return and kept fds open. Do you have an NFS mount on that system? I suspect this is the same as #244.
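
node_exporter also reports its own descriptor usage through the standard process_* metrics on its /metrics endpoint (default :9100, as in the log above), which makes this kind of leak visible before the accept loop starts failing:

curl -s http://localhost:9100/metrics | grep -E '^process_(open|max)_fds'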

c0debreaker commented 5 years ago

Yes, we do have NFS mounted. I also noticed our open files limit is 1024. Could that be the cause? I changed it to 65536 for now. I'm not sure whether it will fix the issue, but so far things are still working and I haven't noticed Jenkins freezing.
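
If node-exporter runs under systemd (likely, given the node-exporter[998] syslog tag, though the unit name below is a guess), the open-files limit has to be raised on the unit itself rather than via shell ulimits, e.g. with a drop-in:

sudo mkdir -p /etc/systemd/system/node-exporter.service.d
sudo tee /etc/systemd/system/node-exporter.service.d/limits.conf >/dev/null <<'EOF'
[Service]
LimitNOFILE=65536
EOF
sudo systemctl daemon-reload
sudo systemctl restart node-exporter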