prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0

high io load after running for a while #70

Closed: camerondavison closed this issue 9 years ago

camerondavison commented 9 years ago

I have been running node_exporter version 0.8.0 for a while now, maybe 2 weeks, and I am now seeing a pretty high IO load. If I run iotop I can see it reading megabytes of data in a matter of seconds; it is basically using 99% of the IO on the system.

If I run lsof -p $PID then I see a lot of open file descriptors to /proc/stat, /proc/$PID/limits, /proc/meminfo, and /proc/$PID/stat.

In case it helps, here is the output of sudo lsof -p 25359 | awk '{print $9}' | sort | uniq -c | sort -n:

      1 *:7301
      1 /dev/null
      1 /proc/diskstats
      1 /tmp/node_exporter.ip-XXXX.root.log.INFO.20150501-131008.25359
      1 /usr/lib64/ld-2.19.so
      1 /usr/lib64/libc-2.19.so
      1 /usr/lib64/libnss_files-2.19.so
      1 /usr/lib64/libpthread-2.19.so
      1 /var/lib/prometheus/node_exporter
      1 NAME
      1 anon_inode
      1 ip-XXXX.compute.internal:7301->ip-XXXX.compute.internal:49115
      2 /
      2 /proc/25359/mounts
      2 /proc/25359/net/dev
      2 socket
     12 /proc/25359/limits
     17 /proc/stat
     19 /proc/25359/stat
     20 /proc/meminfo
     47 can't identify protocol
camerondavison commented 9 years ago

After restarting it, everything looks rather normal again. I tried to do a kill -QUIT to get a stack trace, but the trace itself was printing incredibly slowly, and then I think my systemd unit killed and restarted the process partway through, so I did not get anything.

Before I restarted it, I ran strace -f -p $PID and it was only printing:

Process 25359 attached with 7 threads
[pid  5361] futex(0xc2085680d8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 23093] futex(0xc20837c458, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 25364] futex(0xc208021258, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 25363] futex(0xd18520, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 25359] futex(0xd0f378, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {1631679, 472159297}) = 0
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {1631679, 472233554}) = 0
[pid 25362] clock_gettime(CLOCK_REALTIME, {1431543852, 511458926}) = 0
[pid 25362] epoll_wait(5, {}, 128, 0)   = 0
[pid 25362] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {1631680, 423690670}) = 0
[pid 25362] futex(0xd0e7a0, FUTEX_WAIT, 0, {60, 0}) = -1 ETIMEDOUT (Connection timed out)
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {1631741, 549970095}) = 0
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {1631741, 816163981}) = 0
[pid 25362] clock_gettime(CLOCK_REALTIME, {1431543914, 227795643}) = 0
[pid 25362] epoll_wait(5, {}, 128, 0)   = 0
[pid 25362] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {1631742, 140082541}) = 0
[pid 25362] futex(0xd0e7a0, FUTEX_WAIT, 0, {60, 0}) = -1 ETIMEDOUT (Connection timed out)
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {1631803, 334995979}) = 0
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {1631803, 335140475}) = 0
[pid 25362] clock_gettime(CLOCK_REALTIME, {1431543975, 609309540}) = 0
[pid 25362] epoll_wait(5, {}, 128, 0)   = 0
[pid 25362] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {1631803, 521924470}) = 0
[pid 25362] futex(0xd0e7a0, FUTEX_WAIT, 0, {60, 0}

Once I restarted it, strace showed the expected open/read calls on the /proc file descriptors every 5 seconds, whenever my Prometheus server polled it for data.
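
For reference, a healthy collector re-opens and reads the relevant /proc files on every scrape, which is why the expected strace output is a short open/read/close burst each time Prometheus polls. A minimal Go sketch of that per-scrape pattern (illustrative only, not the actual node_exporter collector code):

package main

import (
	"fmt"
	"os"
)

// readProcStat mimics what a collector does on each scrape: open the file
// fresh, read it, and let it be closed again, so a healthy exporter shows a
// short open/read/close burst in strace on every poll.
func readProcStat() (string, error) {
	data, err := os.ReadFile("/proc/stat") // opens, reads, and closes the file
	if err != nil {
		return "", err
	}
	return string(data), nil
}

func main() {
	stat, err := readProcStat()
	if err != nil {
		fmt.Fprintln(os.Stderr, "reading /proc/stat:", err)
		os.Exit(1)
	}
	fmt.Print(stat)
}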

juliusv commented 9 years ago

Strange! Could you try to get the output of http://<host:port>/debug/pprof/goroutine?debug=2 next time this happens?
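
For context, the Go pprof debug handlers are served on the exporter's own HTTP port, which is why that URL works at all: importing net/http/pprof registers the /debug/pprof/* handlers on the default mux. A minimal sketch, assuming the listener uses http.DefaultServeMux (as the stack trace further down suggests):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// With a nil handler, ListenAndServe falls back to http.DefaultServeMux,
	// so /debug/pprof/goroutine?debug=2 is served on the same port as the
	// exporter's metrics endpoint.
	log.Fatal(http.ListenAndServe(":7301", nil))
}

Fetching http://localhost:7301/debug/pprof/goroutine?debug=2 then prints every goroutine together with its full stack.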

camerondavison commented 9 years ago

No good. That curl never completes; it just hangs.

camerondavison commented 9 years ago

I was able to send it a SIGQUIT this time. I am still waiting on the full output, but here is what I have so far: https://gist.github.com/a86c6f7964/f170cabd622f64c76d2e

juliusv commented 9 years ago

@a86c6f7964 We've been staring at this for a short while and are still not sure exactly what's going on, beyond some goroutines being slow or blocked for long stretches. Hanging goroutines could point to a deadlock bug, but that wouldn't really explain the high IO. Is the machine swapping when this happens? Do you happen to have process and/or machine memory metrics? (Actually, you should have both up until the point of total unresponsiveness, if a Prometheus was scraping that node_exporter.)

juliusv commented 9 years ago

Interesting metrics to look at for the node-exporter process itself, which should be in the Prometheus that scraped it, up until the point when the node exporter became unscrapable:

...and machine-related metrics that would be interesting for the host where the node exporter ran on:

camerondavison commented 9 years ago

Okay, I updated the gist with the full stack trace (it took almost 8 hours to generate :open_mouth:).

My Prometheus only keeps about 4 hours' worth of data (disk IO got really slow when I gave it more; we are running on an AWS t2.medium machine), so those metrics will be kind of hard to get at.

camerondavison commented 9 years ago

The usual load on this machine is next to nothing. I have the node exporter set at a memory limit of 50M in my systemd unit (do you think that it needs more?)

Looking through the logs, it looks like something happened around 'I0518 18:36:07' where things started getting slower:

I0518 18:35:17.344805   30572 node_exporter.go:98] OK: stat success after 0.000160s.
I0518 18:35:22.345006   30572 node_exporter.go:98] OK: stat success after 0.000162s.
I0518 18:35:27.347632   30572 node_exporter.go:98] OK: stat success after 0.000157s.
I0518 18:35:32.344479   30572 node_exporter.go:98] OK: stat success after 0.000167s.
I0518 18:35:37.346816   30572 node_exporter.go:98] OK: stat success after 0.000159s.
I0518 18:35:42.345793   30572 node_exporter.go:98] OK: stat success after 0.000169s.
I0518 18:35:47.345620   30572 node_exporter.go:98] OK: stat success after 0.000162s.
I0518 18:35:52.348205   30572 node_exporter.go:98] OK: stat success after 0.000185s.
I0518 18:35:57.344761   30572 node_exporter.go:98] OK: stat success after 0.000172s.
I0518 18:36:02.345716   30572 node_exporter.go:98] OK: stat success after 0.000157s.
I0518 18:36:07.598528   30572 node_exporter.go:98] OK: stat success after 0.010153s.
I0518 18:36:12.833998   30572 node_exporter.go:98] OK: stat success after 0.276153s.
I0518 18:36:17.852887   30572 node_exporter.go:98] OK: stat success after 0.251907s.
I0518 18:36:22.695873   30572 node_exporter.go:98] OK: stat success after 0.014766s.
I0518 18:36:27.657642   30572 node_exporter.go:98] OK: stat success after 0.016069s.
I0518 18:36:32.658599   30572 node_exporter.go:98] OK: stat success after 0.016960s.
I0518 18:36:37.653351   30572 node_exporter.go:98] OK: stat success after 0.199047s.
I0518 18:36:42.614556   30572 node_exporter.go:98] OK: stat success after 0.006177s.

Looking closer at that time range, I cannot tell what would have caused the slowdown. Things get really bad here:

I0518 20:35:56.787647   30572 node_exporter.go:98] OK: netdev success after 0.000339s.
I0518 20:35:56.787674   30572 node_exporter.go:98] OK: time success after 0.000003s.
I0518 20:35:56.787895   30572 node_exporter.go:98] OK: netstat success after 0.000201s.
I0518 20:35:56.788058   30572 node_exporter.go:98] OK: diskstats success after 0.000118s.
I0518 20:35:53.814827   30572 node_exporter.go:98] OK: time success after 0.000011s.
I0518 20:35:25.796216   30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 20:35:29.001793   30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 20:35:52.398907   30572 node_exporter.go:98] OK: time success after 0.000008s.
I0518 20:35:22.792778   30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 20:35:24.086548   30572 node_exporter.go:98] OK: time success after 0.000007s.
I0518 20:35:24.527268   30572 node_exporter.go:98] OK: time success after 0.000006s.
I0518 20:35:50.266417   30572 node_exporter.go:98] OK: filesystem success after 0.000287s.
I0518 20:35:25.796643   30572 node_exporter.go:98] OK: netdev success after 0.000371s.
I0518 20:46:53.967890   30572 node_exporter.go:98] OK: loadavg success after 84.566580s.
I0518 20:35:21.660582   30572 node_exporter.go:98] OK: netdev success after 0.000411s.
I0518 20:53:51.171998   30572 node_exporter.go:98] OK: loadavg success after 977.176057s.
I0518 20:46:54.267224   30572 node_exporter.go:98] OK: loadavg success after 40.160530s.
I0518 20:48:04.293429   30572 node_exporter.go:98] OK: loadavg success after 657.562015s.
I0518 20:47:58.196109   30572 node_exporter.go:98] OK: loadavg success after 699.850948s.
I0518 20:35:50.456721   30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 20:35:21.493450   30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 20:47:57.299693   30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 20:35:48.280044   30572 node_exporter.go:98] OK: time success after 0.000007s.
I0518 20:35:20.950916   30572 node_exporter.go:98] OK: time success after 0.000006s.
I0518 20:47:58.196097   30572 node_exporter.go:98] OK: netstat success after 699.830764s.
I0518 20:51:29.740396   30572 node_exporter.go:98] OK: netstat success after 828.727282s.
I0518 20:48:35.090167   30572 node_exporter.go:98] OK: loadavg success after 657.785694s.
I0518 20:53:21.403094   30572 node_exporter.go:98] OK: attributes success after 863.295223s.
I0518 21:09:17.576817   30572 node_exporter.go:98] OK: netstat success after 1889.755354s.
I0518 20:47:57.788156   30572 node_exporter.go:98] OK: loadavg success after 754.159794s.
I0518 20:35:25.386415   30572 node_exporter.go:98] OK: filesystem success after 0.000293s.
I0518 20:58:41.445939   30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 20:35:49.117876   30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 20:47:56.418723   30572 node_exporter.go:98] OK: time success after 270.894575s.
I0518 20:48:49.139520   30572 node_exporter.go:98] OK: time success after 0.000012s.
I0518 20:52:05.805535   30572 node_exporter.go:98] OK: loadavg success after 698.312665s.
I0518 20:50:58.722561   30572 node_exporter.go:98] OK: loadavg success after 699.930957s.
I0518 20:49:21.125827   30572 node_exporter.go:98] OK: loadavg success after 700.511116s.
I0518 21:00:37.710201   30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 20:35:19.374094   30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 21:02:47.132333   30572 node_exporter.go:98] OK: textfile success after 0.102544s.
I0518 21:02:36.343913   30572 node_exporter.go:98] OK: netstat success after 1433.712689s.
I0518 20:49:43.494054   30572 node_exporter.go:98] OK: loadavg success after 697.772662s.
I0518 21:01:17.995762   30572 node_exporter.go:98] OK: time success after 342.470954s.

I would imagine that something grabs a lock and then a whole bunch of goroutines pile up? That might even push memory usage really high, and then maybe it starts swapping. I will look into whether the systemd unit is swapping.
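
To make that hypothesis concrete: if one goroutine holds a shared lock while doing something slow (for example, writing to a destination that has become very slow), every new scrape that needs the same lock queues up behind it and goroutines accumulate faster than they drain. A minimal Go sketch of that failure mode (illustrative only, not node_exporter code):

package main

import (
	"fmt"
	"sync"
	"time"
)

// slowLogger imitates a component that holds a lock while writing to a slow
// destination; everything else that needs the lock piles up behind it.
type slowLogger struct {
	mu sync.Mutex
}

func (l *slowLogger) log(msg string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	time.Sleep(2 * time.Second) // stand-in for a write that has become very slow
	fmt.Println(msg)
}

func main() {
	l := &slowLogger{}
	var wg sync.WaitGroup
	// Scrapes arrive every 100ms while each log call takes 2s, so blocked
	// goroutines accumulate far faster than the lock is released.
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			l.log(fmt.Sprintf("scrape %d finished", n))
		}(i)
		time.Sleep(100 * time.Millisecond)
	}
	wg.Wait()
}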

juliusv commented 9 years ago

Yes, that could be the problem. A couple of us have stared at this a bit, but no conclusion yet :-/

camerondavison commented 9 years ago

Just FYI, I ran curl localhost:7301/debug/pprof/goroutine?debug=2 and got:

goroutine 2153562 [running]:
runtime/pprof.writeGoroutineStacks(0x7f1ba1fa80d8, 0xc2087579a0, 0x0, 0x0)
    /usr/local/go/src/runtime/pprof/pprof.go:511 +0x8d
runtime/pprof.writeGoroutine(0x7f1ba1fa80d8, 0xc2087579a0, 0x2, 0x0, 0x0)
    /usr/local/go/src/runtime/pprof/pprof.go:500 +0x4f
runtime/pprof.(*Profile).WriteTo(0xced2c0, 0x7f1ba1fa80d8, 0xc2087579a0, 0x2, 0x0, 0x0)
    /usr/local/go/src/runtime/pprof/pprof.go:229 +0xd5
net/http/pprof.handler.ServeHTTP(0xc2082076f1, 0x9, 0x7f1ba1f85030, 0xc2087579a0, 0xc208173380)
    /usr/local/go/src/net/http/pprof/pprof.go:169 +0x35f
net/http/pprof.Index(0x7f1ba1f85030, 0xc2087579a0, 0xc208173380)
    /usr/local/go/src/net/http/pprof/pprof.go:181 +0x15e
net/http.HandlerFunc.ServeHTTP(0x9a4068, 0x7f1ba1f85030, 0xc2087579a0, 0xc208173380)
    /usr/local/go/src/net/http/server.go:1265 +0x41
net/http.(*ServeMux).ServeHTTP(0xc20803ab10, 0x7f1ba1f85030, 0xc2087579a0, 0xc208173380)
    /usr/local/go/src/net/http/server.go:1541 +0x17d
net/http.serverHandler.ServeHTTP(0xc20805a600, 0x7f1ba1f85030, 0xc2087579a0, 0xc208173380)
    /usr/local/go/src/net/http/server.go:1703 +0x19a
net/http.(*conn).serve(0xc208757900)
    /usr/local/go/src/net/http/server.go:1204 +0xb57
created by net/http.(*Server).Serve
    /usr/local/go/src/net/http/server.go:1751 +0x35e

goroutine 1 [IO wait]:
net.(*pollDesc).Wait(0xc2080d0840, 0x72, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:84 +0x47
net.(*pollDesc).WaitRead(0xc2080d0840, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:89 +0x43
net.(*netFD).accept(0xc2080d07e0, 0x0, 0x7f1ba1f82dc0, 0xc2085c4ac0)
    /usr/local/go/src/net/fd_unix.go:419 +0x40b
net.(*TCPListener).AcceptTCP(0xc20804c1e8, 0x48f73e, 0x0, 0x0)
    /usr/local/go/src/net/tcpsock_posix.go:234 +0x4e
net/http.tcpKeepAliveListener.Accept(0xc20804c1e8, 0x0, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/http/server.go:1976 +0x4c
net/http.(*Server).Serve(0xc20805a600, 0x7f1ba1f84e78, 0xc20804c1e8, 0x0, 0x0)
    /usr/local/go/src/net/http/server.go:1728 +0x92
net/http.(*Server).ListenAndServe(0xc20805a600, 0x0, 0x0)
    /usr/local/go/src/net/http/server.go:1718 +0x154
net/http.ListenAndServe(0x7fff614d5f8e, 0x5, 0x0, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/http/server.go:1808 +0xba
main.main()
    /home/bjoern/gocode/src/github.com/prometheus/node_exporter/node_exporter.go:185 +0x73e

goroutine 7 [chan receive]:
github.com/golang/glog.(*loggingT).flushDaemon(0xd0e7c0)
    /home/bjoern/gocode/src/github.com/prometheus/node_exporter/.build/gopath/src/github.com/golang/glog/glog.go:879 +0x78
created by github.com/golang/glog.init·1
    /home/bjoern/gocode/src/github.com/prometheus/node_exporter/.build/gopath/src/github.com/golang/glog/glog.go:410 +0x2a7

goroutine 17 [syscall, 11494 minutes, locked to thread]:
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:2232 +0x1

goroutine 8 [syscall, 11494 minutes]:
os/signal.loop()
    /usr/local/go/src/os/signal/signal_unix.go:21 +0x1f
created by os/signal.init·1
    /usr/local/go/src/os/signal/signal_unix.go:27 +0x35

It is currently using 100MB of memory, whereas the consul_exporters I am running use only about 20MB. I feel like things are slowing down as well; scrape_duration_seconds is steadily climbing and is now at 2.2 seconds.

camerondavison commented 9 years ago

Hmm, process_resident_memory_bytes says 12MB-ish, and so does ps, but cAdvisor is telling me it is using all 100MB allocated for that cgroup.

lsof -p 3662

COMMAND    PID USER   FD   TYPE             DEVICE SIZE/OFF     NODE NAME
node_expo 3662 root  cwd    DIR              202,9     4096        2 /
node_expo 3662 root  rtd    DIR              202,9     4096        2 /
node_expo 3662 root  txt    REG              202,9 10594835       30 /var/lib/prometheus/node_exporter
node_expo 3662 root  mem    REG              202,3    51152    33867 /usr/lib64/libnss_files-2.19.so
node_expo 3662 root  mem    REG              202,3  1787096    33884 /usr/lib64/libc-2.19.so
node_expo 3662 root  mem    REG              202,3   109288    33865 /usr/lib64/libpthread-2.19.so
node_expo 3662 root  mem    REG              202,3   144720    33863 /usr/lib64/ld-2.19.so
node_expo 3662 root    0r   CHR                1,3      0t0     1036 /dev/null
node_expo 3662 root    1u  unix 0xffff88006ecfde80      0t0 31659765 socket
node_expo 3662 root    2u  unix 0xffff88006ecfde80      0t0 31659765 socket
node_expo 3662 root    3u   REG               0,30 92607463 31659776 /tmp/node_exporter.ip-172-31-17-127.root.log.INFO.20150520-162710.3662
node_expo 3662 root    4u  IPv6           31659778      0t0      TCP *:7301 (LISTEN)
node_expo 3662 root    5u  0000               0,10        0     6638 anon_inode
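
One way to reconcile the ~12MB RSS with the ~100MB the cgroup reports: under cgroup v1, page cache (such as the pages backing that growing log file in /tmp) is charged to the cgroup, and cAdvisor's usage figure includes that cache, while RSS does not. A sketch that splits the two values out of memory.stat; the cgroup path below is hypothetical and assumes cgroup v1:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	// Hypothetical cgroup v1 path for the node_exporter unit; adjust as needed.
	f, err := os.Open("/sys/fs/cgroup/memory/system.slice/node_exporter.service/memory.stat")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	// memory.stat reports page cache ("cache") separately from anonymous
	// memory ("rss"); a large cache value next to a small rss would match
	// the observed 12MB resident vs ~100MB cgroup usage.
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) == 2 && (fields[0] == "cache" || fields[0] == "rss") {
			fmt.Printf("%s: %s bytes\n", fields[0], fields[1])
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
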
camerondavison commented 9 years ago

Okay, crazy thing here: I truncated the log file with echo > /tmp/node_exporter.ip-172-31-17-127.root.log.INFO.20150520-162710.3662 and immediately got all of the memory back.

Also, scrape_duration_seconds went right back down to 0.1.

camerondavison commented 9 years ago

I am adding -logtostderr=true, and we shall see in a week whether this is solved.

beorn7 commented 9 years ago

Perhaps a problem with your /tmp? Disk full? Or /tmp on ramdisk (and also full)?

camerondavison commented 9 years ago

systemd was essentially saying that the log file seemed to be part of the memory accounting for the cgroup (not sure whether glog is doing something like mlocking the log file or not). I have not been able to reproduce it, but /tmp is mounted on an 8GB partition and the file was only 90MB or so, so the partition was not full. My unit file just looks like this:

[Unit]
Description=Prometheus Node Exporter Service
[Service]
TimeoutStartSec=10m
MemoryLimit=100M
Restart=always
RestartSec=30
ExecStart=-/var/lib/prometheus/node_exporter\
 -web.listen-address=:7301
juliusv commented 9 years ago

Oh... we should switch node_exporter (like everything else) from glog to https://github.com/prometheus/log, which only logs to stderr anyway. I'll do that, and I hope you can already confirm that the issue is gone when logging to stderr!
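
For comparison, glog's default behavior is to write log files under the system temp directory (hence the growing file in /tmp) and to mirror to stderr only when -logtostderr is set, whereas a stderr-only logger never builds up such a file. A minimal sketch using the standard library log package as a stand-in for prometheus/log (whose exact API is not shown in this thread):

package main

import (
	"log"
	"os"
)

func main() {
	// Everything goes to stderr, so no log file accumulates under /tmp the
	// way glog's default file-based logging does.
	logger := log.New(os.Stderr, "node_exporter: ", log.LstdFlags)
	logger.Println("OK: stat success")
}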

juliusv commented 9 years ago

https://github.com/prometheus/node_exporter/pull/79 to switch node_exporter to prometheus/log.

juliusv commented 9 years ago

Has this been resolved by the logging change?

camerondavison commented 9 years ago

It has not been a week yet, but I am now seeing memory stay very steady at < 20MB. I would say the log file was likely the problem. Thanks for working with me on this.

juliusv commented 9 years ago

Great to hear, thanks!