After restarting it, everything looks rather normal again. I tried to do a kill -QUIT to get a stack trace, but the trace itself was printing incredibly slowly, and then I think my systemd definition killed and restarted it in the middle, so I did not get anything.
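In case it is useful for a retry: when the exporter runs under systemd, stderr (where the Go runtime writes the SIGQUIT dump) ends up in the journal, so something like the following sketch might work; the unit name node_exporter.service is an assumption, not something from this thread.
# Send SIGQUIT so the Go runtime dumps all goroutine stacks to stderr.
kill -QUIT "$PID"
# Read the dump back from the journal (unit name is a guess).
journalctl -u node_exporter.service -e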
Before I restarted it, I ran strace -f -p $PID and it was only printing:
Process 25359 attached with 7 threads
[pid 5361] futex(0xc2085680d8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 23093] futex(0xc20837c458, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 25364] futex(0xc208021258, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 25363] futex(0xd18520, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 25359] futex(0xd0f378, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {1631679, 472159297}) = 0
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {1631679, 472233554}) = 0
[pid 25362] clock_gettime(CLOCK_REALTIME, {1431543852, 511458926}) = 0
[pid 25362] epoll_wait(5, {}, 128, 0) = 0
[pid 25362] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {1631680, 423690670}) = 0
[pid 25362] futex(0xd0e7a0, FUTEX_WAIT, 0, {60, 0}) = -1 ETIMEDOUT (Connection timed out)
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {1631741, 549970095}) = 0
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {1631741, 816163981}) = 0
[pid 25362] clock_gettime(CLOCK_REALTIME, {1431543914, 227795643}) = 0
[pid 25362] epoll_wait(5, {}, 128, 0) = 0
[pid 25362] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {1631742, 140082541}) = 0
[pid 25362] futex(0xd0e7a0, FUTEX_WAIT, 0, {60, 0}) = -1 ETIMEDOUT (Connection timed out)
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {1631803, 334995979}) = 0
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {1631803, 335140475}) = 0
[pid 25362] clock_gettime(CLOCK_REALTIME, {1431543975, 609309540}) = 0
[pid 25362] epoll_wait(5, {}, 128, 0) = 0
[pid 25362] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {1631803, 521924470}) = 0
[pid 25362] futex(0xd0e7a0, FUTEX_WAIT, 0, {60, 0}
Once I restarted it, it was printing the expected open/read calls on the file descriptors every 5 seconds whenever my Prometheus server polled it for data.
Strange! Could you try to get the output of http://<host:port>/debug/pprof/goroutine?debug=2 next time this happens?
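For reference, a minimal sketch of grabbing that dump from the shell; port 7301 is the listen address used elsewhere in this thread, and --max-time is only there so the command gives up if the exporter is wedged.
# Fetch the full goroutine dump from the pprof endpoint and save it.
curl --max-time 120 -o goroutines.txt "http://localhost:7301/debug/pprof/goroutine?debug=2"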
No good. I cannot make that curl; it just hangs.
I was able to send it a SIGQUIT this time. Still waiting on the full output, but here is what I have so far: https://gist.github.com/a86c6f7964/f170cabd622f64c76d2e
@a86c6f7964 We've been staring at this for a short while and are still not sure exactly what's going on, besides some goroutines being slow / blocked for long periods. Hanging goroutines could be some deadlock bug, but that wouldn't really explain the high IO. Is the machine swapping when this happens? Do you happen to have process and/or machine memory metrics? (Actually, you should have both up until the point of total unresponsiveness, if a Prometheus was scraping that node exporter.)
Interesting metrics to look at for the node-exporter process itself, which should be in the Prometheus that scraped it, up until the point when the node exporter became unscrapable:
- process_goroutines
- process_resident_memory_bytes and process_virtual_memory_bytes
- process_cpu_seconds_total (the rate over that)
- process_open_fds
...and machine-related metrics that would be interesting for the host the node exporter ran on:
- node_memory_MemFree
- node_memory_Cached
- node_memory_Buffers
- node_cpu (rate over this one)
- node_load1
- node_disk_io_now (in-progress disk IOs)
- node_network_receive_bytes
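As a rough sketch of spot-checking those process metrics directly and checking for swapping on the host (the port and metric names come from this thread; everything else is an assumption):
# Pull the process-level metrics straight from the exporter's /metrics page.
curl -s localhost:7301/metrics | grep -E '^process_(goroutines|resident_memory_bytes|virtual_memory_bytes|cpu_seconds_total|open_fds)'
# Watch for swapping: non-zero si/so columns mean pages moving to/from swap.
vmstat 5 5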
Okay, I updated the gist with the full stack trace (it took almost 8 hours to generate :open_mouth:).
My Prometheus only retains 4 hours' worth of data (disk IO got really slow when I gave it more; we are running on an AWS t2.medium), so those metrics will be kind of hard to get at.
The usual load on this machine is next to nothing. I have the node exporter set to a memory limit of 50M in my systemd unit (do you think it needs more?).
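In case it helps, a hedged way to see how close the process gets to that limit; the cgroup path assumes systemd's default system.slice layout and a unit named node_exporter.service, both of which are guesses.
# Current and peak memory charged to the unit's cgroup (cgroup v1 layout).
cat /sys/fs/cgroup/memory/system.slice/node_exporter.service/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/system.slice/node_exporter.service/memory.max_usage_in_bytes
# A non-zero failcnt means the MemoryLimit has actually been hit.
cat /sys/fs/cgroup/memory/system.slice/node_exporter.service/memory.failcnt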
Looking through the logs, it looks like something happened around 'I0518 18:36:07' where things started getting slower:
I0518 18:35:17.344805 30572 node_exporter.go:98] OK: stat success after 0.000160s.
I0518 18:35:22.345006 30572 node_exporter.go:98] OK: stat success after 0.000162s.
I0518 18:35:27.347632 30572 node_exporter.go:98] OK: stat success after 0.000157s.
I0518 18:35:32.344479 30572 node_exporter.go:98] OK: stat success after 0.000167s.
I0518 18:35:37.346816 30572 node_exporter.go:98] OK: stat success after 0.000159s.
I0518 18:35:42.345793 30572 node_exporter.go:98] OK: stat success after 0.000169s.
I0518 18:35:47.345620 30572 node_exporter.go:98] OK: stat success after 0.000162s.
I0518 18:35:52.348205 30572 node_exporter.go:98] OK: stat success after 0.000185s.
I0518 18:35:57.344761 30572 node_exporter.go:98] OK: stat success after 0.000172s.
I0518 18:36:02.345716 30572 node_exporter.go:98] OK: stat success after 0.000157s.
I0518 18:36:07.598528 30572 node_exporter.go:98] OK: stat success after 0.010153s.
I0518 18:36:12.833998 30572 node_exporter.go:98] OK: stat success after 0.276153s.
I0518 18:36:17.852887 30572 node_exporter.go:98] OK: stat success after 0.251907s.
I0518 18:36:22.695873 30572 node_exporter.go:98] OK: stat success after 0.014766s.
I0518 18:36:27.657642 30572 node_exporter.go:98] OK: stat success after 0.016069s.
I0518 18:36:32.658599 30572 node_exporter.go:98] OK: stat success after 0.016960s.
I0518 18:36:37.653351 30572 node_exporter.go:98] OK: stat success after 0.199047s.
I0518 18:36:42.614556 30572 node_exporter.go:98] OK: stat success after 0.006177s.
Looking closer at that time range, I cannot tell what would have caused the slowdown. Things get really bad here:
I0518 20:35:56.787647 30572 node_exporter.go:98] OK: netdev success after 0.000339s.
I0518 20:35:56.787674 30572 node_exporter.go:98] OK: time success after 0.000003s.
I0518 20:35:56.787895 30572 node_exporter.go:98] OK: netstat success after 0.000201s.
I0518 20:35:56.788058 30572 node_exporter.go:98] OK: diskstats success after 0.000118s.
I0518 20:35:53.814827 30572 node_exporter.go:98] OK: time success after 0.000011s.
I0518 20:35:25.796216 30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 20:35:29.001793 30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 20:35:52.398907 30572 node_exporter.go:98] OK: time success after 0.000008s.
I0518 20:35:22.792778 30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 20:35:24.086548 30572 node_exporter.go:98] OK: time success after 0.000007s.
I0518 20:35:24.527268 30572 node_exporter.go:98] OK: time success after 0.000006s.
I0518 20:35:50.266417 30572 node_exporter.go:98] OK: filesystem success after 0.000287s.
I0518 20:35:25.796643 30572 node_exporter.go:98] OK: netdev success after 0.000371s.
I0518 20:46:53.967890 30572 node_exporter.go:98] OK: loadavg success after 84.566580s.
I0518 20:35:21.660582 30572 node_exporter.go:98] OK: netdev success after 0.000411s.
I0518 20:53:51.171998 30572 node_exporter.go:98] OK: loadavg success after 977.176057s.
I0518 20:46:54.267224 30572 node_exporter.go:98] OK: loadavg success after 40.160530s.
I0518 20:48:04.293429 30572 node_exporter.go:98] OK: loadavg success after 657.562015s.
I0518 20:47:58.196109 30572 node_exporter.go:98] OK: loadavg success after 699.850948s.
I0518 20:35:50.456721 30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 20:35:21.493450 30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 20:47:57.299693 30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 20:35:48.280044 30572 node_exporter.go:98] OK: time success after 0.000007s.
I0518 20:35:20.950916 30572 node_exporter.go:98] OK: time success after 0.000006s.
I0518 20:47:58.196097 30572 node_exporter.go:98] OK: netstat success after 699.830764s.
I0518 20:51:29.740396 30572 node_exporter.go:98] OK: netstat success after 828.727282s.
I0518 20:48:35.090167 30572 node_exporter.go:98] OK: loadavg success after 657.785694s.
I0518 20:53:21.403094 30572 node_exporter.go:98] OK: attributes success after 863.295223s.
I0518 21:09:17.576817 30572 node_exporter.go:98] OK: netstat success after 1889.755354s.
I0518 20:47:57.788156 30572 node_exporter.go:98] OK: loadavg success after 754.159794s.
I0518 20:35:25.386415 30572 node_exporter.go:98] OK: filesystem success after 0.000293s.
I0518 20:58:41.445939 30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 20:35:49.117876 30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 20:47:56.418723 30572 node_exporter.go:98] OK: time success after 270.894575s.
I0518 20:48:49.139520 30572 node_exporter.go:98] OK: time success after 0.000012s.
I0518 20:52:05.805535 30572 node_exporter.go:98] OK: loadavg success after 698.312665s.
I0518 20:50:58.722561 30572 node_exporter.go:98] OK: loadavg success after 699.930957s.
I0518 20:49:21.125827 30572 node_exporter.go:98] OK: loadavg success after 700.511116s.
I0518 21:00:37.710201 30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 20:35:19.374094 30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 21:02:47.132333 30572 node_exporter.go:98] OK: textfile success after 0.102544s.
I0518 21:02:36.343913 30572 node_exporter.go:98] OK: netstat success after 1433.712689s.
I0518 20:49:43.494054 30572 node_exporter.go:98] OK: loadavg success after 697.772662s.
I0518 21:01:17.995762 30572 node_exporter.go:98] OK: time success after 342.470954s.
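One quick way to rank the slowest collector runs in a log like this (a sketch; the log file name is a placeholder for the actual glog file):
# Strip the trailing "s." from the duration, then print duration and
# collector name, slowest first.
grep 'success after' node_exporter.log | sed 's/s\.$//' | awk '{print $NF, $6}' | sort -rn | head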
I would imagine that something takes a lock and then a whole bunch of goroutines pile up? Maybe that even drives memory really high, and then it starts swapping? I will look into swapping for the systemd unit.
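A hedged way to check the per-process side of that theory while it is happening ($PID being the exporter's PID):
# VmSwap is how much of this process's memory currently sits in swap.
grep VmSwap /proc/$PID/status
# Overall swap usage on the box.
free -m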
Yes, that could be the problem. A couple of us have stared at this a bit, but no conclusion yet :-/
Just FYI, I ran:
curl localhost:7301/debug/pprof/goroutine?debug=2
goroutine 2153562 [running]:
runtime/pprof.writeGoroutineStacks(0x7f1ba1fa80d8, 0xc2087579a0, 0x0, 0x0)
/usr/local/go/src/runtime/pprof/pprof.go:511 +0x8d
runtime/pprof.writeGoroutine(0x7f1ba1fa80d8, 0xc2087579a0, 0x2, 0x0, 0x0)
/usr/local/go/src/runtime/pprof/pprof.go:500 +0x4f
runtime/pprof.(*Profile).WriteTo(0xced2c0, 0x7f1ba1fa80d8, 0xc2087579a0, 0x2, 0x0, 0x0)
/usr/local/go/src/runtime/pprof/pprof.go:229 +0xd5
net/http/pprof.handler.ServeHTTP(0xc2082076f1, 0x9, 0x7f1ba1f85030, 0xc2087579a0, 0xc208173380)
/usr/local/go/src/net/http/pprof/pprof.go:169 +0x35f
net/http/pprof.Index(0x7f1ba1f85030, 0xc2087579a0, 0xc208173380)
/usr/local/go/src/net/http/pprof/pprof.go:181 +0x15e
net/http.HandlerFunc.ServeHTTP(0x9a4068, 0x7f1ba1f85030, 0xc2087579a0, 0xc208173380)
/usr/local/go/src/net/http/server.go:1265 +0x41
net/http.(*ServeMux).ServeHTTP(0xc20803ab10, 0x7f1ba1f85030, 0xc2087579a0, 0xc208173380)
/usr/local/go/src/net/http/server.go:1541 +0x17d
net/http.serverHandler.ServeHTTP(0xc20805a600, 0x7f1ba1f85030, 0xc2087579a0, 0xc208173380)
/usr/local/go/src/net/http/server.go:1703 +0x19a
net/http.(*conn).serve(0xc208757900)
/usr/local/go/src/net/http/server.go:1204 +0xb57
created by net/http.(*Server).Serve
/usr/local/go/src/net/http/server.go:1751 +0x35e
goroutine 1 [IO wait]:
net.(*pollDesc).Wait(0xc2080d0840, 0x72, 0x0, 0x0)
/usr/local/go/src/net/fd_poll_runtime.go:84 +0x47
net.(*pollDesc).WaitRead(0xc2080d0840, 0x0, 0x0)
/usr/local/go/src/net/fd_poll_runtime.go:89 +0x43
net.(*netFD).accept(0xc2080d07e0, 0x0, 0x7f1ba1f82dc0, 0xc2085c4ac0)
/usr/local/go/src/net/fd_unix.go:419 +0x40b
net.(*TCPListener).AcceptTCP(0xc20804c1e8, 0x48f73e, 0x0, 0x0)
/usr/local/go/src/net/tcpsock_posix.go:234 +0x4e
net/http.tcpKeepAliveListener.Accept(0xc20804c1e8, 0x0, 0x0, 0x0, 0x0)
/usr/local/go/src/net/http/server.go:1976 +0x4c
net/http.(*Server).Serve(0xc20805a600, 0x7f1ba1f84e78, 0xc20804c1e8, 0x0, 0x0)
/usr/local/go/src/net/http/server.go:1728 +0x92
net/http.(*Server).ListenAndServe(0xc20805a600, 0x0, 0x0)
/usr/local/go/src/net/http/server.go:1718 +0x154
net/http.ListenAndServe(0x7fff614d5f8e, 0x5, 0x0, 0x0, 0x0, 0x0)
/usr/local/go/src/net/http/server.go:1808 +0xba
main.main()
/home/bjoern/gocode/src/github.com/prometheus/node_exporter/node_exporter.go:185 +0x73e
goroutine 7 [chan receive]:
github.com/golang/glog.(*loggingT).flushDaemon(0xd0e7c0)
/home/bjoern/gocode/src/github.com/prometheus/node_exporter/.build/gopath/src/github.com/golang/glog/glog.go:879 +0x78
created by github.com/golang/glog.init·1
/home/bjoern/gocode/src/github.com/prometheus/node_exporter/.build/gopath/src/github.com/golang/glog/glog.go:410 +0x2a7
goroutine 17 [syscall, 11494 minutes, locked to thread]:
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:2232 +0x1
goroutine 8 [syscall, 11494 minutes]:
os/signal.loop()
/usr/local/go/src/os/signal/signal_unix.go:21 +0x1f
created by os/signal.init·1
/usr/local/go/src/os/signal/signal_unix.go:27 +0x35
It is currently using 100MB of memory, whereas the consul_exporters that I am running use only about 20MB. I feel like things are slowing down as well; scrape_duration_seconds is steadily climbing and is now at 2.2 seconds.
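A quick local check of the scrape latency without going through Prometheus (a sketch, using port 7301 from this thread):
# Time a single scrape of the metrics endpoint end to end.
curl -s -o /dev/null -w '%{time_total}\n' localhost:7301/metrics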
Hmm, process_resident_memory_bytes says about 12MB, and so does ps, but cAdvisor is telling me that it is using all 100MB allocated for that cgroup.
lsof -p 3662
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
node_expo 3662 root cwd DIR 202,9 4096 2 /
node_expo 3662 root rtd DIR 202,9 4096 2 /
node_expo 3662 root txt REG 202,9 10594835 30 /var/lib/prometheus/node_exporter
node_expo 3662 root mem REG 202,3 51152 33867 /usr/lib64/libnss_files-2.19.so
node_expo 3662 root mem REG 202,3 1787096 33884 /usr/lib64/libc-2.19.so
node_expo 3662 root mem REG 202,3 109288 33865 /usr/lib64/libpthread-2.19.so
node_expo 3662 root mem REG 202,3 144720 33863 /usr/lib64/ld-2.19.so
node_expo 3662 root 0r CHR 1,3 0t0 1036 /dev/null
node_expo 3662 root 1u unix 0xffff88006ecfde80 0t0 31659765 socket
node_expo 3662 root 2u unix 0xffff88006ecfde80 0t0 31659765 socket
node_expo 3662 root 3u REG 0,30 92607463 31659776 /tmp/node_exporter.ip-172-31-17-127.root.log.INFO.20150520-162710.3662
node_expo 3662 root 4u IPv6 31659778 0t0 TCP *:7301 (LISTEN)
node_expo 3662 root 5u 0000 0,10 0 6638 anon_inode
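To make the ps-vs-cAdvisor discrepancy concrete, a hedged sketch of comparing the two views; the cgroup path again assumes system.slice/node_exporter.service. The cgroup's cache line counts page cache for files the process has written (such as the glog file on /tmp shown above), which the process's RSS does not include.
# Resident set size as ps sees it (kB).
ps -o rss= -p 3662
# What the memory cgroup is charged with: rss plus page cache (bytes).
grep -E '^(cache|rss) ' /sys/fs/cgroup/memory/system.slice/node_exporter.service/memory.stat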
Okay. Crazy thing here: I truncated the log file,
echo > /tmp/node_exporter.ip-172-31-17-127.root.log.INFO.20150520-162710.3662
and immediately got back all of the memory. Also, scrape_duration_seconds went right back down to 0.1.
I am adding -logtostderr=true, and we shall see in a week whether this is solved.
Perhaps a problem with your /tmp? Disk full? Or /tmp on ramdisk (and also full)?
systemd was essentially saying that the log file seemed to be part of the memory accounting for the cgroup (not sure whether glog is doing something like mlocking the log file or not). I am unable to reproduce it now, but /tmp is mounted on an 8GB partition and the file was only 90MB or so, so it was not full. My unit file just looks like:
[Unit]
Description=Prometheus Node Exporter Service
[Service]
TimeoutStartSec=10m
MemoryLimit=100M
Restart=always
RestartSec=30
ExecStart=-/var/lib/prometheus/node_exporter\
-web.listen-address=:7301
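For reference, the -logtostderr change mentioned above would just be an extra flag on that ExecStart line, roughly like this (a sketch, reusing the path and port from the unit file above):
ExecStart=-/var/lib/prometheus/node_exporter \
  -web.listen-address=:7301 \
  -logtostderr=true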
Oh... we should switch node_exporter (like everything else) from glog to https://github.com/prometheus/log, which only logs to stderr anyway. I'll do that, and hopefully you can already confirm that the issue is gone when logging to stderr!
https://github.com/prometheus/node_exporter/pull/79 to switch node_exporter to prometheus/log.
Has this been resolved by the logging change?
It has not been a week yet, but I am now seeing the memory stay very steady at < 20MB. I would say it was likely the problem. Thanks for working with me on this.
Great to hear, thanks!
I have been running node_exporter version 0.8.0 for a while now, maybe 2 weeks, and I am now seeing a pretty high IO load. If I run iotop I see it reading megabytes of data in a matter of seconds. It is basically using 99% of the IO on the system.
If I run lsof -p $PID I see a lot of open file descriptors to /proc/stat, /proc/$PID/limits, /proc/meminfo, and /proc/$PID/stat.
If this helps:
sudo lsof -p 25359 | awk '{print $9}' | sort | uniq -c | sort -n