sni / lmd

Livestatus Multitool Daemon - Create livestatus federation from multiple sources
https://labs.consol.de/omd/packages/lmd/
GNU General Public License v3.0

socket: too many open files / live.sock: bind: address already in use #34

jgbuenaventura closed this issue 5 years ago

jgbuenaventura commented 6 years ago

Hi Sven,

I am running an LMD cluster and usually get the error message below when I stop and start one of the LMD processes in the cluster. LMD will recover but will produce the same error again. My workaround is to kill the connections and start LMD again. Is there a way for LMD to handle this itself? Thanks in advance.

Specs: LMD 1.3.0 cluster (2 LMD servers and a Thruk server), Thruk 2.16~2, Nagios 4.3.1 and 4.1.1, MK-Livestatus 1.2.6p15

Error Log:

No Backend available
None of the configured Backends could be reached, please have a look at the logfile for detailed information and make sure the core is up and running.

Details: /var/cache/thruk/lmd/live.sock: The request contains an invalid header. - Post http://172.26.66.70:8080/query: dial tcp 172.26.66.70:8080: socket: too many open files at /usr/share/thruk/lib/Monitoring/Livestatus/Class/Lite.pm line 380

net/http.(*connReader).Read(0xc422332600, 0xc422342000, 0x1000, 0x1000, 0xc422193d38, 0x5b19fc, 0xc42211de80)
        /usr/lib/golang/src/net/http/server.go:753 +0x105
bufio.(*Reader).fill(0xc42217cde0)
        /usr/lib/golang/src/bufio/bufio.go:97 +0x11a
bufio.(*Reader).Peek(0xc42217cde0, 0x4, 0x1895617d4, 0xba39e0, 0x0, 0x0, 0xba39e0)
        /usr/lib/golang/src/bufio/bufio.go:129 +0x3a
net/http.(*conn).serve(0xc422095040, 0xb685e0, 0xc42217a780)
        /usr/lib/golang/src/net/http/server.go:1826 +0x88f
created by net/http.(*Server).Serve
        /usr/lib/golang/src/net/http/server.go:2720 +0x288

[2018-06-14 10:41:05][Info][peer.go:677] [nagt99] updating objects failed after: 214.98µs: dial tcp 172.26.66.208:6557: socket: too many open files
[2018-06-14 10:41:05][Info][main.go:465] got sigint, quitting
[2018-06-14 10:41:05][Info][listener.go:327] stopping listener on :8080
[2018-06-14 10:41:05][Info][listener.go:253] stopping unix listener on /var/cache/thruk/lmd/live.sock
[2018-06-14 10:41:05][Info][listener.go:266] unix listener /var/cache/thruk/lmd/live.sock shutdown complete
[2018-06-14 10:41:05][Info][listener.go:253] stopping unix listener on /var/cache/thruk/lmd/live.sock
[2018-06-14 10:41:05][Info][listener.go:266] unix listener /var/cache/thruk/lmd/live.sock shutdown complete
[2018-06-14 10:41:05][Warn][response.go:457] sending error response: 400 - Post http://172.26.66.201:8080/query: read tcp 172.26.66.70:51118->172.26.66.201:8080: read: connection reset by peer
[2018-06-14 10:41:12][Info][listener.go:248] listening for incoming queries on unix /var/cache/thruk/lmd/live.sock
[2018-06-14 10:41:12][Fatal][listener.go:240] listen error: listen unix /var/cache/thruk/lmd/live.sock: bind: address already in use

[2018-06-14 10:43:51][Warn][response.go:457] sending error response: 400 - Post http://172.26.66.70:8080/query: dial tcp 172.26.66.70:8080: socket: too many open files
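
For context, the fatal "listen unix ... bind: address already in use" at the end of the log typically means the socket file from the previous run was still present (or still held by the old process) when the new listener tried to bind. Below is a minimal Go sketch of the generic remedy, removing a stale unix socket file before listening again; it is an illustration only and not taken from the lmd source.

package main

import (
	"log"
	"net"
	"os"
)

// listenUnix removes a leftover socket file (e.g. from a crashed previous run)
// before binding. This is only safe if no other process is still serving on
// that path.
func listenUnix(path string) (net.Listener, error) {
	if _, err := os.Stat(path); err == nil {
		if err := os.Remove(path); err != nil {
			return nil, err
		}
	}
	return net.Listen("unix", path)
}

func main() {
	// Path taken from the log above, used here purely as an example.
	l, err := listenUnix("/var/cache/thruk/lmd/live.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer l.Close()
}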

sni commented 6 years ago

Can you check the connections with lsof -p <pidof lmd> when the issue occurs? Do the connections look reasonable? Does the number of connections grow over time?
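
Besides lsof, one simple way to check this over time is to poll /proc/<pid>/fd and watch whether the descriptor count keeps climbing. A minimal Go sketch, assuming Linux; the PID here is a placeholder and should be replaced with the output of pidof lmd.

package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	pid := 1992 // placeholder: replace with the actual lmd PID
	for {
		// Each entry in /proc/<pid>/fd is one open file descriptor.
		entries, err := os.ReadDir(fmt.Sprintf("/proc/%d/fd", pid))
		if err != nil {
			fmt.Fprintln(os.Stderr, "reading fd dir:", err)
			os.Exit(1)
		}
		fmt.Printf("%s open fds: %d\n", time.Now().Format(time.RFC3339), len(entries))
		time.Sleep(time.Minute)
	}
}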

jgbuenaventura commented 6 years ago

Hi Sven, thanks for the quick answer. I'll try to reproduce the error again and will give you feedback :)

jgbuenaventura commented 6 years ago

Hi Sven,

Here's the error from the Thruk dashboard:

None of the configured Backends could be reached, please have a look at the logfile for detailed information and make sure the core is up and running.

Details: /var/cache/thruk/lmd/live.sock: The request contains an invalid header. - Post http://dkrdswpplmdp01:8080/query: dial tcp dkrdswpplmdp01:8080: socket: too many open files at /usr/share/thruk/lib/Monitoring/Livestatus/Class/Lite.pm line 380

Here's the lsof output:

[root@dkrdswpplmdp01 ~]# lsof -p $(pidof lmd)
COMMAND  PID USER  FD TYPE DEVICE             SIZE/OFF NODE    NAME
lmd     1992 root cwd  DIR 253,0              4096     2       /
lmd     1992 root rtd  DIR 253,0              4096     2       /
lmd     1992 root txt  REG 253,0              11657349 798799  /opt/local/go/bin/lmd
lmd     1992 root mem  REG 253,0              1924768  1438991 /lib64/libc-2.12.so
lmd     1992 root mem  REG 253,0              143280   1439049 /lib64/libpthread-2.12.so
lmd     1992 root mem  REG 253,0              159312   1439135 /lib64/ld-2.12.so
lmd     1992 root  0r  CHR 1,3                0t0      3881    /dev/null
lmd     1992 root  1w  CHR 1,3                0t0      3881    /dev/null
lmd     1992 root  2w  CHR 1,3                0t0      3881    /dev/null
lmd     1992 root  3w  REG 253,2              2253651  9699521 /var/log/lmd.log
lmd     1992 root  4u  REG 0,9                0        3877    [eventpoll]
lmd     1992 root  5u IPv6 13319              0t0      TCP     *:webcache (LISTEN)
lmd     1992 root  6u unix 0xffff880630d30b80 0t0      13320   /var/cache/thruk/lmd/live.sock
lmd     1992 root  7r  CHR 1,9                0t0      3886    /dev/urandom
lmd     1992 root  9u IPv4 21478              0t0      TCP     dkrdswpplmdp01.vestasext.net:39238->dkrdswpplmdp02.vestasext.net:webcache (ESTABLISHED)
lmd     1992 root 10u IPv4 21479              0t0      TCP     dkrdswpplmdp01.vestasext.net:34216->dkrdswppthrukt01:webcache (ESTABLISHED)
lmd     1992 root 11u IPv6 21480              0t0      TCP     dkrdswpplmdp01.vestasext.net:webcache->dkrdswppthrukt01:51732 (ESTABLISHED)
[root@dkrdswpplmdp01 ~]#

The number of connections doesn't seem to grow over time and the connections look reasonable, but then it also doesn't give me an error on the Thruk dashboard.

jgbuenaventura commented 6 years ago

whooppppsss, it seems to be growing after all:

pthrukt01:webcache (ESTABLISHED)
lmd 1992 root 1008u IPv6 111375 0t0 TCP dkrdswpplmdp01.vestasext.net:webcache->dkrdswppthrukt01:46744 (ESTABLISHED)
lmd 1992 root 1009u IPv4 111377 0t0 TCP dkrdswpplmdp01.vestasext.net:36816->dkrdswppthrukt01:webcache (ESTABLISHED)
lmd 1992 root 1010u IPv6 111378 0t0 TCP dkrdswpplmdp01.vestasext.net:webcache->dkrdswppthrukt01:46746 (ESTABLISHED)
lmd 1992 root 1011u IPv4 111380 0t0 TCP dkrdswpplmdp01.vestasext.net:36818->dkrdswppthrukt01:webcache (ESTABLISHED)
lmd 1992 root 1012u IPv6 111381 0t0 TCP dkrdswpplmdp01.vestasext.net:webcache->dkrdswppthrukt01:46748 (ESTABLISHED)
lmd 1992 root 1013u IPv6 109855 0t0 TCP dkrdswpplmdp01.vestasext.net:webcache->dkrdswppthrukt01:45726 (ESTABLISHED)
lmd 1992 root 1014u IPv4 109857 0t0 TCP dkrdswpplmdp01.vestasext.net:35812->dkrdswppthrukt01:webcache (ESTABLISHED)
lmd 1992 root 1015u IPv6 109858 0t0 TCP dkrdswpplmdp01.vestasext.net:webcache->dkrdswppthrukt01:45728 (ESTABLISHED)
lmd 1992 root 1016u IPv4 109860 0t0 TCP dkrdswpplmdp01.vestasext.net:35814->dkrdswppthrukt01:webcache (ESTABLISHED)
lmd 1992 root 1017u IPv4 111383 0t0 TCP dkrdswpplmdp01.vestasext.net:36820->dkrdswppthrukt01:webcache (ESTABLISHED)
lmd 1992 root 1018u IPv6 111384 0t0 TCP dkrdswpplmdp01.vestasext.net:webcache->dkrdswppthrukt01:46750 (ESTABLISHED)
lmd 1992 root 1019u IPv4 111386 0t0 TCP dkrdswpplmdp01.vestasext.net:36822->dkrdswppthrukt01:webcache (ESTABLISHED)
lmd 1992 root 1021u IPv4 111396 0t0 TCP dkrdswpplmdp01.vestasext.net:41860->dkrdswpplmdp02.vestasext.net:webcache (ESTABLISHED)
lmd 1992 root 1022u IPv4 111397 0t0 TCP dkrdswpplmdp01.vestasext.net:36840->dkrdswppthrukt01:webcache (ESTABLISHED)

sni commented 6 years ago

This seems to be connections between the 2 LMD nodes, right?

jgbuenaventura commented 6 years ago

Yep, 2 connections are from the other LMD node. It is also the Thruk server.

sni commented 5 years ago

I reworked the cluster handling, so you might want to check again. It seemed that under certain circumstances clustered requests led to an endless loop, which then resulted in this behavior.
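
For anyone debugging a similar descriptor leak in a Go client: a very common cause of "too many open files" is HTTP response bodies that are never drained and closed, so keep-alive connections pile up instead of being reused. The sketch below only illustrates that generic pattern against an assumed /query endpoint; it is not code from lmd.

package main

import (
	"io"
	"log"
	"net/http"
)

// query posts to a peer and makes sure the response body is drained and
// closed, so the underlying connection can be reused instead of leaking.
func query(url string) error {
	resp, err := http.Post(url, "application/json", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// Without draining the body, the connection may not return to the pool.
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}

func main() {
	// Address taken from the logs above, used purely as an example.
	if err := query("http://172.26.66.70:8080/query"); err != nil {
		log.Println(err)
	}
}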