prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0

Panic and crash when supervisord collector enabled, yet supervisord not listening on 127.0.0.1:9001 #1007

Open isavcic opened 6 years ago

isavcic commented 6 years ago

Host operating system: output of uname -a

Linux example.com 4.14.26-54.32.amzn2.x86_64 #1 SMP Tue Mar 27 21:50:30 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 0.16.0 (branch: HEAD, revision: d42bd70f4363dced6b77d8fc311ea57b63387e4f)
  build user:       root@a67a9bc13a69
  build date:       20180515-15:52:42
  go version:       go1.9.6

node_exporter command line flags

./node_exporter-0.16.0.linux-amd64/node_exporter --collector.textfile.directory=/etc/prometheus/node_exporter --collector.supervisord

Are you running node_exporter in Docker?

No.

What did you do that produced an error?

Starting the node_exporter with --collector.supervisord enabled while supervisord isn't listening on 127.0.0.1:9001.

What did you expect to see?

To see an error, but for node_exporter to keep on running.

What did you see instead?

node_exporter panicking with this output.

simonpasquier commented 6 years ago

Thanks for the report. I can reproduce it as well.

simonpasquier commented 6 years ago

It looks like a problem with Go using the cgo name resolver. The issue doesn't manifest when the Go resolver is forced at startup, e.g. GODEBUG=netdns=go ./node_exporter --collector.textfile.directory=/etc/prometheus/node_exporter --collector.supervisord
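
For anyone who would rather scope the workaround to one client than set GODEBUG for the whole process, here is a minimal sketch using net.Resolver with PreferGo set. The Go documentation describes PreferGo as equivalent to GODEBUG=netdns=go but limited to that resolver. The helper names newPureGoResolver and newHTTPClient are made up for illustration, and this is not how the supervisord collector is wired today; it only shows the idea.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

// newPureGoResolver (hypothetical helper) returns a resolver that prefers
// Go's built-in DNS resolver over the cgo/getaddrinfo path.
func newPureGoResolver() *net.Resolver {
	return &net.Resolver{PreferGo: true}
}

// newHTTPClient (hypothetical helper) builds an HTTP client whose dials
// resolve host names through r instead of the process-wide default resolver.
func newHTTPClient(r *net.Resolver) *http.Client {
	dialer := &net.Dialer{Timeout: 5 * time.Second, Resolver: r}
	return &http.Client{
		Transport: &http.Transport{DialContext: dialer.DialContext},
		Timeout:   10 * time.Second,
	}
}

func main() {
	r := newPureGoResolver()
	client := newHTTPClient(r)
	fmt.Println("client timeout:", client.Timeout)

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Resolve the same name the collector dials; errors are reported, not fatal.
	addrs, err := r.LookupHost(ctx, "localhost")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("localhost resolves to:", addrs)
}
```

This only helps code paths that actually use the custom client; anything else in the process that resolves names through the default resolver can still end up in getaddrinfo.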

SuperQ commented 6 years ago

I think this is a dupe of https://github.com/prometheus/node_exporter/issues/859, I'm guessing https://github.com/prometheus/node_exporter/pull/978 will fix this.

simonpasquier commented 6 years ago

@SuperQ I'm not sure since the crash happens the second time I hit the /metrics endpoint. Also I've tested with #978 and I can still reproduce the crash.

Starting node_exporter with GODEBUG=netdns=2:

INFO[0000] Listening on :9100                            source="node_exporter.go:76"
go package net: dynamic selection of DNS resolver
go package net: hostLookupOrder(localhost) = cgo
ERRO[0002] ERROR: supervisord collector failed after 0.005999s: Post http://localhost:9001/RPC2: dial tcp 127.0.0.1:9001: connect: connection refused  source="collector.go:123"

SuperQ commented 6 years ago

Does the exporter actually crash, or does it just log an error when supervisord is unavailable?

The ERROR seems fine if there is no supervisord.

simonpasquier commented 6 years ago

I didn't paste the full logs, but yes, it is a crash. Full logs: https://paste.fedoraproject.org/paste/pG4GXk-5i7RnbwOC5TAT4Q

isavcic commented 6 years ago

Crashes for me, yes.

isavcic commented 6 years ago

Confirming it doesn't crash with GODEBUG=netdns=go.

fweimer commented 6 years ago

I see a call to getaddrinfo in the backtrace. If a standard promu build was used to build the program, then the cause is likely this glibc bug (or one of its cousins):

I expect the problem goes away if you patch promu build not to pass -extldflags -static and rebuild.
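
To check whether a given node_exporter binary was actually linked fully statically, a small ELF inspection can help. The sketch below treats "no PT_INTERP program header" as a sign of static linking. That is a common heuristic, not an official test, and the function name isStaticBinary is made up for illustration.

```go
package main

import (
	"debug/elf"
	"fmt"
	"os"
)

// isStaticBinary (hypothetical helper) reports whether the ELF file at path
// lacks a PT_INTERP program header, i.e. requests no dynamic loader.
// Assumption: absence of PT_INTERP is used here as a proxy for static linking.
func isStaticBinary(path string) (bool, error) {
	f, err := elf.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()

	for _, p := range f.Progs {
		if p.Type == elf.PT_INTERP {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: elfcheck <path-to-binary>")
		return
	}
	static, err := isStaticBinary(os.Args[1])
	if err != nil {
		fmt.Println("cannot inspect binary:", err)
		return
	}
	fmt.Printf("%s statically linked (no PT_INTERP): %v\n", os.Args[1], static)
}
```

Running it against a binary built with and without -extldflags -static should show the difference, which makes it easier to confirm which build is being tested.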

simonpasquier commented 6 years ago

I confirm that it doesn't crash anymore when built without -extldflags -static.

Linux simon-laptop 4.17.5-200.fc28.x86_64 #1 SMP Tue Jul 10 13:39:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

FWIW, there's no way to avoid -extldflags -static with promu currently (see https://github.com/prometheus/promu/blob/23b5b8451e9efd4a4f3e81d4f90a2feabb918ab9/cmd/build.go#L177-L179)

@SuperQ @fweimer and I happened to discuss Go, cgo and static linking yesterday because we faced issues with our internal builds of node_exporter.

SuperQ commented 6 years ago

Thanks, I'm no Go build expert. The node_exporter has a few different problems due to the build. We get crashes with Go 1.10, so the official build is still on Go 1.9.

simonpasquier commented 6 years ago

Some more googling progress on this issue: I've found this golang issue, and it looks like the crash itself depends on the nsswitch configuration, which determines whether the Go runtime uses the go or the cgo resolver.

My /etc/nsswitch.conf is:

passwd:      sss files systemd
shadow:     files sss
group:       sss files systemd
hosts:      files mdns4_minimal [NOTFOUND=return] dns myhostname
bootparams: nisplus [NOTFOUND=return] files
ethers:     files
netmasks:   files
networks:   files
protocols:  files
rpc:        files
services:   files sss
netgroup:   nisplus sss
publickey:  nisplus
automount:  files nisplus
aliases:    files nisplus

If I remove myhostname from hosts (as stated in the linked bug), node_exporter stops crashing.
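
To check a host quickly for the same condition, here is a rough diagnostic sketch that extracts the sources from the hosts: line of nsswitch.conf-style text and reports whether myhostname is among them. It is not Go's actual resolver-selection logic. The function names hostsSources and hasSource are made up for illustration, and the parser assumes whitespace after "hosts:" and single-token action items such as [NOTFOUND=return].

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// hostsSources (hypothetical helper) returns the lookup sources listed on the
// "hosts:" line, skipping comments and bracketed action items.
func hostsSources(conf string) []string {
	var sources []string
	sc := bufio.NewScanner(strings.NewReader(conf))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.IndexByte(line, '#'); i >= 0 {
			line = line[:i] // drop trailing comment
		}
		fields := strings.Fields(line)
		if len(fields) == 0 || fields[0] != "hosts:" {
			continue
		}
		for _, f := range fields[1:] {
			if strings.HasPrefix(f, "[") {
				continue // action item such as [NOTFOUND=return]
			}
			sources = append(sources, f)
		}
	}
	return sources
}

// hasSource (hypothetical helper) reports whether name is one of the hosts sources.
func hasSource(conf, name string) bool {
	for _, s := range hostsSources(conf) {
		if s == name {
			return true
		}
	}
	return false
}

func main() {
	conf := "passwd:     sss files\n" +
		"hosts:      files mdns4_minimal [NOTFOUND=return] dns myhostname\n"
	fmt.Println("hosts sources:", hostsSources(conf))
	fmt.Println("uses myhostname:", hasSource(conf, "myhostname"))
}
```

On a real machine the same functions could be fed the contents of /etc/nsswitch.conf read with os.ReadFile.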

There are probably other ways for node_exporter to crash as well, given that go build prints these warnings at link time:

/tmp/go-link-489955877/000021.o: In function `mygetgrouplist':
/usr/local/go/src/os/user/getgrouplist_unix.go:15: warning: Using 'getgrouplist' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/tmp/go-link-489955877/000020.o: In function `mygetgrgid_r':
/usr/local/go/src/os/user/cgo_lookup_unix.go:38: warning: Using 'getgrgid_r' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/tmp/go-link-489955877/000020.o: In function `mygetgrnam_r':
/usr/local/go/src/os/user/cgo_lookup_unix.go:43: warning: Using 'getgrnam_r' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/tmp/go-link-489955877/000020.o: In function `mygetpwnam_r':
/usr/local/go/src/os/user/cgo_lookup_unix.go:33: warning: Using 'getpwnam_r' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/tmp/go-link-489955877/000020.o: In function `mygetpwuid_r':
/usr/local/go/src/os/user/cgo_lookup_unix.go:28: warning: Using 'getpwuid_r' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/tmp/go-link-489955877/000006.o: In function `_cgo_f7895c2c5a3a_C2func_getaddrinfo':
/tmp/go-build/cgo-gcc-prolog:46: warning: Using 'getaddrinfo' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking

@isavcic can you share your /etc/nsswitch.conf file?

isavcic commented 6 years ago

Sure thing:

passwd:     sss files
shadow:     files sss
group:      sss files
hosts:      files dns myhostname
bootparams: nisplus [NOTFOUND=return] files
ethers:     files
netmasks:   files
networks:   files
protocols:  files
rpc:        files
services:   files sss
netgroup:   nisplus sss
publickey:  nisplus
automount:  files nisplus
aliases:    files nisplus

simonpasquier commented 6 years ago

@isavcic Thanks. So this is the same explanation, AFAICT.

discordianfish commented 6 years ago

Could you try building with go1.11 and see if this problem is fixed there?

fweimer commented 6 years ago

@discordianfish The problem is the enforced fully static linking in promu build, a Go update will not change that.

discordianfish commented 6 years ago

@fweimer But isn't this an issue with netgo's implementation of getaddrinfo that could very well be fixed in the latest versions?