Open isavcic opened 6 years ago
Thanks for the report. I can reproduce it also.
It looks like a problem with Go using the cgo name resolver. The issue doesn't manifest when the service is started with GODEBUG=netdns=go, for example: GODEBUG=netdns=go ./node_exporter --collector.textfile.directory=/etc/prometheus/node_exporter --collector.supervisord
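For reference, the same preference for the pure-Go resolver can also be expressed in code rather than through the environment. The following is only a minimal sketch, not what node_exporter does: the helper name newPureGoResolver is hypothetical, and the setting applies only to lookups made through this particular resolver, not to every HTTP client in the process.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// newPureGoResolver is a hypothetical helper. It returns a resolver that asks
// the net package to prefer Go's built-in DNS client over the cgo
// (getaddrinfo) path where that is possible.
func newPureGoResolver() *net.Resolver {
	return &net.Resolver{PreferGo: true}
}

func main() {
	// Bound the lookup so a misbehaving resolver cannot hang the program.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	addrs, err := newPureGoResolver().LookupHost(ctx, "localhost")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("localhost resolves to:", addrs)
}
```

To affect outgoing connections, such a resolver would have to be wired into a net.Dialer through its Resolver field; GODEBUG=netdns=go remains the simpler process-wide switch.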
I think this is a dupe of https://github.com/prometheus/node_exporter/issues/859; I'm guessing https://github.com/prometheus/node_exporter/pull/978 will fix it.
@SuperQ I'm not sure since the crash happens the second time I hit the /metrics endpoint. Also I've tested with #978 and I can still reproduce the crash.
Starting node_exporter with GODEBUG=netdns=2:
INFO[0000] Listening on :9100 source="node_exporter.go:76"
go package net: dynamic selection of DNS resolver
go package net: hostLookupOrder(localhost) = cgo
ERRO[0002] ERROR: supervisord collector failed after 0.005999s: Post http://localhost:9001/RPC2: dial tcp 127.0.0.1:9001: connect: connection refused source="collector.go:123"
Does the exporter actually crash? Or does it just log an error when supervisord is unavailable?
The ERROR seems fine if there is no supervisord.
I didn't paste the full logs but yes it is a crash. Full logs: https://paste.fedoraproject.org/paste/pG4GXk-5i7RnbwOC5TAT4Q
Crashes for me, yes.
Confirming it doesn't crash with GODEBUG=netdns=go
I see a call to getaddrinfo in the backtrace. If a standard promu build was used to build the program, then the cause is likely this glibc bug (or one of its cousins):
I expect the problem goes away if you patch promu build not to pass -extldflags -static and rebuild.
I confirm that it doesn't crash anymore when built without -extldflags -static.
Linux simon-laptop 4.17.5-200.fc28.x86_64 #1 SMP Tue Jul 10 13:39:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
FWIW, there's no way to avoid -extldflags -static with promu currently (see https://github.com/prometheus/promu/blob/23b5b8451e9efd4a4f3e81d4f90a2feabb918ab9/cmd/build.go#L177-L179).
@SuperQ we happened to discuss Go, cgo and static linking with @fweimer yesterday because we faced issues with our internal builds of node_exporter.
Thanks, I'm no Go build expert. The node_exporter has a few different problems due to the build. We get crashes with Go 1.10, so the official build is still on 1.9.
Some more progress from googling this issue: I've found this golang issue, and it looks like the crash itself depends on the nsswitch configuration, which determines whether the Go runtime uses the go or the cgo resolver.
My /etc/nsswitch.conf is:
passwd: sss files systemd
shadow: files sss
group: sss files systemd
hosts: files mdns4_minimal [NOTFOUND=return] dns myhostname
bootparams: nisplus [NOTFOUND=return] files
ethers: files
netmasks: files
networks: files
protocols: files
rpc: files
services: files sss
netgroup: nisplus sss
publickey: nisplus
automount: files nisplus
aliases: files nisplus
If I remove myhostname from hosts (as stated in the linked bug), node_exporter stops crashing.
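To check whether a given host is likely to be affected, the hosts line can be inspected programmatically. The sketch below is a rough diagnostic only: the function names hostsSources and nonBasicSources are hypothetical, and treating everything other than files and dns as "may push Go towards the cgo resolver" is a simplifying assumption, since the actual selection logic in Go's net package is more nuanced.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// hostsSources returns the sources listed on the "hosts:" line of an
// nsswitch.conf document, skipping action items such as [NOTFOUND=return].
func hostsSources(conf string) []string {
	var sources []string
	sc := bufio.NewScanner(strings.NewReader(conf))
	for sc.Scan() {
		line := sc.Text()
		// Drop trailing comments.
		if i := strings.IndexByte(line, '#'); i >= 0 {
			line = line[:i]
		}
		fields := strings.Fields(line)
		if len(fields) == 0 || fields[0] != "hosts:" {
			continue
		}
		for _, f := range fields[1:] {
			if strings.HasPrefix(f, "[") {
				continue // action item, not a source
			}
			sources = append(sources, f)
		}
	}
	return sources
}

// nonBasicSources returns the sources other than "files" and "dns".
// Assumption: these are the ones worth reviewing when the cgo resolver
// is being selected unexpectedly.
func nonBasicSources(sources []string) []string {
	var out []string
	for _, s := range sources {
		if s != "files" && s != "dns" {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	conf := "hosts: files mdns4_minimal [NOTFOUND=return] dns myhostname\n"
	fmt.Println(nonBasicSources(hostsSources(conf)))
}
```

Run against the configuration above, this reports mdns4_minimal and myhostname as the entries to review.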
Again, there are probably other ways for node_exporter to crash, given that go build displays warnings during linking:
/tmp/go-link-489955877/000021.o: In function `mygetgrouplist':
/usr/local/go/src/os/user/getgrouplist_unix.go:15: warning: Using 'getgrouplist' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/tmp/go-link-489955877/000020.o: In function `mygetgrgid_r':
/usr/local/go/src/os/user/cgo_lookup_unix.go:38: warning: Using 'getgrgid_r' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/tmp/go-link-489955877/000020.o: In function `mygetgrnam_r':
/usr/local/go/src/os/user/cgo_lookup_unix.go:43: warning: Using 'getgrnam_r' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/tmp/go-link-489955877/000020.o: In function `mygetpwnam_r':
/usr/local/go/src/os/user/cgo_lookup_unix.go:33: warning: Using 'getpwnam_r' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/tmp/go-link-489955877/000020.o: In function `mygetpwuid_r':
/usr/local/go/src/os/user/cgo_lookup_unix.go:28: warning: Using 'getpwuid_r' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/tmp/go-link-489955877/000006.o: In function `_cgo_f7895c2c5a3a_C2func_getaddrinfo':
/tmp/go-build/cgo-gcc-prolog:46: warning: Using 'getaddrinfo' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
@isavcic can you share your /etc/nsswitch.conf file?
Sure thing:
passwd: sss files
shadow: files sss
group: sss files
hosts: files dns myhostname
bootparams: nisplus [NOTFOUND=return] files
ethers: files
netmasks: files
networks: files
protocols: files
rpc: files
services: files sss
netgroup: nisplus sss
publickey: nisplus
automount: files nisplus
aliases: files nisplus
@isavcic thanks, so this is the same explanation AFAICT.
Could you try building with go1.11 and see if this problem is fixed there?
@discordianfish The problem is the enforced fully static linking in promu build; a Go update will not change that.
@fweimer But isn't this an issue with netgo's implementation of getaddrinfo that could very well be fixed in the latest versions?
Host operating system: output of uname -a
Linux example.com 4.14.26-54.32.amzn2.x86_64 #1 SMP Tue Mar 27 21:50:30 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
node_exporter version: output of node_exporter --version
node_exporter command line flags
./node_exporter-0.16.0.linux-amd64/node_exporter --collector.textfile.directory=/etc/prometheus/node_exporter --collector.supervisord
Are you running node_exporter in Docker?
No.
What did you do that produced an error?
Starting the node_exporter with --collector.supervisord enabled while supervisord isn't listening on 127.0.0.1:9001.
What did you expect to see?
To see an error, but for node_exporter to keep on running.
What did you see instead?
node_exporter panicking with this output.