Closed by daenney 5 years ago
@daenney ah, so it is; I didn't read closely enough. Should I file a new issue for that?
@austin987 I'm not sure. Might be good for visibility though, so we don't forget about tracking that one down too; otherwise we risk it getting buried in the rest of this issue. They feel related to me.
As I posted previously on April 2nd, I had disabled infiniband, zfs, wifi, nfs and nfsd, and rc.1 has been running on multiple servers with no issues since then. That leads me to believe the issue lies in one of those collectors.
Those collectors are active on my system but aren't actually doing anything, since I don't have any InfiniBand, ZFS, or NFS(d) active. Each of them bails out at the first check in Update(), which, at least in my case, makes them unlikely to be involved in this crash.
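For illustration, the "bails out at the first check" shape these collectors share could be sketched like this (names here are illustrative assumptions, not node_exporter's actual collector API):

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

var errNoData = errors.New("collector returned no data")

// update returns early when the kernel interface the collector
// depends on is absent, so the collector body never runs on
// machines without the hardware/feature.
func update(procPath string) error {
	if _, err := os.Stat(procPath); os.IsNotExist(err) {
		return errNoData // nothing to collect on this machine
	}
	// ... gather and emit metrics here ...
	return nil
}

func main() {
	fmt.Println(update("/proc/does-not-exist-on-this-box"))
}
```

Since the early return fires before any real work, a collector like this is enabled but effectively inert on hardware that lacks the feature.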
I do have WiFi, since there's a built-in WiFi device in this box. The only metric collected is node_wifi_interface_frequency_hertz; it's not actually active, scanning, or connected to any network. Looking at that code I'm having a hard time seeing how it would cause such an error. The one stack trace we do have, from @MorphBonehunter, doesn't mention anything related to the wifi collector in the crash, which I would expect to see if it was somehow involved.
Any progress on this? Any temporary fix?
The temporary fix is to build with Go 1.9.x.
I'm honestly a bit stumped though. I'm not sure where to go from here and golang/go#25128 kinda feels dead in the water (though I suppose that has to do with things ramping up to Go 1.11).
If anyone has suggestions on what to try, I'm all ears. I can build/run whatever and try to crash it, but it would help if we had some kind of strategy to narrow it down.
I guess we could run a bisect between Go 1.9.x and 1.10 and see when the crash starts to appear?
@daenney Yes, a bisect between 1.9.x and 1.10 is what I would suggest too. Would be awesome if you could look into this!
Alright, looks like something started moving in golang/go#25128 and with golang/go@4991bc6257a9e9d922f7b6e29e393d764c4e4295 we might be able to get a proper error trace out of it. I'll see about giving that a try first before I dive into bisecting Go 1.9 to 1.10, as that'll probably take a long time to narrow down given that it takes some time to trigger the error.
I stumbled into this issue by accident and took a quick look around in the code, so ignore me if I say something completely stupid, since I have no in-depth understanding.
With VMs and/or heisenbugs with varying timing, I usually suspect time-drift problems. There are ways to force time problems, e.g. ntpdsim, but I didn't get to that. Anyway, this suspicion led me to https://github.com/prometheus/node_exporter/blob/b62c7bc0ad37373b911fa45937348f3c594ce19d/collector/ntp.go#L148... but shouldn't leapMidnight be protected somehow? Because, as far as I can understand, Update may be called concurrently.
@egonelbre Good catch, yes, that does appear to need a mutex on it. Probably unrelated to this specific crash, as we're pretty sure it's in the CGO code, and that collector is pure go. I'll file a separate issue for that.
Has anyone tried building this with Go 1.11? I'm short on time right now to try and do a full bisect between Go 1.9 and 1.10, but if it happens to be gone with 1.11 I'd be good with that :smile:.
I will try running under the Go 1.11 beta and put our node_exporters up for the test. We have had 1-2 crashes per week on ~70 instances. I'll report back in a week.
Oh hey, you might remember me from that other time node_exporter found a Golang bug. I just hit this one too:
fatal error: runtime·unlock: lock count
fatal error: runtime·unlock: lock count
fatal error: runtime·unlock: lock count
fatal error: runtime·unlock: lock count
fatal error: runtime·unlock: lock count
fatal error: runtime·unlock: lock count
fatal error: runtime·unlock: lock count
fatal error: runtime·unlock: lock count
fatal error: runtime·unlock: lock count
fatal error: runtime·unlock: lock count
fatal error: runtime·unlock: lock count
fatal error: runtime·unlock: lock count
fatal error: fatal error: runtime·unlock: lock countruntime·lock: lock count
panic during panic
stack trace unavailable
runtime stack:
This is Gentoo again, versions:
node_exporter, version 0.16.0 (branch: non-git, revision: d42bd70)
build user: portage@binhost
build date: 20180619-21:57:10
go version: go1.10.1
Unfortunately this just happened on a VM with many siblings running on the same hardware, after days of uptime, and I will only have access to this hardware for a couple more days, so it's not looking good for reproducibility. However, I'll be running these VM images at home and see if I can get it to repro on other hardware.
@marcan Welcome back! :joy:
It's reasonably easy to reproduce. Slamming the node_exporter with vegeta at ~3 QPS will crash it in ~30min.
But nobody's been able to bisect enough things or get enough trace info to figure out so far. :(
Just in case hardware is related to reproducibility, I'll dump my stack info here:
(32-thread host)
(16-CPU guest)
I'll be testing the same exact guest VM image again later, but I won't be able to get ahold of identical server hardware (but I can try on a bunch of different hosts I do have access to). Hopefully I can spend some time tracking it down. Unfortunately it's less likely I'll be able to find time to do it on the original hardware where I saw the crash, before I lose access.
So far it doesn't seem hardware-related; it looks like a CGO issue specific to Go 1.10.
Unfortunately, I'm also seeing the issue with Go 1.11beta2.
Well, this just happened on another server of mine, so reproducibility won't be an issue. This time it spat out >95000 lines (!) of "fatal error: runtime·unlock: lock count" into the log.
node_exporter, version 0.16.0 (branch: non-git, revision: d42bd70)
build user: portage@binhost
build date: 20180717-22:35:36
go version: go1.10.3
Will debug later when I get a chance.
The most minimal example which still crashes for me is running with only the wifi collector enabled.
None of the node_exporters with the wifi collector disabled have crashed for me so far (but it's hard to tell if that just takes longer or eliminates the problem entirely). Do we have any reports of crashes with the wifi collector disabled? I only see suspicions that the wifi collector is possibly the culprit: https://github.com/prometheus/node_exporter/issues/870#issuecomment-397692373
Also note: sometimes it crashes within seconds, sometimes it takes >90min. Environment: Linux localhost 4.17.5-1-ARCH #1 SMP PREEMPT Sun Jul 8 17:27:31 UTC 2018 x86_64 GNU/Linux, go version go1.10.3 linux/amd64, node_exporter master @ 80a5712b97cd649b6bf2f9f60cf5c497043ef187. Also, the crash messages have been totally different (e.g. "systemstack called from unexpected goroutine", "fatal error: malloc deadlock"). All contained the "lock count" error somewhere. Sometimes crashes manifest in a few lines, sometimes in thousands; sometimes with "clean" stack traces, sometimes with stack traces mixed with other error messages.
Will try further tracking it down by trying different go versions and/or bisecting the wifi collector code, as time permits.
$ ./node_exporter $(for coll in arp bcache bonding conntrack cpu diskstats edac entropy filefd filesystem hwmon infiniband ipvs loadavg mdadm meminfo netdev netstat nfs nfsd sockstat stat textfile time timex uname vmstat netclass xfs zfs; do echo --no-collector.$coll; done) --web.listen-address=:9101
INFO[0000] Starting node_exporter (version=0.16.0, branch=fix-staticcheck, revision=80a5712b97cd649b6bf2f9f60cf5c497043ef187) source="node_exporter.go:82"
INFO[0000] Build context (go=go1.10.3, user=christian@wuechoo, date=20180729-18:47:59) source="node_exporter.go:83"
INFO[0000] Enabled collectors: source="node_exporter.go:90"
INFO[0000] - wifi source="node_exporter.go:97"
INFO[0000] Listening on :9101 source="node_exporter.go:111"
[...]
fatal error: runtime·unlock: lock count
fatal: morestack on g0
@hoffie Interesting, with the wifi collector on, do you get goroutine leaks like was reported in https://github.com/prometheus/node_exporter/issues/1008?
The wifi collector is the only one using netlink, which does use syscall.
/cc @mdlayher I don't think this is your fault, but thought you might want to be aware.
@SuperQ: No goroutine leaks that I would have spotted (no increased memory usage or anything). The go_goroutines metric seems to be in line with all my other node_exporter instances with other enabled collectors (8-14 goroutines) while being hammered with vegeta at 50 QPS.
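The kind of leak check described above can be sketched as a standalone program (illustrative only; the worker count, sleep durations, and tolerance are arbitrary assumptions):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// runWorkers starts n short-lived goroutines, waits for them, and
// reports the goroutine counts before and after -- the same kind of
// check used to rule out a leak in the wifi collector.
func runWorkers(n int) (before, after int) {
	before = runtime.NumGoroutine()

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			time.Sleep(10 * time.Millisecond)
		}()
	}
	wg.Wait()

	// give the runtime a moment to reap exited goroutines
	time.Sleep(50 * time.Millisecond)
	return before, runtime.NumGoroutine()
}

func main() {
	before, after := runWorkers(100)
	fmt.Printf("goroutines before=%d after=%d\n", before, after)
	if after > before+2 {
		fmt.Println("possible goroutine leak")
	}
}
```

In production, watching the go_goroutines metric over time (as done here) is the less invasive version of the same check.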
Please also note that my results / suggested conclusion could still be bad luck -- i.e. it could still be that "node_exporter without any collectors" or "node_exporter with at least one collector" would crash as well. For me, they haven't. The only certain information is that "node_exporter with wifi collector crashes". I should also say that my machine does not have wifi devices at all, so node_exporter's output is limited to (as expected):
node_scrape_collector_duration_seconds{collector="wifi"} 0.000785804
node_scrape_collector_success{collector="wifi"} 1
Edit: Just noticed that the netlink package also contains ASM code for setsockopt. I have no knowledge of ASM, but it seems possible to implement this using the more standard, maybe less unsafe, Syscall6 call. There also seem to be wrappers for that. From a quick glance it doesn't seem like this code path is hit at all, though.
I have a couple of ideas I can try to investigate tomorrow, but in the mean time, disabling the WiFi collector seems reasonable.
At its core, the WiFi collector is more or less doing something like this on every scrape cycle:
package main

import (
	"fmt"
	"runtime"
	"sync"

	"github.com/mdlayher/genetlink"
)

func main() {
	workers := runtime.NumCPU()

	var wg sync.WaitGroup
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer wg.Done()
			for {
				work()
			}
		}()
	}
	wg.Wait()
}

func work() {
	c, err := genetlink.Dial(nil)
	if err != nil {
		panicf("failed to dial: %v", err)
	}
	defer c.Close()

	if _, err := c.GetFamily("nl80211"); err != nil {
		panicf("failed to get family: %v", err)
	}
}

func panicf(format string, a ...interface{}) {
	panic(fmt.Sprintf(format, a...))
}
So I'm running this program to see if I can trigger a crash on my machine:
matt@nerr-2:~$ uname -a
Linux nerr-2 4.15.0-29-generic #31-Ubuntu SMP Tue Jul 17 15:39:52 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
matt@nerr-2:~$ go version
go version go1.10.3 linux/amd64
No luck so far though.
The only thing that comes to mind with that "lock count" message would be a behavior I am using in Go 1.10+: https://golang.org/doc/go1.10#runtime.
As far as I know, this is being done correctly: https://github.com/mdlayher/netlink/blob/master/conn_linux.go#L316-L328.
Hence, if there's an issue, I would think that my minimal program could reproduce it without the need to pull in node_exporter.
I've run my stress test on a beefy VM today and didn't run into any problems. At first I wondered if it was because I was on Go 1.10.3, but it appears above that 0.16.0 is also based on Go 1.10.3.
I'll try with node_exporter itself tomorrow.
@mdlayher: Your example code runs stable for me, too. I also tried extracting more of node_exporter into a standalone binary and testing it in a similar way -- no crash so far.
I built a node_exporter with LockOSThread usage disabled (see diff below). It has been running stable for 150+ minutes.
In my tests, crashes happened faster when I set GOMAXPROCS=2. This may just be luck, though.
Overall, these results make me believe that it may be some runtime.LockOSThread interaction with Go scheduling (goroutines, channels).
Will try building Go with the relevant 1.10 change reverted and will test with unchanged node_exporter source code/deps again.
During today's crashes I've seen some more variations in error messages/stack traces. Sometimes the "lock count" one is not contained at all.
fatal error: runtime·unlock: lock count
fatal: morestack on g0
INFO[0000] Listening on :9101 source="node_exporter.go:111"
fatal error: schedule: holding locks
runtime stack:
runtime.throw(0xa0c651, 0x17)
/usr/lib/go/src/runtime/panic.go:616 +0x81
runtime.schedule()
/usr/lib/go/src/runtime/proc.go:2489 +0x351
runtime.goschedImpl(0xc420162300)
/usr/lib/go/src/runtime/proc.go:2619 +0xf7
runtime.gosched_m(0xc420162300)
/usr/lib/go/src/runtime/proc.go:2627 +0x36
runtime.mcall(0x1007f66f4ff8a4e)
/usr/lib/go/src/runtime/asm_amd64.s:351 +0x5b
goroutine 1 [IO wait, 19 minutes]:
internal/poll.runtime_pollWait(0x7f66fd609f00, 0x72, 0x0)
/usr/lib/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc4202e8518, 0x72, 0xc42019e000, 0x0, 0x0)
/usr/lib/go/src/internal/poll/fd_poll_runtime.go:85 +0x9b
internal/poll.(*pollDesc).waitRead(0xc4202e8518, 0xffffffffffffff00, 0x0, 0x0)
/usr/lib/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Accept(0xc4202e8500, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
/usr/lib/go/src/internal/poll/fd_unix.go:372 +0x1a8
net.(*netFD).accept(0xc4202e8500, 0xc420282300, 0xc4205a1be0, 0x402ee8)
/usr/lib/go/src/net/fd_unix.go:238 +0x42
net.(*TCPListener).accept(0xc420192370, 0xc4205a1c10, 0x401e47, 0xc420282300)
/usr/lib/go/src/net/tcpsock_posix.go:136 +0x2e
net.(*TCPListener).AcceptTCP(0xc420192370, 0xc4205a1c58, 0xc4205a1c60, 0x18)
/usr/lib/go/src/net/tcpsock.go:246 +0x49
net/http.tcpKeepAliveListener.Accept(0xc420192370, 0xa2cd30, 0xc420282280, 0xa78400, 0xc42030e990)
/usr/lib/go/src/net/http/server.go:3216 +0x2f
net/http.(*Server).Serve(0xc420314270, 0xa780c0, 0xc420192370, 0x0, 0x0)
/usr/lib/go/src/net/http/server.go:2770 +0x1a5
net/http.(*Server).ListenAndServe(0xc420314270, 0xc420314270, 0x2)
/usr/lib/go/src/net/http/server.go:2711 +0xa9
net/http.ListenAndServe(0x7fffbd1df4a5, 0x5, 0x0, 0x0, 0x1, 0xc42030a7e0)
/usr/lib/go/src/net/http/server.go:2969 +0x7a
main.main()
/home/christian/go/src/github.com/prometheus/node_exporter/node_exporter.go:112 +0x9cf
goroutine 38 [IO wait, 12 minutes]:
internal/poll.runtime_pollWait(0x7f66fd609e30, 0x72, 0xc4205a3868)
/usr/lib/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc4202e8598, 0x72, 0xffffffffffffff00, 0xa73f40, 0xf2c9a8)
/usr/lib/go/src/internal/poll/fd_poll_runtime.go:85 +0x9b
internal/poll.(*pollDesc).waitRead(0xc4202e8598, 0xc420320000, 0x1000, 0x1000)
/usr/lib/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc4202e8580, 0xc420320000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
/usr/lib/go/src/internal/poll/fd_unix.go:157 +0x17d
net.(*netFD).Read(0xc4202e8580, 0xc420320000, 0x1000, 0x1000, 0xc4205a3978, 0x66a47a, 0xc42030ea88)
/usr/lib/go/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc420192380, 0xc420320000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
/usr/lib/go/src/net/net.go:176 +0x6a
net/http.(*connReader).Read(0xc42030ea80, 0xc420320000, 0x1000, 0x1000, 0x2, 0x0, 0x0)
/usr/lib/go/src/net/http/server.go:764 +0xf8
bufio.(*Reader).fill(0xc4202286c0)
/usr/lib/go/src/bufio/bufio.go:100 +0x11e
bufio.(*Reader).ReadSlice(0xc4202286c0, 0xc42017640a, 0xc4205a3ad8, 0x411d66, 0x7f66fd610458, 0x3, 0xc4205a3ac0)
/usr/lib/go/src/bufio/bufio.go:341 +0x2c
bufio.(*Reader).ReadLine(0xc4202286c0, 0xc4202c2300, 0x100, 0xf8, 0x9eb2c0, 0xa2cb01, 0x100c4205a3b58)
/usr/lib/go/src/bufio/bufio.go:370 +0x34
net/textproto.(*Reader).readLineSlice(0xc42036e240, 0xc4205b4c00, 0xc4205a3ba8, 0x412a48, 0x100, 0x9eb2c0)
/usr/lib/go/src/net/textproto/reader.go:55 +0x70
net/textproto.(*Reader).ReadLine(0xc42036e240, 0xc4202c2300, 0x0, 0x0, 0x72)
/usr/lib/go/src/net/textproto/reader.go:36 +0x2b
net/http.readRequest(0xc4202286c0, 0x0, 0xc4202c2300, 0x0, 0x0)
/usr/lib/go/src/net/http/request.go:929 +0x8c
net/http.(*conn).readRequest(0xc4201abf40, 0xa78340, 0xc42019eb00, 0x0, 0x0, 0x0)
/usr/lib/go/src/net/http/server.go:944 +0x16f
net/http.(*conn).serve(0xc4201abf40, 0xa78340, 0xc42019eb00)
/usr/lib/go/src/net/http/server.go:1768 +0x4dc
created by net/http.(*Server).Serve
/usr/lib/go/src/net/http/server.go:2795 +0x27b
goroutine 50907 [IO wait]:
internal/poll.runtime_pollWait(0x7f66fd609d60, 0x72, 0xc420390868)
/usr/lib/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc420262118, 0x72, 0xffffffffffffff00, 0xa73f40, 0xf2c9a8)
/usr/lib/go/src/internal/poll/fd_poll_runtime.go:85 +0x9b
internal/poll.(*pollDesc).waitRead(0xc420262118, 0xc4201ef000, 0x1000, 0x1000)
/usr/lib/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc420262100, 0xc4201ef000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
/usr/lib/go/src/internal/poll/fd_unix.go:157 +0x17d
net.(*netFD).Read(0xc420262100, 0xc4201ef000, 0x1000, 0x1000, 0xc420390978, 0x66a47a, 0xc420266a88)
/usr/lib/go/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc4201925c8, 0xc4201ef000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
/usr/lib/go/src/net/net.go:176 +0x6a
net/http.(*connReader).Read(0xc420266a80, 0xc4201ef000, 0x1000, 0x1000, 0x1, 0x17f66fd6736c8, 0x3)
/usr/lib/go/src/net/http/server.go:764 +0xf8
bufio.(*Reader).fill(0xc4204694a0)
/usr/lib/go/src/bufio/bufio.go:100 +0x11e
bufio.(*Reader).ReadSlice(0xc4204694a0, 0xc420390a0a, 0xc420390ad8, 0x411d66, 0x7f66fd60e908, 0x3, 0x7f66fd6736c8)
/usr/lib/go/src/bufio/bufio.go:341 +0x2c
bufio.(*Reader).ReadLine(0xc4204694a0, 0xc420256300, 0x100, 0xf8, 0x9eb2c0, 0x1, 0x100c42012b000)
/usr/lib/go/src/bufio/bufio.go:370 +0x34
net/textproto.(*Reader).readLineSlice(0xc4201be030, 0xc420341501, 0xc420390ba8, 0x412a48, 0x100, 0x9eb2c0)
/usr/lib/go/src/net/textproto/reader.go:55 +0x70
net/textproto.(*Reader).ReadLine(0xc4201be030, 0xc420256300, 0x0, 0x0, 0x72)
/usr/lib/go/src/net/textproto/reader.go:36 +0x2b
net/http.readRequest(0xc4204694a0, 0x0, 0xc420256300, 0x0, 0x0)
/usr/lib/go/src/net/http/request.go:929 +0x8c
net/http.(*conn).readRequest(0xc420282280, 0xa78340, 0xc4205de2c0, 0x0, 0x0, 0x0)
/usr/lib/go/src/net/http/server.go:944 +0x16f
net/http.(*conn).serve(0xc420282280, 0xa78340, 0xc4205de2c0)
/usr/lib/go/src/net/http/server.go:1768 +0x4dc
created by net/http.(*Server).Serve
/usr/lib/go/src/net/http/server.go:2795 +0x27b
Disabling LockOSThread for debugging (I don't think it's a valid fix, as it seems to break correctness guarantees in netlink API usage):
diff --git a/vendor/github.com/mdlayher/wifi/client_linux.go b/vendor/github.com/mdlayher/wifi/client_linux.go
index 4edb9bc..e9c93c6 100644
--- a/vendor/github.com/mdlayher/wifi/client_linux.go
+++ b/vendor/github.com/mdlayher/wifi/client_linux.go
@@ -36,7 +36,10 @@ type client struct {
// newClient dials a generic netlink connection and verifies that nl80211
// is available for use by this package.
func newClient() (*client, error) {
- c, err := genetlink.Dial(nil)
+ cfg := &netlink.Config{
+ NoLockThread: true,
+ }
+ c, err := genetlink.Dial(cfg)
if err != nil {
return nil, err
}
Thanks for all the debugging work you've done so far. I was curious about NoLockThread as well but haven't tried it yet. It does seem possible that we're causing some unexpected state in the runtime which could cause a crash.
Just found a very similar looking Go issue (which is about Plan 9): https://github.com/golang/go/issues/22227
I have run the following tests:
(1) is supposed to confirm the problem (again).
(2) could be an attempt at a workaround from the node_exporter (or rather mdlayher/netlink) perspective.
(3) should help in tracking down the relevant Go change.
@mdlayher: Can you provide some comments on (2)? As I understand your netlink code, you have to guarantee that the netlink calls for one (node_exporter) request always originate from the same OS thread, right? It would be safe to return this thread to the normal Go thread pool after netlink usage, right? Is the position I suggest in the diff the proper place to do that?
Next steps:
diff --git a/vendor/github.com/mdlayher/netlink/conn_linux.go b/vendor/github.com/mdlayher/netlink/conn_linux.go
index 2c9e38a..39363ca 100644
--- a/vendor/github.com/mdlayher/netlink/conn_linux.go
+++ b/vendor/github.com/mdlayher/netlink/conn_linux.go
@@ -251,6 +251,7 @@ func newSysSocket(lockThread bool) *sysSocket {
// the goroutine exits starting in Go 1.10:
// https://go-review.googlesource.com/c/go/+/46038.
runtime.LockOSThread()
+ defer runtime.UnlockOSThread()
}
defer wg.Done()
That patch seems reasonable to me because I'm not mucking with network namespaces or anything at the moment. If it provides stability, that's excellent, but I think there may still be something wrong on the runtime end if the documented Go 1.10+ behavior causes a fatal runtime error after some time.
Alright. Yes, I also think this is a Go bug, but it might make sense to provide some intermediate workaround until the bug is properly solved.
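The Lock/Unlock pairing being discussed can be shown standalone (a minimal sketch; doPinnedWork is a hypothetical helper, not netlink's actual API):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// doPinnedWork runs fn on a goroutine pinned to its OS thread, then
// unpins before the goroutine exits. This mirrors the pairing in the
// diff above: without the Unlock, Go 1.10+ terminates the underlying
// OS thread when the locked goroutine returns.
func doPinnedWork(fn func()) {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		runtime.LockOSThread()
		defer runtime.UnlockOSThread() // return the thread to the pool
		fn()
	}()
	wg.Wait()
}

func main() {
	n := 0
	for i := 0; i < 100; i++ {
		doPinnedWork(func() { n++ }) // sequential: each call waits
	}
	fmt.Println("ran", n, "pinned calls")
}
```

Note the caveat that unpinning is only safe when the pinned work leaves no thread-local state behind (e.g. no network-namespace changes), which matches mdlayher's comment above.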
Bisecting Go led to https://github.com/golang/go/commit/4f34a5291317fe959074fb40b85d9be242b8a2fc as the problematic change. Will see where this leads us.
A note from our conversation on that thread:
Keep in mind that just because bisect points at a commit does not mean the commit is bad. It could be a perfectly correct commit that is just exposing an issue that already existed. I could certainly see how thrashing OS threads more often by forcing termination when a locked goroutine exits could expose other issues (e.g. memory corruption) elsewhere.
I am curious if we have a Cgo bug somewhere in one of the collectors and the interaction with the WiFi collector is causing it all to explode.
The fact that it crashes with just the wifi collector suggests this isn't an issue with another collector. Does the node_exporter core rely on Cgo for anything?
Kind of an aside here, but given that the node exporter seems to primarily target backend systems, should we consider disabling the WiFi collector by default, at least until the issue is resolved?
While I would like to turn it off, I think it would make too many eyes stop looking into it and it would never get fixed.
I'll consider it.
Reading up on golang/go#25128, it seems @mdlayher has applied a fix to netlink in the form of https://github.com/mdlayher/netlink/commit/3d8cc9af89458b8b6a9ce675d4dfa50a724e9025 that causes the error to disappear.
Even though the WiFi collector is now disabled by default it seems worthwhile to upgrade to at least that commit so that those that do need the WiFi collector and enable it can have a more stable experience.
@daenney Now that we isolated this issue a bit further, maybe it makes sense to update the title and description of this issue since it seems like it might take a while to get this fixed upstream and released.
@discordianfish Mmm, indeed. I updated it to "when WiFi collector is enabled", only to realise that this is no longer the case: with the changes @mdlayher has made, the issue appears resolved.
Since there is an upstream issue about the particulars of this bug, should we consider closing this? Is there anything left to do from a Prometheus/node_exporter side of things?
@daenney Agree, probably is fine to close this.
Host operating system: output of
uname -a
node_exporter version: output of
node_exporter --version
Used the release artifact at: https://github.com/prometheus/node_exporter/releases/download/v0.16.0-rc.0/node_exporter-0.16.0-rc.0.linux-amd64.tar.gz
node_exporter command line flags
None, the defaults for 0.16 match my needs
Are you running node_exporter in Docker?
No
What did you do that produced an error?
Just ran it for a couple of days
What did you expect to see?
It not to crash
What did you see instead?
That fatal error line got spewed about 1,000 times, all logged at 19:47:54 according to systemd.