prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0

fatal error: runtime.unlock: lock count with Go >= 1.10 #870

daenney closed this issue 5 years ago

daenney commented 6 years ago

Host operating system: output of uname -a

Linux endor 4.13.0-37-generic #42~16.04.1-Ubuntu SMP Wed Mar 7 16:03:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 0.16.0-rc.0 (branch: HEAD, revision: 002c1ca02917406cbecc457162e2bdb1f29c2f49)
  build user:       root@5ff5455ac873
  build date:       20180309-15:09:26
  go version:       go1.10

Used the release artifact at: https://github.com/prometheus/node_exporter/releases/download/v0.16.0-rc.0/node_exporter-0.16.0-rc.0.linux-amd64.tar.gz

node_exporter command line flags

None, the defaults for 0.16 match my needs

Are you running node_exporter in Docker?

No

What did you do that produced an error?

Just ran it for a couple of days

What did you expect to see?

For it not to crash

What did you see instead?

Mar 28 19:47:54 endor node_exporter[18076]: fatal error: runtime·unlock: lock count
Mar 28 19:48:03 endor systemd[1]: node_exporter.service: Main process exited, code=killed, status=11/SEGV

That fatal error line got spewed about 1,000 times, all logged at 19:47:54 according to systemd.

austin987 commented 6 years ago

@daenney ah, so it is; I didn't read closely enough. Should I file a new issue for that?

daenney commented 6 years ago

@austin987 I'm not sure. Might be good for visibility though, so we don't forget about tracking that one down too; otherwise we risk it getting buried in the rest of this issue. They feel related to me.

mschroeder223 commented 6 years ago

As I posted previously on April 2nd, I disabled infiniband, zfs, wifi, nfs and nfsd, and I've had rc.1 running on multiple servers with no issues since then. That would lead me to believe the issue lies in one of those collectors.

daenney commented 6 years ago

Those collectors are active on my system but aren't actually doing anything, since I don't have any infiniband, ZFS or nfs(d) active. Each one bails out in the first check of its Update(), which at least in my case makes them unlikely to be involved in that crash.

I do have WiFi, since there's a built-in WiFi adapter in this box. The only metric collected is node_wifi_interface_frequency_hertz; it's not actually active, scanning, or connected to any network. Looking at that code I'm having a hard time seeing how it would cause such an error. The one stack trace we do seem to have, from @MorphBonehunter, doesn't mention anything related to the wifi collector in the crash, which I would expect to see if it were somehow involved.

kielfriedt commented 6 years ago

Any progress on this? Any temporary fix?

SuperQ commented 6 years ago

The temporary fix is to build with Go 1.9.x.

daenney commented 6 years ago

I'm honestly a bit stumped though. I'm not sure where to go from here, and golang/go#25128 kinda feels dead in the water (though I suppose that has to do with things ramping up to Go 1.11).

If anyone has suggestions on what to try, I'm all ears. I can build/run whatever and try to crash it, but it would help if we had some kind of strategy to narrow it down.

I guess we could run a bisect between Go 1.9.x and 1.10 and see when the crash starts to appear?

discordianfish commented 6 years ago

@daenney Yes, a bisect between 1.9.x and 1.10 is what I would suggest too. Would be awesome if you could look into this!

daenney commented 6 years ago

Alright, looks like something started moving in golang/go#25128 and with golang/go@4991bc6257a9e9d922f7b6e29e393d764c4e4295 we might be able to get a proper error trace out of it. I'll see about giving that a try first before I dive into bisecting Go 1.9 to 1.10, as that'll probably take a long time to narrow down given that it takes some time to trigger the error.

egonelbre commented 6 years ago

I stumbled onto this issue by accident and took a quick look around the code, so ignore me if I say something completely stupid, since I have no in-depth understanding of it.

When there's a VM and/or a heisenbug with varying timing, I usually suspect time drift problems. There are ways to force time problems, e.g. ntpdsim, but I didn't get to that. Anyway, this suspicion led me to https://github.com/prometheus/node_exporter/blob/b62c7bc0ad37373b911fa45937348f3c594ce19d/collector/ntp.go#L148... but shouldn't leapMidnight be protected somehow? Because, as far as I can understand, Update may be called concurrently.
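
Something like the following is what I have in mind: a rough sketch only (not the actual node_exporter code, and the helper names are made up) of guarding the shared value with a mutex so that concurrent Update calls stay safe.

package collector

import (
    "sync"
    "time"
)

var (
    leapMu       sync.Mutex // guards leapMidnight, since Update may run concurrently
    leapMidnight time.Time
)

func setLeapMidnight(t time.Time) {
    leapMu.Lock()
    defer leapMu.Unlock()
    leapMidnight = t
}

func getLeapMidnight() time.Time {
    leapMu.Lock()
    defer leapMu.Unlock()
    return leapMidnight
}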

SuperQ commented 6 years ago

@egonelbre Good catch, yes, that does appear to need a mutex on it. Probably unrelated to this specific crash, as we're pretty sure it's in the CGO code, and that collector is pure Go. I'll file a separate issue for that.

daenney commented 6 years ago

Has anyone tried building this with Go 1.11? I'm short on time right now to try and do a full bisect between Go 1.9 and 1.10, but if it happens to be gone with 1.11 I'd be good with that :smile:.

chrj commented 6 years ago

I will try running under the Go 1.11 beta and put our node_exporters up for the test. We have had 1-2 crashes per week on ~70 instances. I'll report back in a week.

marcan commented 6 years ago

Oh hey, you might remember me from that other time node_exporter found a Golang bug. I just hit this one too:

fatal error: runtime·unlock: lock count
fatal error: runtime·unlock: lock count
fatal error: runtime·unlock: lock count
fatal error: runtime·unlock: lock count
fatal error: runtime·unlock: lock count
fatal error: runtime·unlock: lock count
fatal error: runtime·unlock: lock count
fatal error: runtime·unlock: lock count
fatal error: runtime·unlock: lock count
fatal error: runtime·unlock: lock count
fatal error: runtime·unlock: lock count
fatal error: runtime·unlock: lock count
fatal error: fatal error: runtime·unlock: lock countruntime·lock: lock count

panic during panic
stack trace unavailable

runtime stack:

This is Gentoo again, versions:

node_exporter, version 0.16.0 (branch: non-git, revision: d42bd70)
  build user:       portage@binhost
  build date:       20180619-21:57:10
  go version:       go1.10.1

Unfortunately this just happened on a VM with many siblings running on the same hardware, after days of uptime, and I will only have access to this hardware for a couple more days, so it's not looking good for reproducibility. However, I'll be running these VM images at home and see if I can get it to repro on other hardware.

SuperQ commented 6 years ago

@marcan Welcome back! :joy:

It's reasonably easy to reproduce. Slamming the node_exporter with vegeta at like 3 QPS will crash it in ~30 min.

But nobody's been able to bisect enough things or get enough trace info to figure it out so far. :(
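
If you don't have vegeta handy, here's a rough sketch of an equivalent load generator (the listen address and the ~3 requests/second rate are just the ones mentioned above, nothing canonical):

package main

import (
    "io"
    "io/ioutil"
    "net/http"
    "time"
)

func main() {
    // Scrape the local node_exporter roughly 3 times per second, similar to
    // the vegeta-based reproduction described above.
    for range time.Tick(time.Second / 3) {
        go func() {
            resp, err := http.Get("http://localhost:9100/metrics")
            if err != nil {
                return // the exporter may already have crashed
            }
            io.Copy(ioutil.Discard, resp.Body) // drain the body so the connection can be reused
            resp.Body.Close()
        }()
    }
}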

marcan commented 6 years ago

Just in case hardware is related to reproducibility, I'll dump my stack info here:

(32-thread host)

(16-CPU guest)

I'll be testing the same exact guest VM image again later, but I won't be able to get ahold of identical server hardware (but I can try on a bunch of different hosts I do have access to). Hopefully I can spend some time tracking it down. Unfortunately it's less likely I'll be able to find time to do it on the original hardware where I saw the crash, before I lose access.

SuperQ commented 6 years ago

So far, it doesn't seem hardware related; it only looks like a CGO issue with Go 1.10.

chrj commented 6 years ago

Unfortunately, I'm also seeing the issue with Go 1.11beta2.

marcan commented 6 years ago

Well, this just happened on another server of mine, so reproducibility won't be an issue. This time it spat out >95000 lines (!) of "fatal error: runtime·unlock: lock count" into the log.

node_exporter, version 0.16.0 (branch: non-git, revision: d42bd70)
  build user:       portage@binhost
  build date:       20180717-22:35:36
  go version:       go1.10.3

Will debug later when I get a chance.

hoffie commented 6 years ago

The most minimal example which still crashes for me is running with only the wifi collector enabled.

None of the node_exporters with the wifi collector disabled have crashed for me so far (but it's hard to tell if it just takes longer or eliminates the problem entirely). Do we have any reports of crashes with the wifi collector disabled? I only see suspicions of the wifi being possibly the culprit: https://github.com/prometheus/node_exporter/issues/870#issuecomment-397692373

Also note: Sometimes it crashes within seconds, sometimes it takes >90min. Linux localhost 4.17.5-1-ARCH #1 SMP PREEMPT Sun Jul 8 17:27:31 UTC 2018 x86_64 GNU/Linux, go version go1.10.3 linux/amd64, node_exporter master @ 80a5712b97cd649b6bf2f9f60cf5c497043ef187. Also, the crash messages have been totally different (e.g. systemstack called from unexpected goroutine, fatal error: malloc deadlock). All contained the lock count error somewhere. Sometimes, crashes manifest in a few lines, sometimes in thousands, sometimes with "clean" stack traces, sometimes with stack traces mixed with other error messages.

Will try further tracking it down by trying different go versions and/or bisecting the wifi collector code, as time permits.

$ ./node_exporter $(for coll in arp bcache bonding conntrack cpu diskstats edac entropy filefd filesystem hwmon infiniband ipvs loadavg mdadm meminfo netdev netstat nfs nfsd sockstat stat textfile time timex uname vmstat netclass xfs zfs; do echo --no-collector.$coll; done) --web.listen-address=:9101
INFO[0000] Starting node_exporter (version=0.16.0, branch=fix-staticcheck, revision=80a5712b97cd649b6bf2f9f60cf5c497043ef187)  source="node_exporter.go:82"
INFO[0000] Build context (go=go1.10.3, user=christian@wuechoo, date=20180729-18:47:59)  source="node_exporter.go:83"
INFO[0000] Enabled collectors:                           source="node_exporter.go:90"
INFO[0000]  - wifi                                       source="node_exporter.go:97"
INFO[0000] Listening on :9101                            source="node_exporter.go:111"
[...]
fatal error: runtime·unlock: lock count
fatal: morestack on g0

SuperQ commented 6 years ago

@hoffie Interesting, with the wifi collector on, do you get goroutine leaks like was reported in https://github.com/prometheus/node_exporter/issues/1008?

Wifi collector is the only one using netlink, which does use syscall.

/cc @mdlayher I don't think this is your fault, but thought you might want to be aware.

hoffie commented 6 years ago

@SuperQ: No goroutine leaks that I would have spotted (no increased memory usage or anything). The go_goroutines metric seems to be in line with all my other node_exporter instances with other collectors enabled (8-14 goroutines) while being hammered by vegeta at 50 QPS.

Please also note that my results / suggested conclusion could still be bad luck -- i.e. it could still be that "node_exporter without any collectors" or "node_exporter with at least one collector" would crash as well. For me, they haven't. The only certain information is that "node_exporter with wifi collector crashes". I should also say that my machine does not have wifi devices at all, so node_exporter's output is limited to (as expected):

node_scrape_collector_duration_seconds{collector="wifi"} 0.000785804
node_scrape_collector_success{collector="wifi"} 1

Edit: Just noticed that the netlink package also contains ASM code for setsockopt. I have no knowledge about ASM, but it seems possible to implement this using the more standard, maybe less unsafe, Syscall6 call. There also seem to be wrappers for that. From a quick glance, it doesn't seem like this code path is hit at all, though.
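
To make that concrete, this is roughly what I mean; a sketch only, not the netlink package's actual code, and the fd/level/opt plumbing here is made up:

package sockopt

import (
    "syscall"
    "unsafe"
)

// setsockoptInt issues SYS_SETSOCKOPT through the generic Syscall6 wrapper
// instead of hand-written assembly. Purely illustrative.
func setsockoptInt(fd, level, opt, value int) error {
    v := int32(value)
    _, _, errno := syscall.Syscall6(
        syscall.SYS_SETSOCKOPT,
        uintptr(fd),
        uintptr(level),
        uintptr(opt),
        uintptr(unsafe.Pointer(&v)),
        unsafe.Sizeof(v),
        0,
    )
    if errno != 0 {
        return errno
    }
    return nil
}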

mdlayher commented 6 years ago

I have a couple of ideas I can try to investigate tomorrow, but in the meantime, disabling the WiFi collector seems reasonable.

mdlayher commented 6 years ago

At its core, the WiFi collector is more or less doing something like this on every scrape cycle:

package main

import (
    "fmt"
    "runtime"
    "sync"

    "github.com/mdlayher/genetlink"
)

func main() {
    workers := runtime.NumCPU()

    var wg sync.WaitGroup
    wg.Add(workers)

    for i := 0; i < workers; i++ {
        go func() {
            defer wg.Done()

            for {
                work()
            }
        }()
    }

    wg.Wait()
}

func work() {
    c, err := genetlink.Dial(nil)
    if err != nil {
        panicf("failed to dial: %v", err)
    }
    defer c.Close()

    if _, err := c.GetFamily("nl80211"); err != nil {
        panicf("failed to dial: %v", err)
    }
}

func panicf(format string, a ...interface{}) {
    panic(fmt.Sprintf(format, a...))
}

So I'm running this program to see if I can trigger a crash on my machine:

matt@nerr-2:~$ uname -a
Linux nerr-2 4.15.0-29-generic #31-Ubuntu SMP Tue Jul 17 15:39:52 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
matt@nerr-2:~$ go version
go version go1.10.3 linux/amd64

No luck so far though.

The only thing that comes to mind with that "lock count" message would be a behavior I am using in Go 1.10+: https://golang.org/doc/go1.10#runtime.

As far as I know, this is being done correctly: https://github.com/mdlayher/netlink/blob/master/conn_linux.go#L316-L328.

Hence, if there's an issue, I would think that my minimal program could reproduce it without the need to pull in node_exporter.
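
For anyone following along, the general shape of that 1.10+ behavior is something like the sketch below: a dedicated goroutine locks its OS thread and serves work sent over a channel, and because it never unlocks, the runtime terminates that thread when the goroutine exits instead of handing it back to the scheduler. This is only an illustration of the pattern, not the netlink package's actual implementation.

package main

import (
    "fmt"
    "runtime"
)

// request is one unit of work to run on the locked OS thread.
type request struct {
    fn   func() error
    done chan error
}

// newWorker starts a goroutine that pins itself to an OS thread and executes
// requests serially on that thread.
func newWorker() chan<- request {
    reqs := make(chan request)
    go func() {
        // Never unlocked: in Go 1.10+ the OS thread is terminated when this
        // goroutine returns, rather than being reused by the scheduler.
        runtime.LockOSThread()
        for r := range reqs {
            r.done <- r.fn()
        }
    }()
    return reqs
}

func main() {
    reqs := newWorker()
    done := make(chan error)
    reqs <- request{fn: func() error {
        // A real implementation would issue netlink syscalls here.
        fmt.Println("running on the locked OS thread")
        return nil
    }, done: done}
    if err := <-done; err != nil {
        fmt.Println("error:", err)
    }
    close(reqs) // the worker goroutine exits, taking its OS thread with it
}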

mdlayher commented 6 years ago

I've run my stress test on a beefy VM today and didn't run into any problems. At first I wondered if it was because I was on Go 1.10.3, but it appears above that 0.16.0 is also based on Go 1.10.3.

I'll try with node_exporter itself tomorrow.

hoffie commented 6 years ago

@mdlayher: Your example code runs stable for me, too. I also tried extracting more of node_exporter into a standalone binary and testing it in a similar way -- no crash so far.

I built a node_exporter with LockOSThread usage disabled (see diff below). It has been running stable for 150 minutes+.

In my tests, crashes happened faster when I set GOMAXPROCS=2. This may be just luck, though.

Overall, these results make me believe that it may be some runtime.LockOSThread interaction with Go scheduling (goroutines, channels).

Will try building Go with the relevant 1.10 change reverted and will test with unchanged node_exporter source code/deps again.

During today's crashes I've seen some more variations in error messages/stack traces. Sometimes the lock count one doesn't appear at all.

fatal error: runtime·unlock: lock count
fatal: morestack on g0
INFO[0000] Listening on :9101                            source="node_exporter.go:111"
 fatal error: schedule: holding locks

runtime stack:
runtime.throw(0xa0c651, 0x17)
    /usr/lib/go/src/runtime/panic.go:616 +0x81
runtime.schedule()
    /usr/lib/go/src/runtime/proc.go:2489 +0x351
runtime.goschedImpl(0xc420162300)
    /usr/lib/go/src/runtime/proc.go:2619 +0xf7
runtime.gosched_m(0xc420162300)
    /usr/lib/go/src/runtime/proc.go:2627 +0x36
runtime.mcall(0x1007f66f4ff8a4e)
    /usr/lib/go/src/runtime/asm_amd64.s:351 +0x5b

goroutine 1 [IO wait, 19 minutes]:
internal/poll.runtime_pollWait(0x7f66fd609f00, 0x72, 0x0)
    /usr/lib/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc4202e8518, 0x72, 0xc42019e000, 0x0, 0x0)
    /usr/lib/go/src/internal/poll/fd_poll_runtime.go:85 +0x9b
internal/poll.(*pollDesc).waitRead(0xc4202e8518, 0xffffffffffffff00, 0x0, 0x0)
    /usr/lib/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Accept(0xc4202e8500, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /usr/lib/go/src/internal/poll/fd_unix.go:372 +0x1a8
net.(*netFD).accept(0xc4202e8500, 0xc420282300, 0xc4205a1be0, 0x402ee8)
    /usr/lib/go/src/net/fd_unix.go:238 +0x42
net.(*TCPListener).accept(0xc420192370, 0xc4205a1c10, 0x401e47, 0xc420282300)
    /usr/lib/go/src/net/tcpsock_posix.go:136 +0x2e
net.(*TCPListener).AcceptTCP(0xc420192370, 0xc4205a1c58, 0xc4205a1c60, 0x18)
    /usr/lib/go/src/net/tcpsock.go:246 +0x49
net/http.tcpKeepAliveListener.Accept(0xc420192370, 0xa2cd30, 0xc420282280, 0xa78400, 0xc42030e990)
    /usr/lib/go/src/net/http/server.go:3216 +0x2f
net/http.(*Server).Serve(0xc420314270, 0xa780c0, 0xc420192370, 0x0, 0x0)
    /usr/lib/go/src/net/http/server.go:2770 +0x1a5
net/http.(*Server).ListenAndServe(0xc420314270, 0xc420314270, 0x2)
    /usr/lib/go/src/net/http/server.go:2711 +0xa9
net/http.ListenAndServe(0x7fffbd1df4a5, 0x5, 0x0, 0x0, 0x1, 0xc42030a7e0)
    /usr/lib/go/src/net/http/server.go:2969 +0x7a
main.main()
    /home/christian/go/src/github.com/prometheus/node_exporter/node_exporter.go:112 +0x9cf

goroutine 38 [IO wait, 12 minutes]:
internal/poll.runtime_pollWait(0x7f66fd609e30, 0x72, 0xc4205a3868)
    /usr/lib/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc4202e8598, 0x72, 0xffffffffffffff00, 0xa73f40, 0xf2c9a8)
    /usr/lib/go/src/internal/poll/fd_poll_runtime.go:85 +0x9b
internal/poll.(*pollDesc).waitRead(0xc4202e8598, 0xc420320000, 0x1000, 0x1000)
    /usr/lib/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc4202e8580, 0xc420320000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/lib/go/src/internal/poll/fd_unix.go:157 +0x17d
net.(*netFD).Read(0xc4202e8580, 0xc420320000, 0x1000, 0x1000, 0xc4205a3978, 0x66a47a, 0xc42030ea88)
    /usr/lib/go/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc420192380, 0xc420320000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/lib/go/src/net/net.go:176 +0x6a
net/http.(*connReader).Read(0xc42030ea80, 0xc420320000, 0x1000, 0x1000, 0x2, 0x0, 0x0)
    /usr/lib/go/src/net/http/server.go:764 +0xf8
bufio.(*Reader).fill(0xc4202286c0)
    /usr/lib/go/src/bufio/bufio.go:100 +0x11e
bufio.(*Reader).ReadSlice(0xc4202286c0, 0xc42017640a, 0xc4205a3ad8, 0x411d66, 0x7f66fd610458, 0x3, 0xc4205a3ac0)
    /usr/lib/go/src/bufio/bufio.go:341 +0x2c
bufio.(*Reader).ReadLine(0xc4202286c0, 0xc4202c2300, 0x100, 0xf8, 0x9eb2c0, 0xa2cb01, 0x100c4205a3b58)
    /usr/lib/go/src/bufio/bufio.go:370 +0x34
net/textproto.(*Reader).readLineSlice(0xc42036e240, 0xc4205b4c00, 0xc4205a3ba8, 0x412a48, 0x100, 0x9eb2c0)
    /usr/lib/go/src/net/textproto/reader.go:55 +0x70
net/textproto.(*Reader).ReadLine(0xc42036e240, 0xc4202c2300, 0x0, 0x0, 0x72)
    /usr/lib/go/src/net/textproto/reader.go:36 +0x2b
net/http.readRequest(0xc4202286c0, 0x0, 0xc4202c2300, 0x0, 0x0)
    /usr/lib/go/src/net/http/request.go:929 +0x8c
net/http.(*conn).readRequest(0xc4201abf40, 0xa78340, 0xc42019eb00, 0x0, 0x0, 0x0)
    /usr/lib/go/src/net/http/server.go:944 +0x16f
net/http.(*conn).serve(0xc4201abf40, 0xa78340, 0xc42019eb00)
    /usr/lib/go/src/net/http/server.go:1768 +0x4dc
created by net/http.(*Server).Serve
    /usr/lib/go/src/net/http/server.go:2795 +0x27b

goroutine 50907 [IO wait]:
internal/poll.runtime_pollWait(0x7f66fd609d60, 0x72, 0xc420390868)
    /usr/lib/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc420262118, 0x72, 0xffffffffffffff00, 0xa73f40, 0xf2c9a8)
    /usr/lib/go/src/internal/poll/fd_poll_runtime.go:85 +0x9b
internal/poll.(*pollDesc).waitRead(0xc420262118, 0xc4201ef000, 0x1000, 0x1000)
    /usr/lib/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc420262100, 0xc4201ef000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/lib/go/src/internal/poll/fd_unix.go:157 +0x17d
net.(*netFD).Read(0xc420262100, 0xc4201ef000, 0x1000, 0x1000, 0xc420390978, 0x66a47a, 0xc420266a88)
    /usr/lib/go/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc4201925c8, 0xc4201ef000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/lib/go/src/net/net.go:176 +0x6a
net/http.(*connReader).Read(0xc420266a80, 0xc4201ef000, 0x1000, 0x1000, 0x1, 0x17f66fd6736c8, 0x3)
    /usr/lib/go/src/net/http/server.go:764 +0xf8
bufio.(*Reader).fill(0xc4204694a0)
    /usr/lib/go/src/bufio/bufio.go:100 +0x11e
bufio.(*Reader).ReadSlice(0xc4204694a0, 0xc420390a0a, 0xc420390ad8, 0x411d66, 0x7f66fd60e908, 0x3, 0x7f66fd6736c8)
    /usr/lib/go/src/bufio/bufio.go:341 +0x2c
bufio.(*Reader).ReadLine(0xc4204694a0, 0xc420256300, 0x100, 0xf8, 0x9eb2c0, 0x1, 0x100c42012b000)
    /usr/lib/go/src/bufio/bufio.go:370 +0x34
net/textproto.(*Reader).readLineSlice(0xc4201be030, 0xc420341501, 0xc420390ba8, 0x412a48, 0x100, 0x9eb2c0)
    /usr/lib/go/src/net/textproto/reader.go:55 +0x70
net/textproto.(*Reader).ReadLine(0xc4201be030, 0xc420256300, 0x0, 0x0, 0x72)
    /usr/lib/go/src/net/textproto/reader.go:36 +0x2b
net/http.readRequest(0xc4204694a0, 0x0, 0xc420256300, 0x0, 0x0)
    /usr/lib/go/src/net/http/request.go:929 +0x8c
net/http.(*conn).readRequest(0xc420282280, 0xa78340, 0xc4205de2c0, 0x0, 0x0, 0x0)
    /usr/lib/go/src/net/http/server.go:944 +0x16f
net/http.(*conn).serve(0xc420282280, 0xa78340, 0xc4205de2c0)
    /usr/lib/go/src/net/http/server.go:1768 +0x4dc
created by net/http.(*Server).Serve
    /usr/lib/go/src/net/http/server.go:2795 +0x27b

Disabling LockOSThread for debugging (I don't think it's a valid fix, as it seems to break correctness guarantees in netlink API usage):

diff --git a/vendor/github.com/mdlayher/wifi/client_linux.go b/vendor/github.com/mdlayher/wifi/client_linux.go
index 4edb9bc..e9c93c6 100644
--- a/vendor/github.com/mdlayher/wifi/client_linux.go
+++ b/vendor/github.com/mdlayher/wifi/client_linux.go
@@ -36,7 +36,10 @@ type client struct {
 // newClient dials a generic netlink connection and verifies that nl80211
 // is available for use by this package.
 func newClient() (*client, error) {
-       c, err := genetlink.Dial(nil)
+       cfg := &netlink.Config{
+               NoLockThread: true,
+       }
+       c, err := genetlink.Dial(cfg)
        if err != nil {
                return nil, err
        }

mdlayher commented 6 years ago

Thanks for all the debugging work you've done so far. I was curious about NoLockThread as well but haven't given it a try yet. It does seem possible that we're causing some unexpected state in the runtime, which could cause a crash.

hoffie commented 6 years ago

Just found a very similar looking Go issue (which is about Plan 9): https://github.com/golang/go/issues/22227

I have run the following tests:

  1. node_exporter (vanilla) built with go1.10.3 (vanilla, built from source). First crash after 5 minutes. Second crash after 3 minutes. Even more afterwards.
  2. node_exporter with the below diff built with go1.10.3 (vanilla, built from source). Ran stable for an hour.
  3. node_exporter (vanilla) built with go at revision 555c16d8cbd8372c1f57beb059a23b56bf1e909f (last commit before the LockOSThread changes). Ran stable for an hour.

(1) is supposed to confirm the problem (again); (2) could be an attempt at a workaround from the node_exporter (or rather mdlayher/netlink) perspective; (3) should help in tracking down the relevant Go change.

@mdlayher: Can you provide some comments around (2)? As I understand your netlink code, you have to guarantee that the netlink calls from one (node_exporter) request always originate from the same OS thread, right? It would be safe to return this thread to the normal Go thread pool after netlink usage, right? Is the position I suggest in the diff the proper place to do that?

Next steps:


diff --git a/vendor/github.com/mdlayher/netlink/conn_linux.go b/vendor/github.com/mdlayher/netlink/conn_linux.go
index 2c9e38a..39363ca 100644
--- a/vendor/github.com/mdlayher/netlink/conn_linux.go
+++ b/vendor/github.com/mdlayher/netlink/conn_linux.go
@@ -251,6 +251,7 @@ func newSysSocket(lockThread bool) *sysSocket {
                        // the goroutine exits starting in Go 1.10:
                        // https://go-review.googlesource.com/c/go/+/46038.
                        runtime.LockOSThread()
+                       defer runtime.UnlockOSThread()
                }

                        defer wg.Done()

mdlayher commented 6 years ago

That patch seems reasonable to me because I'm not mucking with network namespaces or anything at the moment. If it provides stability, that's excellent, but I think there may still be something wrong on the runtime end if the documented Go 1.10+ behavior causes a fatal runtime error after some time.

hoffie commented 6 years ago

Alright. Yes, I also think this is a Go bug, but it might make sense to provide some intermediate workaround until the bug is properly solved.

Bisecting Go led to https://github.com/golang/go/commit/4f34a5291317fe959074fb40b85d9be242b8a2fc as the problematic change. Will see where this leads us.

mdlayher commented 6 years ago

A note from our conversation on that thread:

Keep in mind that just because bisect points at a commit does not mean the commit is bad. It could be a perfectly correct commit that is just exposing an issue that already existed. I could certainly see how thrashing OS threads more often by forcing termination when a locked goroutine exits could expose other issues (e.g. memory corruption) elsewhere.

I am curious if we have a Cgo bug somewhere in one of the collectors and the interaction with the WiFi collector is causing it all to explode.

marcan commented 6 years ago

The fact that it crashes with just the wifi collector suggests this isn't an issue with another collector. Does the node_exporter core rely on Cgo for anything?

daenney commented 6 years ago

Kind of an aside here, but given that the node exporter seems to primarily target backend systems, should we consider disabling the WiFi collector by default, at least until the issue is resolved?

SuperQ commented 6 years ago

While I would like to turn it off, I think it would make too many eyes stop looking into it and it would never get fixed.

I'll consider it.

daenney commented 6 years ago

Reading up on golang/go#25128, it seems @mdlayher has applied a fix to netlink in the form of https://github.com/mdlayher/netlink/commit/3d8cc9af89458b8b6a9ce675d4dfa50a724e9025 that causes the error to disappear.

Even though the WiFi collector is now disabled by default, it seems worthwhile to upgrade to at least that commit, so that those who do need the WiFi collector and enable it can have a more stable experience.

discordianfish commented 6 years ago

@daenney Now that we've isolated this issue a bit further, maybe it makes sense to update the title and description of this issue, since it seems like it might take a while to get this fixed upstream and released.

daenney commented 6 years ago

@discordianfish Mmm, indeed. I updated it to "when WiFi collector is enabled" only to realise that this is no longer the case, as with the changes @mdlayher has made the issue appears resolved.

Since there is an upstream issue about the particulars of this bug, should we consider closing this? Is there anything left to do from a Prometheus/node_exporter side of things?

discordianfish commented 5 years ago

@daenney Agreed, it's probably fine to close this.