golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
122.95k stars 17.53k forks source link

runtime: bad p & random crashes since 1.14 on linux 3.18.11 #40722

Open gwillem opened 4 years ago

gwillem commented 4 years ago

Summary: a simple test program built with go1.14 & go1.15 crashes randomly (~92% cases) on a specific Linux server. No problem with 1.13.15.

What version of Go are you using (go version)?

$ go version
go version go1.14 linux/amd64

Does this issue reproduce with the latest release?

Working builds: 1.13.7 1.13.15 Crashing builds: 1.14 1.14.4 1.14.6 1.15

What operating system and processor architecture are you using (go env)?

Build environment:

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/willem/.cache/go-build"
GOENV="/home/willem/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/willem/code/golang/pkg/mod"
GONOPROXY="github.com/gwillem/ecomscan"
GONOSUMDB="github.com/gwillem/ecomscan"
GOOS="linux"
GOPATH="/home/willem/code/golang"
GOPRIVATE="github.com/gwillem/ecomscan"
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/opt/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/opt/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build810063723=/tmp/go-build -gno-record-gcc-switches"

Runtime environment:

Linux myserver 3.18.11-vs2.3.7.4-magestack-6 #11 SMP Mon Oct 9 17:49:51 BST 2017 x86_64 GNU/Linux
Debian GNU/Linux 8.9

$ free
             total       used       free     shared    buffers     cached
Mem:      16289352     409000   15880352     429172          0     267272
-/+ buffers/cache:     141728   16147624
Swap:      2932732          0    2932732

$ cat /proc/cpuinfo
[...]
processor   : 7
vendor_id   : GenuineIntel
cpu family  : 6
model       : 94
model name  : Intel(R) Xeon(R) CPU E3-1240 v5 @ 3.50GHz
stepping    : 3
microcode   : 0xb2
cpu MHz     : 3501.000
cache size  : 8192 KB
physical id : 0
siblings    : 8
core id     : 3
cpu cores   : 4
apicid      : 7
initial apicid  : 7
fpu     : yes
fpu_exception   : yes
cpuid level : 22
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves
bugs        :
bogomips    : 7008.02
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

What did you do?

crashtest.go:

package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
)

const url = "https://www.google.com/robots.txt"

func main() {
    resp, err := http.Get(url)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        panic(err)
    }

    fmt.Println(string(body))
    fmt.Println("no error")
}

Test runner:

for x in `seq 1 10`; do echo "$(date) $(./crashtest |& head -1)"; done

What did you see instead?

$ for x in `seq 1 10`; do echo "$(date) $(./crashtest |& head -1)"; done
Wed Aug 12 12:30:27 BST 2020 User-agent: *
Wed Aug 12 12:30:27 BST 2020 fatal error: runtimer: bad p
Wed Aug 12 12:30:27 BST 2020 fatal error: missing stack in shrinkstack
Wed Aug 12 12:30:27 BST 2020 User-agent: *
Wed Aug 12 12:30:27 BST 2020 User-agent: *
Wed Aug 12 12:30:27 BST 2020 fatal error: runtimer: bad p
Wed Aug 12 12:30:27 BST 2020 fatal error: runtimer: bad p
Wed Aug 12 12:30:27 BST 2020 User-agent: *
Wed Aug 12 12:30:27 BST 2020 User-agent: *
Wed Aug 12 12:30:27 BST 2020 

When ran with a few hundred tests, it crashes in ~92% of cases. As you can see, the error message differs. For 200 tests, these are the counted first lines of the errors:

108 fatal error: unexpected signal during runtime execution
67 fatal error: runtimer: bad p
16 <no error>
5 fatal error: missing stack in shrinkstack
1 fatal error: dodeltimer0: wrong P
1 fatal: systemstack called from unexpected goroutineSIGTRAP: trace trap
1 panic: runtime error: invalid memory address or nil pointer dereference
1 runtime: nelems=21 nalloc=3 previous allocCount=2 nfreed=65535

Sample stacktrace:

$ ./crashtest 
fatal error: runtimer: bad p
fatal error: unexpected signal during runtime execution

runtime stack:
runtime.throw(0x6bc391, 0xf)
    /opt/go/src/runtime/panic.go:1116 +0x72
runtime.runtimer(0xc000024800, 0x10e54e6522d4816, 0x446357)
    /opt/go/src/runtime/time.go:745 +0x28c
runtime.checkTimers(0xc000024800, 0x10e54e6522d4816, 0x0, 0x0, 0xc000020000)
    /opt/go/src/runtime/proc.go:2768 +0x12c
runtime.findrunnable(0xc000022000, 0x0)
    /opt/go/src/runtime/proc.go:2275 +0x1aa
runtime.schedule()
    /opt/go/src/runtime/proc.go:2669 +0x2d7
runtime.park_m(0xc00016e000)
    /opt/go/src/runtime/proc.go:2837 +0x9d
runtime.mcall(0x0)
    /opt/go/src/runtime/asm_amd64.s:318 +0x5b

goroutine 1 [select]:
net/http.(*Transport).getConn(0x8777c0, 0xc000020300, 0x0, 0x6c26fd, 0x5, 0xc000016300, 0x12, 0x0, 0x0, 0x0, ...)
    /opt/go/src/net/http/transport.go:1347 +0x5ac
net/http.(*Transport).roundTrip(0x8777c0, 0xc00013a000, 0x30, 0xc000076bd0, 0x7fe9fa005528)
    /opt/go/src/net/http/transport.go:569 +0x77c
net/http.(*Transport).RoundTrip(0x8777c0, 0xc00013a000, 0x8777c0, 0x0, 0x0)
    /opt/go/src/net/http/roundtrip.go:17 +0x35
net/http.send(0xc00013a000, 0x70e540, 0x8777c0, 0x0, 0x0, 0x0, 0xc000010030, 0x203000, 0x1, 0x0)
    /opt/go/src/net/http/client.go:252 +0x453
net/http.(*Client).send(0x87cdc0, 0xc00013a000, 0x0, 0x0, 0x0, 0xc000010030, 0x0, 0x1, 0xc00013a000)
    /opt/go/src/net/http/client.go:176 +0xff
net/http.(*Client).do(0x87cdc0, 0xc00013a000, 0x0, 0x0, 0x0)
    /opt/go/src/net/http/client.go:718 +0x45f
net/http.(*Client).Do(...)
    /opt/go/src/net/http/client.go:586
net/http.(*Client).Get(0x87cdc0, 0x6c26fd, 0x21, 0xc00004df48, 0x4465ca, 0x83fe00)
    /opt/go/src/net/http/client.go:475 +0xbe
net/http.Get(...)
    /opt/go/src/net/http/client.go:447
main.main()
    /home/willem/test/crashtest.go:12 +0x54

goroutine 6 [chan receive]:
net/http.(*persistConn).addTLS(0xc00010c900, 0xc000016300, 0xe, 0x0, 0xc00001630f, 0x3)
    /opt/go/src/net/http/transport.go:1515 +0x1a5
net/http.(*Transport).dialConn(0x8777c0, 0x712a40, 0xc000018080, 0x0, 0x6c26fd, 0x5, 0xc000016300, 0x12, 0x0, 0xc00010c900, ...)
    /opt/go/src/net/http/transport.go:1585 +0x1d25
net/http.(*Transport).dialConnFor(0x8777c0, 0xc00011a2c0)
    /opt/go/src/net/http/transport.go:1421 +0xc6
created by net/http.(*Transport).queueForDial
    /opt/go/src/net/http/transport.go:1390 +0x40f

goroutine 10 [runnable]:
encoding/asn1.getUniversalType(0x7153a0, 0x686dc0, 0x540, 0x0, 0x0)
    /opt/go/src/encoding/asn1/common.go:149 +0x47d
encoding/asn1.parseField(0x686dc0, 0xc0005c90b0, 0x199, 0xc000412b3f, 0x12, 0x540, 0x0, 0x0, 0x0, 0x0, ...)
    /opt/go/src/encoding/asn1/asn1.go:781 +0xe65
encoding/asn1.parseSequenceOf(0xc000412b3f, 0x12, 0x540, 0x7153a0, 0x66b2e0, 0x7153a0, 0x686dc0, 0x66b201, 0xf, 0x0, ...)
    /opt/go/src/encoding/asn1/asn1.go:647 +0x4ba
encoding/asn1.parseField(0x66b2e0, 0xc0000a0478, 0x197, 0xc000412b30, 0x4d, 0x54f, 0xd, 0x0, 0x0, 0x0, ...)
    /opt/go/src/encoding/asn1/asn1.go:967 +0x20c5
encoding/asn1.parseSequenceOf(0xc000412b30, 0x4d, 0x54f, 0x7153a0, 0x672fe0, 0x7153a0, 0x66b2e0, 0x672f01, 0x2, 0x0, ...)
    /opt/go/src/encoding/asn1/asn1.go:647 +0x4ba
encoding/asn1.parseField(0x672fe0, 0xc0002377e0, 0x197, 0xc000412b2e, 0x4f, 0x551, 0x0, 0x0, 0x0, 0x0, ...)
    /opt/go/src/encoding/asn1/asn1.go:967 +0x20c5
encoding/asn1.UnmarshalWithParams(0xc000412b2e, 0x4f, 0x551, 0x670b20, 0xc0002377e0, 0x0, 0x0, 0xc000412bec, 0x0, 0x493, ...)
    /opt/go/src/encoding/asn1/asn1.go:1102 +0x1b6
encoding/asn1.Unmarshal(...)
    /opt/go/src/encoding/asn1/asn1.go:1095
crypto/x509.parseCertificate(0xc0004ae300, 0x56a, 0x57f, 0x65d640)
    /opt/go/src/crypto/x509/x509.go:1364 +0x3d0
crypto/x509.ParseCertificate(0xc000412b00, 0x56a, 0x57f, 0xc0005c8f90, 0xc00029b7fb, 0x1e9e3)
    /opt/go/src/crypto/x509/x509.go:1569 +0xf3
crypto/x509.(*CertPool).AppendCertsFromPEM(0xc0000779e0, 0xc000280000, 0x3a1de, 0x3a3de, 0x3a3de)
    /opt/go/src/crypto/x509/cert_pool.go:143 +0xff
crypto/x509.loadSystemRoots(0x0, 0x0, 0xc000179678)
    /opt/go/src/crypto/x509/root_unix.go:83 +0x4a5
crypto/x509.initSystemRoots()
    /opt/go/src/crypto/x509/root.go:23 +0x26
sync.(*Once).doSlow(0x8adfb0, 0x6d0180)
    /opt/go/src/sync/once.go:66 +0xec
sync.(*Once).Do(...)
    /opt/go/src/sync/once.go:57
crypto/x509.systemRootsPool(...)
    /opt/go/src/crypto/x509/root.go:18
crypto/x509.(*Certificate).Verify(0xc00012a580, 0xc000016300, 0xe, 0xc000077950, 0x0, 0xbfc5129aaf05ae1e, 0x140af78, 0x87d120, 0x0, 0x0, ...)
    /opt/go/src/crypto/x509/verify.go:776 +0x773
crypto/tls.(*Conn).verifyServerCertificate(0xc00007b180, 0xc000077440, 0x2, 0x2, 0x929, 0x0)
    /opt/go/src/crypto/tls/handshake_client.go:846 +0x29e
crypto/tls.(*clientHandshakeStateTLS13).readServerCertificate(0xc000179df0, 0x0, 0x0)
    /opt/go/src/crypto/tls/handshake_client_tls13.go:455 +0x25f
crypto/tls.(*clientHandshakeStateTLS13).handshake(0xc000179df0, 0xc000018668, 0x4)
    /opt/go/src/crypto/tls/handshake_client_tls13.go:85 +0x1fd
crypto/tls.(*Conn).clientHandshake(0xc00007b180, 0x0, 0x0)
    /opt/go/src/crypto/tls/handshake_client.go:209 +0x66b
crypto/tls.(*Conn).Handshake(0xc00007b180, 0x0, 0x0)
    /opt/go/src/crypto/tls/conn.go:1362 +0xc9
net/http.(*persistConn).addTLS.func2(0x0, 0xc00007b180, 0xc00006c410, 0xc000062900)
    /opt/go/src/net/http/transport.go:1509 +0x45
created by net/http.(*persistConn).addTLS
    /opt/go/src/net/http/transport.go:1505 +0x177
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x45570b]

runtime stack:
runtime.throw(0x6c5954, 0x2a)
    /opt/go/src/runtime/panic.go:1116 +0x72
runtime.sigpanic()
    /opt/go/src/runtime/signal_unix.go:704 +0x4ac
runtime.deltimer(0xc00006c000, 0xc000024800)
    /opt/go/src/runtime/time.go:308 +0xeb
time.stopTimer(...)
    /opt/go/src/runtime/time.go:214
runtime.wakeScavenger()
    /opt/go/src/runtime/mgcscavenge.go:197 +0x71
runtime.sysmon()
    /opt/go/src/runtime/proc.go:4709 +0x30a
runtime.mstart1()
    /opt/go/src/runtime/proc.go:1172 +0xc8
runtime.mstart()
    /opt/go/src/runtime/proc.go:1137 +0x6e
martisch commented 4 years ago

Wild guess could be https://github.com/golang/go/issues/35777 or similar bad effects due to signals from async preemption. Try running with GODEBUG=asyncpreemptoff=1 and see if that avoids the crashes.

It seems Debian 8 LTS support ended June 2020. Im not sure how new/patched the kernel is on that system.

gwillem commented 4 years ago

Thanks, GODEBUG=asyncpreemptoff=1 fixes the crash on this system (n=100).

I am puzzled though, #35777 mentions recent (5.x) kernels, while this case is about kernel 3.18.11.

Unfortunately I have no control over my user's kernel or command invocation. Shall I stick to go1.13.15 for my production builds? Or can we programmatically disable asyncpreempt to maximize compatibility with older kernels?

martisch commented 4 years ago

I think the next step is to figure out why async preemption interacts badly with the 3.18.11 kernel here (Debian 8).

For now you can programmatically disable asyncpreemption by setting GODEBUG=asyncpreemptoff=1 in your systems environment. Another option as the Linux system release isnt supported by the Distribution anymore one could upgrade the system and a newer kernel might solve the problem on its own.

/cc @prattmic @aclements @mknyszek

gwillem commented 4 years ago

Thanks. Unfortunately, I have no control over the system/env, and it seems that debug.asyncpreemptoff cannot be set from within a Go program, so I'll stick with 1.13 then.

https://golang.org/src/runtime/runtime1.go?h=asyncpreemptoff#L340

martisch commented 4 years ago

https://www.kernel.org/category/releases.html seems to suggest 3.18.11 is not supported anymore and quite old same for the debian installation. So im not sure how effective it will be finding the bug and if it is on the linux side it may likely not get fixed. There may however be interest in learning what the issue is and if it could be reproduced on newer kernels/linux installations (or reintroduced if not careful) if it isnt related to a known issue like: #35777

If you have control over your systems go installation you could change the go runtime code to have debug.asyncpreemptoff be always true regardless of GODEBUG.

ianlancetaylor commented 4 years ago

This certainly looks like memory corruption. It can't be exactly the same cause as #35777, but it could be related.

Is it possible for you to change osArchInit in runtime/os_linux_x86.go so that it always sets gsignalInitQuirk and throwReportQuirk? Currently it only sets those on Linux kernel versions that are known to have the bug described at https://golang.org/wiki/LinuxKernelSignalVectorBug. But it's possible that your kernel has a different bug with similar effect: failing to correctly restore some registers when returning from a signal. It would be interesting to learn whether setting the Quirk variables, which will mlock the signal stack, fixes the problem.

gwillem commented 4 years ago

Sorry, problem persists, using 1.15 and:

func osArchInit() {
    gsignalInitQuirk = mlockGsignal
    if m0.gsignal != nil {
        throw("gsignal quirk too late")
    }
    throwReportQuirk = throwBadKernel
}

FYI @ianlancetaylor the error output was runtimer: bad p (with -r) https://golang.org/src/runtime/time.go#L745

ianlancetaylor commented 4 years ago

Thanks for trying that.

So this still looks like memory corruption, and it seems to be related to signal handling, and as far as we know it only happens on older Linux kernels.

I'm not sure what to suggest.