Open gwillem opened 4 years ago
Wild guess: this could be https://github.com/golang/go/issues/35777 or a similar bad effect of the signals used for async preemption. Try running with GODEBUG=asyncpreemptoff=1 and see whether that avoids the crashes.
It seems Debian 8 LTS support ended in June 2020. I'm not sure how new or patched the kernel is on that system.
Thanks, GODEBUG=asyncpreemptoff=1 fixes the crash on this system (n=100).
I am puzzled though, #35777 mentions recent (5.x) kernels, while this case is about kernel 3.18.11.
Unfortunately, I have no control over my users' kernels or command invocation. Shall I stick to go1.13.15 for my production builds? Or can we programmatically disable async preemption to maximize compatibility with older kernels?
I think the next step is to figure out why async preemption interacts badly with the 3.18.11 kernel here (Debian 8).
For now you can disable async preemption by setting GODEBUG=asyncpreemptoff=1 in your system's environment. Another option: since this Linux release is no longer supported by the distribution, you could upgrade the system, and a newer kernel might solve the problem on its own.
/cc @prattmic @aclements @mknyszek
Thanks. Unfortunately, I have no control over the system/env, and it seems that debug.asyncpreemptoff cannot be set from within a Go program, so I'll stick with 1.13 then.
https://golang.org/src/runtime/runtime1.go?h=asyncpreemptoff#L340
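One possible workaround for the "no control over the environment" constraint, which nobody in this thread proposed or tested, is to have the program re-execute itself with GODEBUG amended. The sketch below assumes a Unix-like system where syscall.Exec is available; the helper names godebugWithPreemptOff and reexecIfNeeded are hypothetical. Note that the runtime still starts once with async preemption enabled before the re-exec happens, so this narrows the crash window rather than provably closing it.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"syscall"
)

const preemptOff = "asyncpreemptoff=1"

// godebugWithPreemptOff returns the given GODEBUG value with
// asyncpreemptoff=1 appended, keeping any settings already present.
// If the setting is already there, the value is returned unchanged.
func godebugWithPreemptOff(current string) string {
	for _, kv := range strings.Split(current, ",") {
		if kv == preemptOff {
			return current
		}
	}
	if current == "" {
		return preemptOff
	}
	return current + "," + preemptOff
}

// reexecIfNeeded replaces the current process with a fresh copy of the
// same binary when GODEBUG lacks asyncpreemptoff=1. It returns nil
// (without re-executing) when the setting is already in place.
func reexecIfNeeded() error {
	current := os.Getenv("GODEBUG")
	want := godebugWithPreemptOff(current)
	if want == current {
		return nil // already running with the setting
	}
	exe, err := os.Executable()
	if err != nil {
		return err
	}
	if err := os.Setenv("GODEBUG", want); err != nil {
		return err
	}
	// syscall.Exec only returns if the exec itself failed.
	return syscall.Exec(exe, os.Args, os.Environ())
}

func main() {
	if err := reexecIfNeeded(); err != nil {
		fmt.Fprintln(os.Stderr, "re-exec failed:", err)
		os.Exit(1)
	}
	fmt.Println("running with GODEBUG =", os.Getenv("GODEBUG"))
}
```

Calling reexecIfNeeded as the very first statement in main keeps the amount of code that runs with async preemption enabled as small as possible.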
https://www.kernel.org/category/releases.html seems to suggest that 3.18.11 is no longer supported and is quite old; the same goes for the Debian installation. So I'm not sure how effective it will be to hunt for the bug, and if it is on the Linux side it likely won't get fixed. There may, however, be interest in learning what the issue is and whether it can be reproduced on newer kernels or Linux installations (or reintroduced if we're not careful), if it isn't related to a known issue like #35777.
If you have control over your system's Go installation, you could change the Go runtime code so that debug.asyncpreemptoff is always set regardless of GODEBUG.
This certainly looks like memory corruption. It can't be exactly the same cause as #35777, but it could be related.
Is it possible for you to change osArchInit in runtime/os_linux_x86.go so that it always sets gsignalInitQuirk and throwReportQuirk? Currently it only sets those on Linux kernel versions that are known to have the bug described at https://golang.org/wiki/LinuxKernelSignalVectorBug. But it's possible that your kernel has a different bug with similar effect: failing to correctly restore some registers when returning from a signal. It would be interesting to learn whether setting the Quirk variables, which will mlock the signal stack, fixes the problem.
Sorry, problem persists, using 1.15 and:
func osArchInit() {
	gsignalInitQuirk = mlockGsignal
	if m0.gsignal != nil {
		throw("gsignal quirk too late")
	}
	throwReportQuirk = throwBadKernel
}
FYI @ianlancetaylor the error output was runtimer: bad p (with -r) https://golang.org/src/runtime/time.go#L745
Thanks for trying that.
So this still looks like memory corruption, and it seems to be related to signal handling, and as far as we know it only happens on older Linux kernels.
I'm not sure what to suggest.
Summary: a simple test program built with go1.14 or go1.15 crashes randomly (in ~92% of cases) on a specific Linux server. No problem with 1.13.15.
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Working builds:
1.13.7
1.13.15
Crashing builds:
1.14
1.14.4
1.14.6
1.15
What operating system and processor architecture are you using (go env)?
Build environment:
go env Output
Runtime environment:
What did you do?
crashtest.go:
Test runner:
What did you see instead?
When run a few hundred times, it crashes in ~92% of cases. As you can see, the error message differs between runs. For 200 tests, these are the counted first lines of the errors:
Sample stacktrace: