Closed bcmills closed 4 years ago
2020-04-13T22:38:56-ca017a6/windows-amd64-longtest
--- FAIL: TestScript (0.01s)
--- FAIL: TestScript/test_go111module_cache (1.22s)
script_test.go:205:
> env GO111MODULE=on
> go mod init foo
[stderr]
go: creating new go.mod: module foo
> go test
[stdout]
exit status 3221225477
FAIL foo 0.267s
[exit status 1]
FAIL: testdata\script\test_go111module_cache.txt:3: unexpected command failure
FAIL
FAIL cmd/go 95.006s
That makes two in the last month, and none from the entire year before. Absent a specific theory of what's wrong, I think we need to treat this as a potential 1.15 compiler regression.
CC @randall77 @josharian
Or a runtime regression. The other repeating mystery failure is on windows/386, which has windows in common. I took a quick peek through git history around 1/9/2020 and CL 213837 might be a candidate(?).
Ooh, CL 213837 is an interesting hypothesis. (CC @aclements @cherrymui @dr2chase @ianlancetaylor).
(For reference, the windows/386 issue is #37360.)
There is also the mysterious bug #36492.
Alex
2020-04-30T02:29:55-eda6fe3/windows-amd64-longtest
2020-04-29T21:29:17-faafdf5/windows-amd64-longtest
2020-04-29T21:29:08-b3863fb/windows-amd64-longtest
2020-04-29T20:33:31-197a2a3/windows-amd64-longtest
2020-04-29T19:03:10-35dce1d/windows-amd64-longtest
2020-04-29T18:56:19-b13ce66/windows-amd64-longtest
2020-04-29T18:50:32-202c43b/windows-amd64-longtest
2020-04-29T08:57:33-e1a96b8/windows-amd64-longtest
2020-04-28T18:41:07-39380e8/windows-amd64-longtest
2020-04-27T20:06:57-d2f5e4e/windows-amd64-longtest
2020-04-27T18:19:26-a136919/windows-amd64-longtest
2020-04-27T17:49:51-8af8fa9/windows-amd64-longtest
2020-04-27T17:20:00-1c4e9b2/windows-amd64-longtest
2020-04-27T17:19:27-f4e13b8/windows-amd64-2016
2020-04-27T17:19:16-5029c36/windows-amd64-longtest
2020-04-27T15:53:46-9b9556f/windows-amd64-2016
2020-04-27T10:23:01-925d6b3/windows-amd64-longtest
2020-04-25T18:40:45-7ef28cb/windows-amd64-longtest
2020-04-25T02:19:12-49f10f3/windows-amd64-longtest
2020-04-24T23:58:38-396833c/windows-amd64-longtest
2020-04-24T23:12:43-b0a8754/windows-amd64-longtest
2020-04-24T23:12:23-6b5ab20/windows-amd64-longtest
2020-04-24T23:06:59-1c54535/windows-amd64-longtest
2020-04-24T23:06:48-47b5efa/windows-amd64-longtest
2020-04-24T22:01:49-d8d3542/windows-amd64-longtest
2020-04-24T21:12:21-943a0d0/windows-amd64-longtest
2020-04-24T17:13:12-da33f9c/windows-amd64-longtest
2020-04-24T13:26:34-ee8972c/windows-amd64-longtest
2020-04-23T19:40:07-6303c34/windows-amd64-longtest
2020-04-22T16:19:35-1257900/windows-amd64-2016
The same exit status also appeared on a 32-bit system in a trybot run, reported as exit status -1073741819.
https://storage.googleapis.com/go-build-log/bbdccf35/windows-386-2008_b817f391.log
I don't see any in the saved logs, though, so that might have really been corruption of some sort.
It's strange to see this exit status. It should be handled as an exception by the runtime package (code in runtime/signal_windows.go). Even if the runtime package doesn't handle the exception, it should still cause a stack trace and an exit with status 2. I don't know under what conditions Windows can still have a 0xc0000005 exit status.
The error seems to tend to happen in cmd/go TestScript or when running the tests in the test directory, which are both cases where we are executing a large number of child processes in parallel.
I don't know what is going on here; not looking at it any further at present.
> It's strange to see this exit status.
The process returned ERROR_ACCESS_DENIED. ERROR_ACCESS_DENIED is just another Windows error; there is no restriction on which errors a process can and cannot return.
> It should be handled as an exception by the runtime package (code in runtime/signal_windows.go).
I agree. Mind you, if something goes wrong in the exception handler, it won't exit the process properly, and Windows itself will exit the process.
Also, maybe we have a bug and the exception handler does not run for some reason.
> Even if the runtime package doesn't handle the exception, it should still cause a stack trace and an exit with status 2.
The exception handler can fail while it is printing the stack trace.
> I don't know under what conditions Windows can still have a 0xc0000005 exit status.
If I run this program
https://play.golang.org/p/r_g4sD76FOh
it prints
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x0 pc=0x9853ca]
goroutine 1 [running]:
main.main()
/home/a/src/issues/go/38440/a.go:9 +0x2a
on Windows. I reckon (I did not test) that the program will return the 0xc0000005 exit code if I disable the exception handler.
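The playground link's source isn't shown here; a minimal sketch of what such a program presumably looks like (reconstructed from the panic message above, not the actual playground code) is a plain nil pointer dereference, which raises an access violation (0xc0000005) that the runtime's exception handler converts into a Go panic:

```go
// Sketch: a nil dereference triggers 0xc0000005 on Windows, and the
// runtime's vectored exception handler turns it into a recoverable panic.
package main

import "fmt"

func deref(p *int) int { return *p }

func main() {
	defer func() {
		// With the exception handler working, the fault surfaces as an
		// ordinary runtime panic rather than a raw process exit.
		fmt.Println("recovered:", recover())
	}()
	deref(nil)
}
```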
Do we know which program is failing? I still can't tell which program failed from the test output. If we knew which program fails, maybe we could add some temporary debugging code to it.
Alex
It's a range of different programs that are failing. The only consistency I've seen so far is that it happens in the cmd/go testsuite and in the top-level test testsuite, both of which run many different Go programs in parallel. Some of the Go programs that they run in parallel fail with exit status 0xc0000005 and no other output, implying, as you say, that the exception handler did not take effect for some reason.
I'm wondering if it's possible that when memory allocation fails while calling into Windows, such as when starting a new thread, that this exit status might occur. That is pure speculation, though; I have no evidence.
I'm wondering if it's possible that when memory allocation fails while calling into Windows, such as when starting a new thread, that this exit status might occur.
The way threads are created by the Go runtime is by calling the CreateThread Windows API (in runtime.newosproc). So, if CreateThread fails and returns an error, it should be printed by the runtime. The same applies when Go allocates memory.
I also considered a bug in how we implemented exception handling. Windows exception handling on 386 is very different from amd64. The fact that both 386 and amd64 builders fail with this error suggests that the problem is not exception-handler related.
Maybe the crashing program does print at least some stack trace, but its output is lost by the parent process. Maybe stdout is not read for some reason by the parent process. The fact that the exit code is 0xc0000005 does not support that theory, but ...
Alex
It seems this became much more common in early April.
$ greplogs -dashboard -E "exit status 3221225477" -l | findflakes -paths
First observed ea7126f 10 Apr 16:24 2020 (455 commits ago)
Last observed 364a05e 14 May 15:18 2020 (2 commits ago)
76% chance failure is still happening
13% failure probability (59 of 454 commits)
Likely culprits:
13% ea7126f cmd/compile: use a Sym type instead of interface{} for symbolic offsets
11% 782fcb4 regexp: add (*Regexp).SubexpIndex
10% 245409e runtime: replace the type of netpollWakeSig from a uintptr to a uint32
8% 7b90c1c os/exec: extend grace period in TestExtraFiles to 20% of overall deadline
7% 28157b3 cmd/compile: start implementing strongly typed aux and auxint fields
6% a1550d3 cmd/compile: use isel with variable shifts on ppc64x
6% 7c0ee11 test: deflaking measures for runtime gdb test
5% b191c60 cmd/oldlink: delete darwin/386 host object loading support
4% a3ef804 cmd/internal/obj/ppc64: leverage AND operation to calculate remainder
4% c1f0eda net: convert many Close tests to use parallel subtests
Past failures:
8a958bb 29 Apr 16:24 2018 to ad4e637 04 Jul 00:15 2018
0.14% failure probability (2 of 695 commits)
08c4348 10 Apr 18:40 2015 (isolated failure)
Hello! This is one of the few remaining issues blocking the Beta release of Go 1.15. We'll need to make a decision on this in the next week in order to keep our release on schedule.
I'm working on bisecting this (which is a very slow process; given the 13% failure probability, it takes 34 successful runs to drive the chance of a missed repro under 1%). I have been able to reproduce it as far back as da8591b61c141ca58e4b3eae382d2827006344fd, which is 32 commits earlier than the first observation on the dashboard (ea7126fe141879e1065d196570c078fbec09f3b6).
Reproduced at e31d741801514f2a008625fd246644d2214f4516, which is 64 commits before the first observation on the dashboard.
Given the failure probability and the clear cut-off on the dashboard, I'm starting to wonder if something changed on the builder, rather than in the tree. Assuming it had the same 13% failure probability before the first dashboard observation, there's only a (1-0.13)**64 = 0.01% chance it wouldn't have appeared somewhere in those 64 commits.
For some reason (I don’t recall why), when I first looked at this in https://github.com/golang/go/issues/38440#issuecomment-614102077, I thought the first occurrence was on 1/9/2020. Have you looked around that date range?
This is definitely not happening at tag go1.14.3.
@josharian, I haven't looked around that range yet. Do you have any recollection where 1/9/2020 came from? :)
I should note that go1.14.3 included CL 213837, which was your hypothesized culprit, but I wasn't able to reproduce at that tag (I even accidentally ran a lot more iterations than I meant to!)
Ah, I think it's because of https://github.com/golang/go/issues/37360#issuecomment-614060591, which I mentally transposed into this issue. They may be related, in that they are both weird exit statuses on Windows.
For lack of a better idea, I'm now running git bisect start da8591b61c141ca58e4b3eae382d2827006344fd go1.14.3. My reproduction steps are:
VM=$(gomote create windows-amd64-longtest)
gomote push $VM
timeout 45m gomote run -path '$PATH,$WORKDIR/go/bin' -e GO_TEST_SHORT=0 -e GO_TEST_TIMEOUT_SCALE=5 $VM go/src/all.bat
I've wrapped this in a horrible shell script that considers "exit status 3221225477" to be a failure, "ALL TESTS PASSED" to be a success, and anything else to be a flake. And I'm running it across five gomotes so it doesn't take an eternity and a half.
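The classification logic of that wrapper presumably amounts to something like the following (a sketch only — the actual script wasn't shared, and `classify` is a name I made up):

```go
// Sketch of the run-classification described above: bucket each gomote
// run's output into failure, success, or flake based on marker strings.
package main

import (
	"fmt"
	"strings"
)

func classify(output string) string {
	switch {
	case strings.Contains(output, "exit status 3221225477"):
		return "FAIL" // the 0xc0000005 failure under investigation
	case strings.Contains(output, "ALL TESTS PASSED"):
		return "PASS"
	default:
		return "FLAKE" // timeout, builder hiccup, unrelated failure, ...
	}
}

func main() {
	fmt.Println(classify("ALL TESTS PASSED")) // PASS
}
```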
Almost there. Looking at the remaining commits, my guess is it's 4ec5e7c50fb2d7fb158578e620e026779c3aafe3, but I'll see the bisect through and do some (even longer!) verification runs.
For the record, here's my current bisect log:
# bad: [da8591b61c141ca58e4b3eae382d2827006344fd] all: remove darwin/arm build-tags and files
# good: [f296b7a6f045325a230f77e9bda1470b1270f817] [release-branch.go1.14] go1.14.3
git bisect start 'da8591b61c141ca58e4b3eae382d2827006344fd' 'go1.14.3'
# good: [1cd724acb6304d30d8998d14a5469fbab24dd3b1] doc/go1.14: highlight the addition of hash/maphash package
git bisect good 1cd724acb6304d30d8998d14a5469fbab24dd3b1
# good: [14d20dc4f38dda548f6991cc687bc71b250b1603] cmd/go: replace TestCgoDependsOnSyscall with a simpler script test
git bisect good 14d20dc4f38dda548f6991cc687bc71b250b1603
# good: [9baafabac9a84813a336f068862207d2bb06d255] crypto/rsa: refactor RSA-PSS signing and verification
git bisect good 9baafabac9a84813a336f068862207d2bb06d255
# good: [0e06775eebf6d357f604a55c803aab5cc8054e67] [dev.link] cmd/link: don't truncate external relocation type
git bisect good 0e06775eebf6d357f604a55c803aab5cc8054e67
# good: [1cb582fe026d22cc886634f49ec0be63bcca911f] [dev.link] cmd/link: fix end-of-block padding (again)
git bisect good 1cb582fe026d22cc886634f49ec0be63bcca911f
# good: [1cb582fe026d22cc886634f49ec0be63bcca911f] [dev.link] cmd/link: fix end-of-block padding (again)
git bisect good 1cb582fe026d22cc886634f49ec0be63bcca911f
# bad: [fff7509d472778cae5e652dbe2479929c666c24f] cmd/compile: add intrinsic HasCPUFeature for checking cpu features
git bisect bad fff7509d472778cae5e652dbe2479929c666c24f
# bad: [888a0c8ef6afb752aafd147eda40d62796d87cb3] testing: add TB.TempDir
git bisect bad 888a0c8ef6afb752aafd147eda40d62796d87cb3
# bad: [cc3039f818a372274180834385718d2dfccb1a26] [dev.link] cmd/internal/goobj2: remove old-style accessors
git bisect bad cc3039f818a372274180834385718d2dfccb1a26
That's an interesting theory...
- fmap, err := syscall.CreateFileMapping(syscall.Handle(out.f.Fd()), nil, syscall.PAGE_READONLY, high, low, nil)
+ fmap, err := syscall.CreateFileMapping(syscall.Handle(out.f.Fd()), nil, syscall.PAGE_READWRITE, high, low, nil)
I wouldn't expect access violations when adding more permissions to a mapping... if we were still partying like it's 1999. But with DEP, we could be hitting that if we're trying to execute code out of those pages. I haven't looked to see what the linker is using this area for at all, but it might be worth trying to change that to PAGE_EXECUTE_READWRITE and see if that makes a difference. If it does, then we can go back and revisit the situation in general.
I don't recall the details, but I believe the mapping actually never worked before that change, and we would silently fall back to reading the object file into the heap. So it's not actually that we added more permissions, but that we went from not using a mapping to using a mapping. @cherrymui, do you recall?
> I don't recall the details, but I believe the mapping actually never worked before
Yes, exactly. Before that CL the mapping actually failed and we just fell back to file I/O.
Actually, I suspect this CL as well. Yesterday I spent some time re-reading the source code and some documents on MSDN, but couldn't find anything really wrong. Since I don't know how to reliably test this, I didn't try it myself. @aclements, if you can reliably test this, you could try disabling mmap on Windows and see.
One place that looks suspicious to me is that we close the file mapping handle while still using the mapping: https://tip.golang.org/src/cmd/link/internal/ld/outbuf_windows.go#L24. But reading the documents it's not clear to me whether that is actually a problem.
@zx2c4 this is the linker writing the output binary. It doesn't execute anything in the mapping. (But likely a subsequent command will, in a separate process.)
Change https://golang.org/cl/235637 mentions this issue: cmd/link: disable mmap on Windows for now
Change https://golang.org/cl/235638 mentions this issue: cmd/link: close file mapping handle after unmapping
@aclements if you have a way to test this reliably, does either of the CLs above help? Thanks.
Change https://golang.org/cl/235639 mentions this issue: cmd/link: flush file mapping before unmapping
Also, does flushing the mapping help? Thanks...
Bisect is done and implicates ff29f97437765abb4dd47a239c3cca274175d5e0, which is one commit after my suspected culprit, 4ec5e7c50fb2d7fb158578e620e026779c3aafe3. I ran 4ec5e7c50fb2d7fb158578e620e026779c3aafe3 65 times with no failures, so either it's not triggering for some unclear reason, or the actual bug is from a change in the object decoder that caused it to go off the end of the mapping.
I'll try running with those two CLs. If it is actually a bounds bug in the decoder, that may mask it, or turn it into a Go bounds panic.
Here's my full bisect log:
# bad: [da8591b61c141ca58e4b3eae382d2827006344fd] all: remove darwin/arm build-tags and files
# good: [f296b7a6f045325a230f77e9bda1470b1270f817] [release-branch.go1.14] go1.14.3
git bisect start 'da8591b61c141ca58e4b3eae382d2827006344fd' 'go1.14.3'
# good: [1cd724acb6304d30d8998d14a5469fbab24dd3b1] doc/go1.14: highlight the addition of hash/maphash package
git bisect good 1cd724acb6304d30d8998d14a5469fbab24dd3b1
# good: [14d20dc4f38dda548f6991cc687bc71b250b1603] cmd/go: replace TestCgoDependsOnSyscall with a simpler script test
git bisect good 14d20dc4f38dda548f6991cc687bc71b250b1603
# good: [9baafabac9a84813a336f068862207d2bb06d255] crypto/rsa: refactor RSA-PSS signing and verification
git bisect good 9baafabac9a84813a336f068862207d2bb06d255
# good: [0e06775eebf6d357f604a55c803aab5cc8054e67] [dev.link] cmd/link: don't truncate external relocation type
git bisect good 0e06775eebf6d357f604a55c803aab5cc8054e67
# good: [1cb582fe026d22cc886634f49ec0be63bcca911f] [dev.link] cmd/link: fix end-of-block padding (again)
git bisect good 1cb582fe026d22cc886634f49ec0be63bcca911f
# good: [1cb582fe026d22cc886634f49ec0be63bcca911f] [dev.link] cmd/link: fix end-of-block padding (again)
git bisect good 1cb582fe026d22cc886634f49ec0be63bcca911f
# bad: [fff7509d472778cae5e652dbe2479929c666c24f] cmd/compile: add intrinsic HasCPUFeature for checking cpu features
git bisect bad fff7509d472778cae5e652dbe2479929c666c24f
# bad: [888a0c8ef6afb752aafd147eda40d62796d87cb3] testing: add TB.TempDir
git bisect bad 888a0c8ef6afb752aafd147eda40d62796d87cb3
# bad: [cc3039f818a372274180834385718d2dfccb1a26] [dev.link] cmd/internal/goobj2: remove old-style accessors
git bisect bad cc3039f818a372274180834385718d2dfccb1a26
# bad: [6dab0942a43039b8dea365ba8629cdc519cfb143] [dev.link] cmd/internal/goobj2, cmd/link: add and use new Sym accessors
git bisect bad 6dab0942a43039b8dea365ba8629cdc519cfb143
# bad: [ff29f97437765abb4dd47a239c3cca274175d5e0] [dev.link] cmd/internal/goobj2: change StringRef encoding for better locality
git bisect bad ff29f97437765abb4dd47a239c3cca274175d5e0
# good: [4ec5e7c50fb2d7fb158578e620e026779c3aafe3] [dev.link] cmd/link: fix mmapping OutBuf on windows
git bisect good 4ec5e7c50fb2d7fb158578e620e026779c3aafe3
# first bad commit: [ff29f97437765abb4dd47a239c3cca274175d5e0] [dev.link] cmd/internal/goobj2: change StringRef encoding for better locality
That's interesting. If it were some bounds bug in the decoder, I would expect the failure to be more deterministic.
@cherrymui, maybe this is another one of those page-alignment bugs, where the failure only triggers if the bound happens to be aligned to a page boundary?
Do we know which process is failing? It typically fails at "go run" or "go test". It could be the go command, the compiler, the linker, or the generated executable that fails. If it is some bounds bug, I guess the linker will fail. If there is something weird with the file mapping, I guess it could be the linker or the executable.
Given that it typically fails when there are a large number of child processes running in parallel, my guess is that it is more likely the generated executable.
It fails when there are lot of tests running in parallel, where each test involves building a Go program and then in some cases running it. As far as I can see this could be anywhere in the toolchain: cmd/go, cmd/compile, cmd/link, the final executable.
I don't think the failure messages are consistent with the final test executable failing. If that were the case, cmd/go would typically print a final FAIL line:
$ cat > main_test.go
package main

import (
	"os"
	"testing"
)

func TestExitNonzero(t *testing.T) {
	os.Exit(0xC0000005)
}
$ go test ./main_test.go
exit status 5
FAIL	command-line-arguments	0.019s
FAIL
Might be worth trying calling FlushViewOfFile to see if that makes a difference.
> Might be worth trying calling FlushViewOfFile to see if that makes a difference.
It will be interesting if that works. MapViewOfFile isn't consistent with the ordinary read/write calls for file I/O, until you call FlushViewOfFile (although I would expect for unmap to internally do a flush). If this is a case in which the linker needs to call that before the binary is subsequently executed, then it would indicate that NT's PE loader is also using a file loading mechanism that becomes inconsistent with MapViewOfFile until FlushViewOfFile is called. That would surprise me, because executable image loaders since forever have just mmap'd the various sections of the binary, sharing pages across instances. On the other hand, there's also Windows file locking, which I could imagine interacts strangely for some edge cases.
Either way, anxiously awaiting your results, @aclements .
> Do we know which process is failing?
I don't think we know.
We can see "exit status 0xC0000005" in the error output. The message is the parent process reporting a child process failure. I wonder if we could add debug code to the parent process to print more information there. Maybe we could include the command name and argument list. Extend whatever os package structures are required to print that information in the exit string, just for debugging this.
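A sketch of the kind of temporary debugging this suggests, done at the os/exec level in the parent rather than by changing os package internals (the helper name `runVerbose` and the message format are mine, not an actual proposal):

```go
// Hypothetical debugging aid: when a child process fails, report its full
// argv alongside the error, so we can tell which program hit 0xc0000005.
package main

import (
	"fmt"
	"os/exec"
)

func runVerbose(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	out, err := cmd.CombinedOutput()
	if err != nil {
		// Include the complete command line so the failing child is
		// identifiable even when it produced no output of its own.
		return fmt.Errorf("command %q failed: %v\noutput: %s", cmd.Args, err, out)
	}
	return nil
}

func main() {
	if err := runVerbose("go", "version"); err != nil {
		fmt.Println(err)
	}
}
```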
Sorry I don't have time to try it now.
Alex
CL 235639 (close file mapping handle after unmapping + flush file mapping before unmapping) did not fail after 35 runs. I'm starting a second run to give it more bake time.
For my reference, the command I'm using:
gopool2 create -setup 'gomote push $VM' windows-amd64-longtest
stress2 -timeout 45m -p 7 -max-runs 35 -max-total-runs 100 -pass "ALL TESTS PASSED" -fail "exit status 3221225477" gopool2 run 'gomote run -e GO_TEST_SHORT=0 -e GO_TEST_TIMEOUT_SCALE=5 $VM go/src/all.bat'
cc @jstarks for any insights...
32 more runs of CL 235639 (close file mapping handle after unmapping + flush file mapping before unmapping) with no failures (then my gomote pool died).
I would certainly like to understand this better, but for now I think we can land this CL to keep the beta on track. I'm going to do a run with just CL 235639 (flushing without the late close) to isolate things a little better.
Thanks @aclements ! Rebased CL 235639 so it is not on top of the other CL.
Observed in 2020-04-10T16:24:46-ea7126f/windows-amd64-longtest:
The FAIL int/t1 is unexpected in this test. Exit status 3221225477 is 0xC0000005, which some cursory searching seems to suggest is a generic “access violation” error. That suggests possible memory corruption.
I haven't seen any repeats of this error so far.
CC @alexbrainman @zx2c4