
runtime: exit status 0xC0000005 from test process on windows-amd64-longtest builder #38440

Closed: bcmills closed this issue 4 years ago

bcmills commented 4 years ago

Observed in 2020-04-10T16:24:46-ea7126f/windows-amd64-longtest:

windows-amd64-longtest …
…
--- FAIL: TestTestCache (4.15s)
    go_test.go:2405: 

        INITIAL

    go_test.go:2413: running testgo [test -x -v -short t/...]
    go_test.go:2413: standard output:
    go_test.go:2413: exit status 3221225477
        FAIL    t/t1    0.486s
        exit status 3221225477
        FAIL    t/t2    0.481s
        === RUN   Test3
            Test3: t3_test.go:6: 1
        --- PASS: Test3 (0.00s)
        PASS
        ok      t/t3    0.462s
        === RUN   Test4
            Test4: t4_test.go:6: 1
        --- PASS: Test4 (0.00s)
        PASS
        ok      t/t4    0.467s
        FAIL

    go_test.go:2413: standard error:

The FAIL in t/t1 is unexpected in this test.

Exit status 3221225477 is 0xC0000005, the Windows STATUS_ACCESS_VIOLATION exception code. That suggests possible memory corruption.
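For reference, the decimal value, the hex code, and the negative number that appears in some reports below are all the same 32-bit value; a quick sketch confirming the conversions:

package main

import "fmt"

func main() {
	var status uint32 = 3221225477
	fmt.Printf("%#x\n", status) // 0xc0000005 (STATUS_ACCESS_VIOLATION)
	fmt.Println(int32(status))  // -1073741819, the signed form some tools report
}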

I haven't seen any repeats of this error so far.

CC @alexbrainman @zx2c4

bcmills commented 4 years ago

2020-04-13T22:38:56-ca017a6/windows-amd64-longtest

--- FAIL: TestScript (0.01s)
    --- FAIL: TestScript/test_go111module_cache (1.22s)
        script_test.go:205: 
            > env GO111MODULE=on
            > go mod init foo
            [stderr]
            go: creating new go.mod: module foo
            > go test
            [stdout]
            exit status 3221225477
            FAIL    foo 0.267s
            [exit status 1]
            FAIL: testdata\script\test_go111module_cache.txt:3: unexpected command failure

FAIL
FAIL    cmd/go  95.006s
bcmills commented 4 years ago

That makes two in the last month, and none from the entire year before. Absent a specific theory of what's wrong, I think we need to treat this as a potential 1.15 compiler regression.

CC @randall77 @josharian

josharian commented 4 years ago

Or a runtime regression. The other repeating mystery failure is on windows/386, which has windows in common. I took a quick peek through git history around 1/9/2020 and CL 213837 might be a candidate(?).

bcmills commented 4 years ago

Ooh, CL 213837 is an interesting hypothesis. (CC @aclements @cherrymui @dr2chase @ianlancetaylor).

(For reference, the windows/386 issue is #37360.)

alexbrainman commented 4 years ago

There is also the mysterious #36492 bug.

Alex

bcmills commented 4 years ago

2020-04-21T05:56:02-4974ac6/windows-amd64-longtest
2020-04-20T16:36:22-e851873/windows-amd64-longtest

bcmills commented 4 years ago

2020-04-21T20:48:43-1811533/windows-amd64-longtest

bcmills commented 4 years ago

2020-04-30T02:29:55-eda6fe3/windows-amd64-longtest
2020-04-29T21:29:17-faafdf5/windows-amd64-longtest
2020-04-29T21:29:08-b3863fb/windows-amd64-longtest
2020-04-29T20:33:31-197a2a3/windows-amd64-longtest
2020-04-29T19:03:10-35dce1d/windows-amd64-longtest
2020-04-29T18:56:19-b13ce66/windows-amd64-longtest
2020-04-29T18:50:32-202c43b/windows-amd64-longtest
2020-04-29T08:57:33-e1a96b8/windows-amd64-longtest
2020-04-28T18:41:07-39380e8/windows-amd64-longtest
2020-04-27T20:06:57-d2f5e4e/windows-amd64-longtest
2020-04-27T18:19:26-a136919/windows-amd64-longtest
2020-04-27T17:49:51-8af8fa9/windows-amd64-longtest
2020-04-27T17:20:00-1c4e9b2/windows-amd64-longtest
2020-04-27T17:19:27-f4e13b8/windows-amd64-2016
2020-04-27T17:19:16-5029c36/windows-amd64-longtest
2020-04-27T15:53:46-9b9556f/windows-amd64-2016
2020-04-27T10:23:01-925d6b3/windows-amd64-longtest
2020-04-25T18:40:45-7ef28cb/windows-amd64-longtest
2020-04-25T02:19:12-49f10f3/windows-amd64-longtest
2020-04-24T23:58:38-396833c/windows-amd64-longtest
2020-04-24T23:12:43-b0a8754/windows-amd64-longtest
2020-04-24T23:12:23-6b5ab20/windows-amd64-longtest
2020-04-24T23:06:59-1c54535/windows-amd64-longtest
2020-04-24T23:06:48-47b5efa/windows-amd64-longtest
2020-04-24T22:01:49-d8d3542/windows-amd64-longtest
2020-04-24T21:12:21-943a0d0/windows-amd64-longtest
2020-04-24T17:13:12-da33f9c/windows-amd64-longtest
2020-04-24T13:26:34-ee8972c/windows-amd64-longtest
2020-04-23T19:40:07-6303c34/windows-amd64-longtest
2020-04-22T16:19:35-1257900/windows-amd64-2016

ianlancetaylor commented 4 years ago

The same exit status appeared on a 32-bit system in a trybot run, reported as exit status -1073741819 (the signed 32-bit form of 0xC0000005).

https://storage.googleapis.com/go-build-log/bbdccf35/windows-386-2008_b817f391.log

I don't see any other occurrences in the saved logs, though, so that one might really have been corruption of some sort.

ianlancetaylor commented 4 years ago

It's strange to see this exit status. It should be handled as an exception by the runtime package (code in runtime/signal_windows.go). Even if the runtime package doesn't handle the exception, it should still cause a stack trace and an exit with status 2. I don't know under what conditions Windows can still have a 0xc0000005 exit status.

ianlancetaylor commented 4 years ago

The error tends to happen in cmd/go TestScript or when running the tests in the test directory, both of which execute a large number of child processes in parallel.

ianlancetaylor commented 4 years ago

I don't know what is going on here; I'm not looking at it any further at present.

alexbrainman commented 4 years ago

It's strange to see this exit status.

The process returned STATUS_ACCESS_VIOLATION. That is just another Windows status code; there is no restriction on what exit code a process can return.

It should be handled as an exception by the runtime package (code in runtime/signal_windows.go).

I agree.

Mind you, if something goes wrong in the exception handler, it won't exit the process properly, and Windows itself will terminate the process.

Also, maybe we have a bug and the exception handler does not run for some reason.

Even if the runtime package doesn't handle the exception, it should still cause a stack trace and an exit with status 2.

The exception handler can fail while it is printing the stack trace.

I don't know under what conditions Windows can still have a 0xc0000005 exit status.

If I run this program

https://play.golang.org/p/r_g4sD76FOh
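(the playground source isn't reproduced in the comment; presumably it is a minimal nil-pointer dereference along these lines, a reconstruction rather than the exact program)

package main

func main() {
	var p *int
	println(*p) // nil pointer dereference: raises exception 0xc0000005 on Windows
}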

it prints

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x0 pc=0x9853ca]

goroutine 1 [running]:
main.main()
        /home/a/src/issues/go/38440/a.go:9 +0x2a

on Windows. I reckon (though I did not test it) that the program would return the 0xc0000005 exit code if the exception handler were disabled.

Do we know which program is failing? I still can't tell from the test output which program failed. If we knew which program fails, maybe we could add some temporary debugging code to it.

Alex

ianlancetaylor commented 4 years ago

It's a range of different programs that are failing. The only consistency I've seen so far is that it happens in the cmd/go testsuite and in the top-level test testsuite, both of which run many different Go programs in parallel. Some of the Go programs that they run in parallel fail with exit status 0xc0000005 and no other output, implying, as you say, that the exception handler did not take effect for some reason.

I'm wondering whether this exit status might occur when memory allocation fails while calling into Windows, such as when starting a new thread. That is pure speculation, though; I have no evidence.

alexbrainman commented 4 years ago

I'm wondering whether this exit status might occur when memory allocation fails while calling into Windows, such as when starting a new thread.

The Go runtime creates threads by calling the CreateThread Windows API (in runtime.newosproc). So if CreateThread fails and returns an error, the runtime should print it. The same goes for Go's memory allocation.

I also considered a bug in how we implemented exception handling. Windows exception handling on 386 is very different from amd64. The fact that both the 386 and amd64 builders fail with this error suggests the problem is not exception-handler related.

Maybe the crashing program does print at least some stack trace, but its output is lost by the parent process. Maybe stdout is not read for some reason by the parent process. The fact that the exit code is 0xc0000005 does not support that theory, but ...

Alex

aclements commented 4 years ago

It seems this became much more common in early April.

$ greplogs -dashboard -E "exit status 3221225477" -l | findflakes -paths
First observed ea7126f 10 Apr 16:24 2020 (455 commits ago)
Last observed  364a05e 14 May 15:18 2020 (2 commits ago)
76% chance failure is still happening
13% failure probability (59 of 454 commits)
Likely culprits:
   13% ea7126f cmd/compile: use a Sym type instead of interface{} for symbolic offsets
   11% 782fcb4 regexp: add (*Regexp).SubexpIndex
   10% 245409e runtime: replace the type of netpollWakeSig from a uintptr to a uint32
    8% 7b90c1c os/exec: extend grace period in TestExtraFiles to 20% of overall deadline
    7% 28157b3 cmd/compile: start implementing strongly typed aux and auxint fields
    6% a1550d3 cmd/compile: use isel with variable shifts on ppc64x
    6% 7c0ee11 test: deflaking measures for runtime gdb test
    5% b191c60 cmd/oldlink: delete darwin/386 host object loading support
    4% a3ef804 cmd/internal/obj/ppc64: leverage AND operation to calculate remainder
    4% c1f0eda net: convert many Close tests to use parallel subtests
Past failures:
  8a958bb 29 Apr 16:24 2018 to ad4e637 04 Jul 00:15 2018
    0.14% failure probability (2 of 695 commits)
  08c4348 10 Apr 18:40 2015 (isolated failure)
toothrot commented 4 years ago

Hello! This is one of the few remaining issues blocking the Beta release of Go 1.15. We'll need to make a decision on this in the next week in order to keep our release on schedule.

aclements commented 4 years ago

I'm working on bisecting this (which is a very slow process; given the 13% failure probability, it takes 34 successful runs to drive the chance of a missed repro under 1%). I have been able to reproduce it as far back as da8591b61c141ca58e4b3eae382d2827006344fd, which is 32 commits earlier than the first observation on the dashboard (ea7126fe141879e1065d196570c078fbec09f3b6).
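The 34-run figure follows from requiring (1 - 0.13)^n < 0.01, i.e. less than a 1% chance that a buggy commit survives n clean runs; a quick check of the arithmetic:

package main

import (
	"fmt"
	"math"
)

func main() {
	p := 0.13 // estimated per-run failure probability
	// Smallest n such that (1-p)^n < 0.01.
	n := math.Ceil(math.Log(0.01) / math.Log(1-p))
	fmt.Println(n) // 34
}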

aclements commented 4 years ago

Reproduced at e31d741801514f2a008625fd246644d2214f4516, which is 64 commits before the first observation on the dashboard.

Given the failure probability and the clear cut-off on the dashboard, I'm starting to wonder if something changed on the builder, rather than in the tree. Assuming it had the same 13% failure probability before the first dashboard observation, there's only a (1-0.13)**64 = 0.01% chance it wouldn't have appeared somewhere in those 64 commits.

josharian commented 4 years ago

For some reason (I don’t recall why), when I first looked at this in https://github.com/golang/go/issues/38440#issuecomment-614102077, I thought the first occurrence was on 1/9/2020. Have you looked around that date range?

aclements commented 4 years ago

This is definitely not happening at tag go1.14.3.

@josharian, I haven't looked around that range yet. Do you have any recollection of where 1/9/2020 came from? :)

aclements commented 4 years ago

I should note that go1.14.3 included CL 213837, which was your hypothesized culprit, but I wasn't able to reproduce at that tag (I even accidentally ran a lot more iterations than I meant to!).

josharian commented 4 years ago

Ah, I think it's because of https://github.com/golang/go/issues/37360#issuecomment-614060591, which I mentally transposed into this issue. They may be related, in that they are both weird exit statuses on Windows.

aclements commented 4 years ago

For lack of a better idea, I'm now running git bisect start da8591b61c141ca58e4b3eae382d2827006344fd go1.14.3. My reproduction steps are

VM=$(gomote create windows-amd64-longtest)
gomote push $VM
timeout 45m gomote run -path '$PATH,$WORKDIR/go/bin' -e GO_TEST_SHORT=0 -e GO_TEST_TIMEOUT_SCALE=5 $VM go/src/all.bat

I've wrapped this in a horrible shell script that considers "exit status 3221225477" to be a failure, "ALL TESTS PASSED" to be a success, and anything else to be a flake. And I'm running it across five gomotes so it doesn't take an eternity and a half.

aclements commented 4 years ago

Almost there. Looking at the remaining commits, my guess is it's 4ec5e7c50fb2d7fb158578e620e026779c3aafe3, but I'll see the bisect through and do some (even longer!) verification runs.

For the record, here's my current bisect log:

# bad: [da8591b61c141ca58e4b3eae382d2827006344fd] all: remove darwin/arm build-tags and files
# good: [f296b7a6f045325a230f77e9bda1470b1270f817] [release-branch.go1.14] go1.14.3
git bisect start 'da8591b61c141ca58e4b3eae382d2827006344fd' 'go1.14.3'
# good: [1cd724acb6304d30d8998d14a5469fbab24dd3b1] doc/go1.14: highlight the addition of hash/maphash package
git bisect good 1cd724acb6304d30d8998d14a5469fbab24dd3b1
# good: [14d20dc4f38dda548f6991cc687bc71b250b1603] cmd/go: replace TestCgoDependsOnSyscall with a simpler script test
git bisect good 14d20dc4f38dda548f6991cc687bc71b250b1603
# good: [9baafabac9a84813a336f068862207d2bb06d255] crypto/rsa: refactor RSA-PSS signing and verification
git bisect good 9baafabac9a84813a336f068862207d2bb06d255
# good: [0e06775eebf6d357f604a55c803aab5cc8054e67] [dev.link] cmd/link: don't truncate external relocation type
git bisect good 0e06775eebf6d357f604a55c803aab5cc8054e67
# good: [1cb582fe026d22cc886634f49ec0be63bcca911f] [dev.link] cmd/link: fix end-of-block padding (again)
git bisect good 1cb582fe026d22cc886634f49ec0be63bcca911f
# good: [1cb582fe026d22cc886634f49ec0be63bcca911f] [dev.link] cmd/link: fix end-of-block padding (again)
git bisect good 1cb582fe026d22cc886634f49ec0be63bcca911f
# bad: [fff7509d472778cae5e652dbe2479929c666c24f] cmd/compile: add intrinsic HasCPUFeature for checking cpu features
git bisect bad fff7509d472778cae5e652dbe2479929c666c24f
# bad: [888a0c8ef6afb752aafd147eda40d62796d87cb3] testing: add TB.TempDir
git bisect bad 888a0c8ef6afb752aafd147eda40d62796d87cb3
# bad: [cc3039f818a372274180834385718d2dfccb1a26] [dev.link] cmd/internal/goobj2: remove old-style accessors
git bisect bad cc3039f818a372274180834385718d2dfccb1a26
zx2c4 commented 4 years ago

That's an interesting theory...

-   fmap, err := syscall.CreateFileMapping(syscall.Handle(out.f.Fd()), nil, syscall.PAGE_READONLY, high, low, nil)
+   fmap, err := syscall.CreateFileMapping(syscall.Handle(out.f.Fd()), nil, syscall.PAGE_READWRITE, high, low, nil)

I wouldn't expect access violations when adding more permissions to a mapping...if we were still partying like it's 1999. But with DEP, we could be hitting that if we're trying to execute code out of those pages. I haven't looked to see what the linker is using this area for at all, but it might be worth trying to change that to PAGE_EXECUTE_READWRITE, and see if that makes a difference. If it does, then we can go back and revisit the situation in general.

aclements commented 4 years ago

I don't recall the details, but I believe the mapping actually never worked before that change, and we would silently fall back to reading the object file into the heap. So it's not that we added more permissions, but that we went from not using a mapping to using one. @cherrymui, do you recall?

cherrymui commented 4 years ago

I don't recall the details, but I believe the mapping actually never worked before

Yes, exactly. Before that CL the mapping actually failed and we just fell back to file I/O.

Actually, I suspect this CL as well. Yesterday I spent some time re-reading the source code and some documents on MSDN, but I couldn't find anything really wrong. Since I don't know how to reliably test this, I didn't try anything myself. @aclements, if you can reliably test this, you could try disabling mmap on Windows and see whether the failures stop.

One place that looks suspicious to me is that we close the file mapping handle while still using the mapping: https://tip.golang.org/src/cmd/link/internal/ld/outbuf_windows.go#L24. But from reading the documentation it's not clear to me whether that is actually a problem.
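For orientation, the pattern under suspicion looks roughly like this (a simplified sketch of the linker's Windows mapping path, not the actual outbuf_windows.go source; package and function names are mine):

//go:build windows

package outbuf

import (
	"os"
	"syscall"
	"unsafe"
)

// mmapOutput maps the linker's output file read-write. Note that the
// file-mapping handle is closed while the returned view is still in
// use; CL 235638 moves that CloseHandle to unmap time instead.
func mmapOutput(f *os.File, size int64) ([]byte, error) {
	low, high := uint32(size), uint32(size>>32)
	fmap, err := syscall.CreateFileMapping(syscall.Handle(f.Fd()), nil, syscall.PAGE_READWRITE, high, low, nil)
	if err != nil {
		return nil, err
	}
	defer syscall.CloseHandle(fmap) // the suspicious early close
	ptr, err := syscall.MapViewOfFile(fmap, syscall.FILE_MAP_READ|syscall.FILE_MAP_WRITE, 0, 0, uintptr(size))
	if err != nil {
		return nil, err
	}
	return unsafe.Slice((*byte)(unsafe.Pointer(ptr)), size), nil
}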

cherrymui commented 4 years ago

@zx2c4 this is the linker writing the output binary. It doesn't execute anything in the mapping. (But likely a subsequent command will, in a separate process.)

gopherbot commented 4 years ago

Change https://golang.org/cl/235637 mentions this issue: cmd/link: disable mmap on Windows for now

gopherbot commented 4 years ago

Change https://golang.org/cl/235638 mentions this issue: cmd/link: close file mapping handle after unmapping

cherrymui commented 4 years ago

@aclements if you have a way to test this reliably, does either of the CLs above help? Thanks.

gopherbot commented 4 years ago

Change https://golang.org/cl/235639 mentions this issue: cmd/link: flush file mapping before unmapping

cherrymui commented 4 years ago

Also, does flushing the mapping help? Thanks...

aclements commented 4 years ago

Bisect is done and implicates ff29f97437765abb4dd47a239c3cca274175d5e0, which is the commit immediately after the suspected culprit, 4ec5e7c50fb2d7fb158578e620e026779c3aafe3. I ran 4ec5e7c50fb2d7fb158578e620e026779c3aafe3 65 times with no failures, so either it's not triggering for some unclear reason, or the actual bug is a change in the object decoder that caused reads to run off the end of the mapping.

I'll try running with those two CLs. If it is actually a bounds bug in the decoder, that may mask it, or turn it into a Go bounds panic.

Here's my full bisect log:

# bad: [da8591b61c141ca58e4b3eae382d2827006344fd] all: remove darwin/arm build-tags and files
# good: [f296b7a6f045325a230f77e9bda1470b1270f817] [release-branch.go1.14] go1.14.3
git bisect start 'da8591b61c141ca58e4b3eae382d2827006344fd' 'go1.14.3'
# good: [1cd724acb6304d30d8998d14a5469fbab24dd3b1] doc/go1.14: highlight the addition of hash/maphash package
git bisect good 1cd724acb6304d30d8998d14a5469fbab24dd3b1
# good: [14d20dc4f38dda548f6991cc687bc71b250b1603] cmd/go: replace TestCgoDependsOnSyscall with a simpler script test
git bisect good 14d20dc4f38dda548f6991cc687bc71b250b1603
# good: [9baafabac9a84813a336f068862207d2bb06d255] crypto/rsa: refactor RSA-PSS signing and verification
git bisect good 9baafabac9a84813a336f068862207d2bb06d255
# good: [0e06775eebf6d357f604a55c803aab5cc8054e67] [dev.link] cmd/link: don't truncate external relocation type
git bisect good 0e06775eebf6d357f604a55c803aab5cc8054e67
# good: [1cb582fe026d22cc886634f49ec0be63bcca911f] [dev.link] cmd/link: fix end-of-block padding (again)
git bisect good 1cb582fe026d22cc886634f49ec0be63bcca911f
# good: [1cb582fe026d22cc886634f49ec0be63bcca911f] [dev.link] cmd/link: fix end-of-block padding (again)
git bisect good 1cb582fe026d22cc886634f49ec0be63bcca911f
# bad: [fff7509d472778cae5e652dbe2479929c666c24f] cmd/compile: add intrinsic HasCPUFeature for checking cpu features
git bisect bad fff7509d472778cae5e652dbe2479929c666c24f
# bad: [888a0c8ef6afb752aafd147eda40d62796d87cb3] testing: add TB.TempDir
git bisect bad 888a0c8ef6afb752aafd147eda40d62796d87cb3
# bad: [cc3039f818a372274180834385718d2dfccb1a26] [dev.link] cmd/internal/goobj2: remove old-style accessors
git bisect bad cc3039f818a372274180834385718d2dfccb1a26
# bad: [6dab0942a43039b8dea365ba8629cdc519cfb143] [dev.link] cmd/internal/goobj2, cmd/link: add and use new Sym accessors
git bisect bad 6dab0942a43039b8dea365ba8629cdc519cfb143
# bad: [ff29f97437765abb4dd47a239c3cca274175d5e0] [dev.link] cmd/internal/goobj2: change StringRef encoding for better locality
git bisect bad ff29f97437765abb4dd47a239c3cca274175d5e0
# good: [4ec5e7c50fb2d7fb158578e620e026779c3aafe3] [dev.link] cmd/link: fix mmapping OutBuf on windows
git bisect good 4ec5e7c50fb2d7fb158578e620e026779c3aafe3
# first bad commit: [ff29f97437765abb4dd47a239c3cca274175d5e0] [dev.link] cmd/internal/goobj2: change StringRef encoding for better locality
cherrymui commented 4 years ago

That's interesting. If it were some bounds bug in the decoder, I would expect the failure to be more deterministic.

bcmills commented 4 years ago

@cherrymui, maybe this is another one of those page-alignment bugs, where the failure only triggers if the bound happens to be aligned to a page boundary?
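That would fit how mappings behave: the OS maps whole pages, so a read slightly past the logical end of the file lands in harmless slack unless the file size is exactly page-aligned. A small illustration (names are mine, not from the linker):

package main

import "fmt"

// slackBytes is the number of silently readable bytes between the end
// of a mapped file and the end of its last page. An off-the-end read
// only faults when this is zero, i.e. when size is page-aligned.
func slackBytes(size, pageSize int64) int64 {
	return (pageSize - size%pageSize) % pageSize
}

func main() {
	fmt.Println(slackBytes(8192, 4096)) // 0: page-aligned, an overread faults
	fmt.Println(slackBytes(8000, 4096)) // 192: slack absorbs small overreads
}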

cherrymui commented 4 years ago

Do we know which process is failing? It typically fails at "go run" or "go test". It could be the go command, the compiler, the linker, or the generated executable that fails. If it is some bounds bug, I would guess the linker fails. If there is something weird with the file mapping, it could be either the linker or the executable.

cherrymui commented 4 years ago

Given that it typically fails when there are a large number of child processes running in parallel, my guess is that it is more likely the generated executable.

ianlancetaylor commented 4 years ago

It fails when there are lot of tests running in parallel, where each test involves building a Go program and then in some cases running it. As far as I can see this could be anywhere in the toolchain: cmd/go, cmd/compile, cmd/link, the final executable.

bcmills commented 4 years ago

I don't think the failure messages are consistent with the final test executable failing. If that were the case, cmd/go would typically print a final FAIL line:

$ cat > main_test.go
package main

import (
        "os"
        "testing"
)

func TestExitNonzero(t *testing.T) {
        os.Exit(0xC0000005)
}

$ go test ./main_test.go
exit status 5
FAIL    command-line-arguments  0.019s
FAIL
aclements commented 4 years ago

CL 235637 (disable mmap on Windows for now) does fix the issue. I was unable to reproduce after 35 runs.

CL 235638 (close file mapping handle after unmapping) didn't help. I was still able to reproduce the issue.

ianlancetaylor commented 4 years ago

Might be worth trying calling FlushViewOfFile to see if that makes a difference.

aclements commented 4 years ago

That's CL 235639, which I'm running right now.

zx2c4 commented 4 years ago

Might be worth trying calling FlushViewOfFile to see if that makes a difference.

It will be interesting if that works. MapViewOfFile isn't consistent with ordinary read/write file I/O until you call FlushViewOfFile (although I would expect unmapping to do a flush internally). If the linker needs to call that before the binary is subsequently executed, it would indicate that NT's PE loader uses a file-loading mechanism that also becomes inconsistent with MapViewOfFile until FlushViewOfFile is called. That would surprise me, because executable image loaders since forever have just mmap'd the various sections of the binary, sharing pages across instances. On the other hand, there's also Windows file locking, which I could imagine interacts strangely in some edge cases.

Either way, anxiously awaiting your results, @aclements.

alexbrainman commented 4 years ago

Do we know which process is failing?

I don't think we know.

We can see "exit status 0xC0000005" in the error output. That message is the parent process reporting the child process's failure. I wonder if we could add debug code to the parent process to print more information there, such as the command name and argument list. We could extend whatever os package structures are required to include that information in the exit string, just for debugging this.
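A minimal sketch of that idea, assuming a hypothetical wrapper in the parent process (runVerbose is not an existing os/exec API):

package main

import (
	"fmt"
	"os/exec"
)

// runVerbose runs cmd and, on failure, folds the command line and the
// raw exit code into the error, so a bare "exit status 0xC0000005"
// would at least identify which child died. Purely a debugging aid.
func runVerbose(cmd *exec.Cmd) error {
	err := cmd.Run()
	if ee, ok := err.(*exec.ExitError); ok {
		return fmt.Errorf("%v: exit status %#x (%w)", cmd.Args, uint32(ee.ExitCode()), err)
	}
	return err
}

func main() {
	if err := runVerbose(exec.Command("go", "test", "./...")); err != nil {
		fmt.Println(err)
	}
}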

Sorry I don't have time to try it now.

Alex

aclements commented 4 years ago

CL 235639 (close file mapping handle after unmapping + flush file mapping before unmapping) did not fail after 35 runs. I'm starting a second run to give it more bake time.

For my reference, the command I'm using:

gopool2 create -setup 'gomote push $VM' windows-amd64-longtest 
stress2 -timeout 45m -p 7 -max-runs 35 -max-total-runs 100 -pass "ALL TESTS PASSED" -fail "exit status 3221225477" gopool2 run 'gomote run -e GO_TEST_SHORT=0 -e GO_TEST_TIMEOUT_SCALE=5 $VM go/src/all.bat'
networkimprov commented 4 years ago

cc @jstarks for any insights...

aclements commented 4 years ago

32 more runs of CL 235639 (close file mapping handle after unmapping + flush file mapping before unmapping) with no failures (then my gomote pool died).

I would certainly like to understand this better, but for now I think we can land this CL to keep the beta on track. I'm going to do a run with just CL 235639 (flushing without the late close) to isolate things a little better.

cherrymui commented 4 years ago

Thanks, @aclements! Rebased CL 235639 so it is not on top of the other CL.