golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

runtime: "go test -short -cpu=1,2,4" command takes too long #13645

Closed alexbrainman closed 8 years ago

alexbrainman commented 8 years ago

Build fails sometime http://build.golang.org/log/0f5d58c2d0ba874ec4e387ad5a6cc9c6972b6aa0

##### GOMAXPROCS=2 runtime -cpu=1,2,4
panic: test timed out after 10m0s

goroutine 208730 [running]:
testing.startAlarm.func1()
    C:/workdir/go/src/testing/testing.go:717 +0x139
created by time.goFunc
    C:/workdir/go/src/time/sleep.go:129 +0x41
FAIL    runtime 600.049s
2015/12/16 20:45:49 Failed: exit status 1

Alex

bradfitz commented 8 years ago

I think @aclements is on this.

aclements commented 8 years ago

Possible dup of #13642.

alexbrainman commented 8 years ago

Also, maybe "go test -short -cpu=1,2,4" DOES take too long. Maybe we should create a CL (with some failure in it) that runs "go test -short -cpu=1,2,4 -v" instead of "go test -short -cpu=1,2,4", to see if some runtime test is too long.

Alex

aclements commented 8 years ago

Sometimes it does take a long time. Just using the failure logs we already have, here are the last 50 (successful) runs that took over 120s. The worst amd64 time is only 145s, though, which suggests the 10 minute timeout is a freeze.

Build runtime -cpu=1,2,4 time
2015-11-16T22:07:41-4d39bb6/linux-ppc64le-canonical 128.058s
2015-11-17T02:20:02-095710b/linux-arm-arm5 417.964s
2015-11-17T02:20:12-2cfde41/linux-arm-arm5 416.192s
2015-11-17T02:20:21-a897684/linux-arm-arm5 417.765s
2015-11-17T02:20:29-6a18122/linux-arm-arm5 411.677s
2015-11-17T02:20:49-36fe6f2/linux-arm-arm5 410.979s
2015-11-17T02:24:49-a8e839b/linux-arm-arm5 408.966s
2015-11-17T02:25:04-e493309/linux-arm-arm5 414.993s
2015-11-17T02:25:11-a10b270/linux-arm-arm5 415.182s
2015-11-18T19:00:45-6bf794a/plan9-386-ducolombier 147.368s
2015-11-20T19:53:41-2a6c773/linux-arm-arm5 418.586s
2015-11-20T19:55:27-7ab4cba/linux-arm-arm5 411.762s
2015-11-20T19:55:36-08ea825/linux-arm-arm5 416.970s
2015-11-23T01:13:04-29a22ab/linux-arm-arm5 423.353s
2015-11-23T01:13:14-0c64370/linux-arm-arm5 410.002s
2015-11-23T01:13:20-aca4fa5/linux-arm-arm5 416.488s
2015-11-23T02:01:20-61e1cae/linux-arm-arm5 413.792s
2015-11-23T02:01:28-2246923/freebsd-amd64-gce101 145.331s
2015-11-23T02:01:28-2246923/linux-arm-arm5 411.732s
2015-11-23T03:10:30-2876020/linux-arm-arm5 411.326s
2015-11-23T04:38:50-bebbc62/linux-arm-arm5 412.200s
2015-11-23T15:16:54-3facf7e/linux-arm-arm5 403.954s
2015-11-24T01:33:14-9f9b95b/linux-arm-arm5 417.509s
2015-11-24T02:18:46-6b26788/linux-arm-arm5 414.911s
2015-11-24T04:33:05-8ad1dc4/linux-arm-arm5 428.638s
2015-11-25T00:46:57-3583a44/linux-ppc64le-canonical 134.865s
2015-12-03T00:18:53-377a8a8/linux-arm-arm5 410.395s
2015-12-03T00:31:00-3cca924/linux-arm-arm5 410.955s
2015-12-03T00:44:15-40cbf58/linux-arm-arm5 421.949s
2015-12-04T01:00:32-5ff82f8/linux-arm-arm5 426.318s
2015-12-05T04:55:02-f25f6ea/freebsd-amd64-gce101 133.447s
2015-12-05T04:55:02-f25f6ea/linux-arm 276.154s
2015-12-05T04:55:02-f25f6ea/linux-arm-arm5 370.523s
2015-12-05T04:55:02-f25f6ea/linux-mips64le-cherry 468.997s
2015-12-05T04:55:02-f25f6ea/linux-ppc64le-canonical 122.623s
2015-12-05T05:46:48-336c998/linux-arm 282.040s
2015-12-05T05:46:48-336c998/linux-mips64le-cherry 528.227s
2015-12-05T05:46:48-336c998/linux-ppc64le-canonical 131.049s
2015-12-05T06:19:47-d2ca451/freebsd-amd64-gce101 126.806s
2015-12-05T06:19:47-d2ca451/linux-arm 268.973s
2015-12-05T06:19:47-d2ca451/linux-arm-arm5 371.975s
2015-12-05T06:19:47-d2ca451/linux-mips64le-cherry 528.252s
2015-12-05T06:19:47-d2ca451/linux-mips64-minux 839.060s
2015-12-05T06:26:17-5b1a8a5/linux-arm 275.702s
2015-12-05T06:26:17-5b1a8a5/linux-arm-arm5 374.833s
2015-12-05T06:26:17-5b1a8a5/linux-mips64le-cherry 528.136s
2015-12-05T06:26:17-5b1a8a5/linux-mips64-minux 834.156s
2015-12-05T06:26:17-5b1a8a5/linux-ppc64le-canonical 132.242s
2015-12-07T22:07:12-e89d3b4/linux-arm-arm5 414.790s
2015-12-16T05:42:40-86f1944/openbsd-386-gce56 123.219s

alexbrainman commented 8 years ago

That is fine, @aclements. I will let you deal with this.

Alex

rsc commented 8 years ago

Let's assume dup of #13642. If you want to try to reproduce with GOTRACEBACK=crash and get full output, that might be helpful. Post it there. Thanks.

Edit: #13642 looks like operator error (mine). Since this issue has more reliable data, let's move back to this one.

mwhudson commented 8 years ago

I've seen this recently too and reported #13632. And I think I saw it on a builder today -- so I think there is a problem here.

aclements commented 8 years ago

Some recent samples from try bots:

net/http: document ResponseWriter and Handler more; add test

Patch Set 1:

This change failed on windows-386-gce:
See https://storage.googleapis.com/go-build-log/ec63595c/windows-386-gce_8a484fbf.log

net: fix the build even harder

Patch Set 1:

This change failed on linux-amd64-race:
See https://storage.googleapis.com/go-build-log/f0e334d2/linux-amd64-race_7bc08295.log

alexbrainman commented 8 years ago

Another crash, but with more stack trace: cmd/dist: skip all nolibgcc tests on windows

Patch Set 1: TryBot-Result-1

1 of 15 TryBots failed: Failed on windows-amd64-gce: https://storage.googleapis.com/go-build-log/6f64c8df/windows-amd64-gce_7c21e23b.log

Alex

aclements commented 8 years ago

I was hoping this would be fixed by 8d5ff2e, which fixed a bug in the StackOverflow test, but it's still happening.

alexbrainman commented 8 years ago

I don't see the original crash anymore. But there is a new failure happening: http://build.golang.org/log/9097c870bd4716e62c87a18648e0bd9da7061abc

panic: test timed out after 6m0s

goroutine 103795 [running]:
testing.startAlarm.func1()
    C:/workdir/go/src/testing/testing.go:719 +0x152
created by time.goFunc
    C:/workdir/go/src/time/sleep.go:129 +0x41

goroutine 1 [chan receive, 5 minutes]:
testing.RunTests(0x6f4fb0, 0x7d4300, 0xb3, 0xb3, 0x7bbf01)
    C:/workdir/go/src/testing/testing.go:577 +0x8c0
testing.(*M).Run(0xc082081ef8, 0x6f5638)
    C:/workdir/go/src/testing/testing.go:509 +0x88
runtime_test.TestMain(0xc082081ef8)
    C:/workdir/go/src/runtime/crash_test.go:24 +0x2f
main.main()
    runtime/_test/_testmain.go:956 +0x11b

goroutine 102753 [chan receive, 5 minutes]:
testing.(*T).Parallel(0xc08204ea20)
    C:/workdir/go/src/testing/testing.go:435 +0xc4
runtime_test.TestStackGrowth(0xc08204ea20)
    C:/workdir/go/src/runtime/stack_test.go:74 +0x2c
testing.tRunner(0xc08204ea20, 0x7d5008)
    C:/workdir/go/src/testing/testing.go:467 +0x9f
created by testing.RunTests
    C:/workdir/go/src/testing/testing.go:576 +0x880

goroutine 102755 [chan receive, 5 minutes]:
testing.(*T).Parallel(0xc08204eab0)
    C:/workdir/go/src/testing/testing.go:435 +0xc4
runtime_test.TestStackGrowthCallback(0xc08204eab0)
    C:/workdir/go/src/runtime/stack_test.go:157 +0x2c
testing.tRunner(0xc08204eab0, 0x7d5020)
    C:/workdir/go/src/testing/testing.go:467 +0x9f
created by testing.RunTests
    C:/workdir/go/src/testing/testing.go:576 +0x880

goroutine 102754 [chan receive, 5 minutes]:
testing.RunTests.func1(0xc0820282a0, 0xc08204ea20)
    C:/workdir/go/src/testing/testing.go:580 +0x4e
created by testing.RunTests
    C:/workdir/go/src/testing/testing.go:581 +0x921

goroutine 103793 [runnable, locked to thread]:
syscall.Syscall(0x76e75e28, 0x2, 0x464045, 0x6f6118, 0x0, 0x4110f9, 0x10, 0x660240)
    C:/workdir/go/src/runtime/syscall_windows.go:128 +0x5c
syscall.(*Proc).Call(0xc0820ea080, 0xc0820e1e80, 0x2, 0x2, 0xc0820e1e98, 0x2, 0x0, 0x0)
    C:/workdir/go/src/syscall/dll_windows.go:132 +0x391
runtime_test.nestedCall(0xc08204e1b0, 0x6f6118)
    C:/workdir/go/src/runtime/syscall_windows_test.go:166 +0x144
runtime_test.TestCallbackPanic(0xc08204e1b0)
    C:/workdir/go/src/runtime/syscall_windows_test.go:221 +0x14e
runtime_test.TestCallbackPanicLoop(0xc08204e1b0)
    C:/workdir/go/src/runtime/syscall_windows_test.go:228 +0x37
testing.tRunner(0xc08204e1b0, 0x7d52c0)
    C:/workdir/go/src/testing/testing.go:467 +0x9f
created by testing.RunTests
    C:/workdir/go/src/testing/testing.go:576 +0x880

goroutine 102756 [chan receive, 5 minutes]:
testing.RunTests.func1(0xc0820282a0, 0xc08204eab0)
    C:/workdir/go/src/testing/testing.go:580 +0x4e
created by testing.RunTests
    C:/workdir/go/src/testing/testing.go:581 +0x921
FAIL    runtime 360.077s
2016/01/04 19:34:39 Failed: exit status 1

That looks similar.

Alex

aclements commented 8 years ago

I could be wrong, but my current hunch is that the runtime test is a red herring. I think t.Parallel() may be broken. There are only three places where we use it in the standard library: runtime, net, and net/http. Looking through recent failures, these are the packages we see timing out, and they time out with the tests marked t.Parallel() waiting on the channel that tells them to go.
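
For reference, this is the pattern those stuck goroutines are sitting in: a test that calls t.Parallel() blocks inside that call until the sequential phase ends and the test harness signals it to continue, which matches the "chan receive" tracebacks above. A minimal sketch (TestExampleParallel is a made-up name, not one of the real tests):

// example_parallel_test.go
package example

import (
	"testing"
	"time"
)

// TestExampleParallel shows the t.Parallel pattern. The call to t.Parallel
// parks on an internal channel ("chan receive") until testing releases the
// parallel tests; the timed-out goroutines in the logs are parked there.
func TestExampleParallel(t *testing.T) {
	t.Parallel() // blocks here until the sequential tests finish
	time.Sleep(10 * time.Millisecond)
}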

aclements commented 8 years ago
  1. 2016-01-13T14:02:22-ceeb52d/linux-arm64-linaro: TestStackCache is stuck (non-Parallel test). Lots of runnable goroutines at the beginning of runtime_test.growing. Maybe something wrong with stack growth? TestFutexsleep is done, but two goroutines spawned from it are still running (presumably in the futex sleep). TestStackGrowth and TestStackGrowthCallback are waiting for the parallel phase.
  2. 2016-01-13T14:02:22-ceeb52d/linux-arm64-canonical: RunTests is in fmtDuration -> fmt.Sprintf -> prologue of fmt.(*pp).doPrintf in state runnable. Maybe also stuck growing the stack? TestFutexsleep, TestStackGrowth and TestStackGrowthCallback are identical to failure 1.
  3. 2016-01-13T04:49:19-4ffba76/linux-amd64-nocgo: Regular runtime test. No TestFutexsleep (the test would have been skipped). Otherwise identical to failure 2.
  4. 2016-01-12T17:31:01-505fa7b/linux-ppc64le-canonical: RunTests is in fmtDuration in "GC assist wait". Otherwise identical to failure 3.
  5. 2016-01-12T02:00:48-352e287/linux-amd64-sid: Identical to failure 1.
  6. 2016-01-12T01:56:04-1abb863/linux-amd64-clang: Identical to failure 4.

There's also a recent timeout not in the runtime test. It might be unrelated: 2016-01-12T02:00:48-352e287/linux-amd64-race: Timeout in net/http. No obvious culprit; everybody is waiting on everyone else.

aclements commented 8 years ago

The leftover Futexsleep goroutines are unrelated and should be fixed by https://go-review.googlesource.com/18617.

alexbrainman commented 8 years ago

I cannot reproduce the failures we see on the Windows builders on my computers here. I run selected runtime tests in a loop. If you have suggestions on how to break things here, I might be able to help more. Maybe a tiny sleep here or there?

Alex

aclements commented 8 years ago

I wish I did. I spent all day trying to reproduce this on a linux-amd64-nocgo gomote and somehow haven't succeeded. It shows up frequently enough on the dashboard for this builder that I'm rather perplexed.

aclements commented 8 years ago

I ran a go test -short std loop all night on a linux-amd64-nocgo gomote and still no luck reproducing. I'm going to try lifting the loop level to create and tear down the VM itself. Maybe this has something to do with where the VM lands.

VM=$(gomote create linux-amd64-nocgo)
gomote push $VM
gomote run $VM 'go/src/make.bash'
gomote run $VM /bin/sh -c 'PATH=$PWD/go/bin:$PATH; cd go/src; while true; do GOTRACEBACK=crash go test -short std; done'

aclements commented 8 years ago

I've now run over 200 cycles of create VM/go test std/destroy VM and still haven't reproduced this.

aclements commented 8 years ago

2016-01-12T02:00:48-352e287/linux-amd64-sid: Lots of runnable goroutines in the prologue of runtime_test.growing
2016-01-12T17:31:01-505fa7b/linux-ppc64le-canonical: GC assist wait in testing.fmtDuration:256 +0x108 (call to runtime.convT2E with a nil x, so it allocates)
2016-01-13T04:49:19-4ffba76/linux-amd64-nocgo: Runnable in fmt.(*pp).doPrintf prologue
2016-01-13T14:02:22-ceeb52d/linux-arm64-canonical: Runnable in fmt.(*pp).doPrintf prologue
2016-01-13T14:02:22-ceeb52d/linux-arm64-linaro: Lots of runnable in runtime_test.growing prologue
2016-01-13T18:22:21-b2cf5e7/dragonfly-amd64: Runnable in runtime_test.growStackIter prologue and runtime_test.TestStackGrowth.func3:111 +0x287 (call to runtime.selectgo, which can allocate sudogs)
2016-01-13T18:39:15-9094e3a/linux-amd64-sid: Runnable in regexp.(*Regexp).doExecute prologue
2016-01-13T19:06:28-dd6753a/windows-amd64-gce: Runnable in syscall.loadlibrary:98 +0x66 (call to runtime.cgocall)
2016-01-13T19:06:40-4a0eee2/linux-amd64-nocgo: Lots of runnable in runtime_test.growing prologue
2016-01-13T19:06:40-4a0eee2/linux-ppc64le-canonical: Runnable in testing.fmtDuration:256 +0x108 (call to runtime.convT2E with a nil x, so it allocates)
2016-01-13T19:06:40-4a0eee2/windows-386-gce: Runnable in syscall.Syscall:128 +0x4a (call to runtime.cgocall)
2016-01-13T19:10:31-00c0a3a/linux-amd64: Lots of runnable in runtime_test.growing prologue
2016-01-13T23:07:40-8c9ef9d/windows-amd64-gce: Runnable in syscall.Syscall:128 +0x5c (call to runtime.cgocall)
2016-01-14T02:04:50-5f23bc8/windows-amd64-gce: Runnable in syscall.Syscall:128 +0x5c (call to runtime.cgocall)
2016-01-14T02:05:31-535741a/windows-386-gce: Runnable in syscall.Syscall:128 +0x4a (call to runtime.cgocall)
2016-01-14T03:08:30-353ee32/windows-amd64-gce: Runnable in syscall.getprocaddress:113 +0x66 (call to runtime.cgocall)
2016-01-14T03:16:20-b73d8fb/linux-amd64-noopt: Lots of runnable in runtime_test.growing prologue
2016-01-14T04:01:47-f36ee8c/linux-amd64-noopt: Runnable in fmt.glob.func1:128 +0x2f (call to runtime.newobject)
2016-01-14T15:23:41-38ca046/linux-arm64-linaro: GC assist wait in fmt.glob.func1:128 +0x2c (call to runtime.newobject)

I'm not sure what's happening on Windows, but on the other platforms every single timeout has a goroutine in state "running" that's in a context where it could be allocating, or it clearly is trying to allocate and is blocked in an assist.

gopherbot commented 8 years ago

CL https://golang.org/cl/18648 mentions this issue.

alexbrainman commented 8 years ago

I'm not sure what's happening on Windows, ...

Do not forget that all windows failures we see are around Windows callbacks. There is a lot of unsafe (and asm) code there. I wouldn't be surprised if some recent runtime changes affected that.

Alex

aclements commented 8 years ago

I looked at Windows a little more closely and those failures are also in a context that might be allocating (because of the defer statement in cgocall).

Hopefully we'll have much more information once a few more failures show up on the dashboard with 1556c31 applied.

aclements commented 8 years ago

I managed to reproduce this by repeatedly running the trybots on https://go-review.googlesource.com/#/c/18646/ and https://go-review.googlesource.com/#/c/18680/. These show runtime frames. Unfortunately, for reasons I don't understand, this didn't traceback at "crash" level, so some tracebacks are still hidden.

Some interesting cases:

https://storage.googleapis.com/go-build-log/390e0725/windows-amd64-gce_4122b0f2.log: Shows a cgocall mallocing and assisting and getting stuck in the assist. Interestingly, the goroutine is shown as "runnable", not "GC assist wait", which indicates we woke the goroutine but it never actually ran again. My best guess is that there's an issue with us putting that goroutine on the global run queue and never pulling it off the global run queue. However, the background sweeper is also runnable and in Gosched, which indicates a few things: GC is definitely over, at least some scheduling happened after GC started the world again, and there's almost certainly a G on a local run queue (edit: this is not correct; Gosched puts it on the global run queue). However, it also looks like the system has been wedged for at least five minutes, which is more than enough time for the background sweeper to finish, suggesting that the scheduler lost track of both the background sweeper and the assist. The scheduler isn't completely wedged, though, because it ran the timeout goroutine.

https://storage.googleapis.com/go-build-log/763ec68b/windows-amd64-gce_7689e79d.log: A runtime test timeout, but probably unrelated to the GC problem since GC appears to be completely idle and there are no assists running.

https://storage.googleapis.com/go-build-log/7647f2db/windows-amd64-gce_e5e4abda.log: Same.

https://storage.googleapis.com/go-build-log/7d903bfa/linux-amd64_d3fa644c.log: User goroutine is stuck in the GC assist wait park. There's a running background mark worker, which, unfortunately, didn't show a traceback. The stuck mark worker is presumably why the system isn't making progress.

I've started the same cycle on https://go-review.googlesource.com/18681 but with the traceback level set more aggressively to crash. I'll probably add some scheduler debugging to it.

gopherbot commented 8 years ago

CL https://golang.org/cl/18682 mentions this issue.

aclements commented 8 years ago

Got a reproduction with GOTRACEBACK=crash (but unfortunately not the scheduler debug dump I added because it was killed by the infrastructure, not from within): https://storage.googleapis.com/go-build-log/f7c7c056/darwin-amd64-10_10_795d2e74.log

Goroutine 53 is parked in a GC assist wait and has been for at least 3 minutes. Goroutines 67 and 74 are actively running mark workers. 74 is on M 8 trying to stop the world but is stuck in the loop waiting for other Ms to voluntarily stop. 67 is presumably the one preventing it. Unfortunately, 67 doesn't obviously appear in any of the other thread tracebacks, which is confusing because all of the other threads are clearly at points where they would have switched their Gs out of _Grunning.

aclements commented 8 years ago

Another failure on the dashboard, now with "system" level tracebacks for runtime failures: 2016-01-16T02:40:00-b05f18e/linux-ppc64-minux

None of the goroutines are in assists and the background mark workers are idle. Other than the alarm goroutine, only two goroutines are runnable: the background sweeper is in Gosched and the main test thread is in the prologue of fmt.Sprintf. These are both weird. Based on the other goroutines, there almost certainly hasn't been GC activity for 19 minutes, so the background sweeper has had oodles of time to finish. fmt.Sprintf could have stopped either because it needed to grow its stack or because of a preemption. I don't think it needed to grow its stack because the calling frame was already past the first doubling boundary (the stack size in the caller is 0xc820037f80 - 0xc820037910 = 1648 bytes and the first boundary should be 2048 - 720 + 128 = 1456) and it's nowhere near the next boundary (4096 - 720 + 128 = 3504). That means it got preempted, which calls gopreempt_m, which puts it into _Grunnable (consistent with the traceback) and puts it on the global queue.

Since both goroutines are on the global run queue, this again points at some issue with scheduling things from the global queue.
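
As a worked version of the boundary arithmetic above, here is a tiny program that redoes the same numbers. The 720 and 128 constants are the _StackGuard and _StackSmall values implied by the figures in the comment; treat them as assumptions rather than canonical runtime constants:

package main

import "fmt"

func main() {
	const (
		guard = 720 // assumed _StackGuard, per the numbers above
		small = 128 // assumed _StackSmall
	)
	frame := 0xc820037f80 - 0xc820037910 // stack used by the calling frame: 1648 bytes
	for _, size := range []int{2048, 4096} {
		boundary := size - guard + small
		fmt.Printf("stack %4d: growth boundary %d, caller uses %d, past it: %v\n",
			size, boundary, frame, frame > boundary)
	}
	// 1648 is past the 1456 boundary of a 2048-byte stack but well under the
	// 3504 boundary of a 4096-byte one, so fmt.Sprintf did not need to grow
	// the stack; it must have been preempted instead.
}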

aclements commented 8 years ago

In nearly every case, one of the goroutines that is runnable or in "GC assist wait" is also "locked to thread". This happens in Linux on the main testing thread because of the init function in runtime_linux_test.go. It happens on Windows in the test that we see freezing (I'm not sure why; I assume because of the callback). It also happened in the Darwin failure because the wedged test, TestTraceStressStartStop, calls LockOSThread.
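
For illustration, this is the shape of the Linux pattern referenced above: an init function runs on the main goroutine before main (or TestMain), so calling runtime.LockOSThread there leaves the main testing goroutine locked to its OS thread for the rest of the run. A minimal sketch of the idea, not the actual runtime_linux_test.go code:

package main

import (
	"fmt"
	"runtime"
)

func init() {
	// init runs on the main goroutine, so this locks the main goroutine
	// to the main OS thread for the lifetime of the process.
	runtime.LockOSThread()
}

func main() {
	fmt.Println("main goroutine is locked to its OS thread")
}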

aclements commented 8 years ago

Finally a timeout with a snapshot of the scheduler state: https://storage.googleapis.com/go-build-log/8ecdfe49/windows-amd64-gce_e20c47a0.log. GC is not running and has not run for over five minutes. Both P run queues are empty (one is idle, the other is running the alarm goroutine). The two runnable goroutines (background sweeper and a cgocall) are on the global run queue and must have been stuck there for several minutes. The stuck cgocall is "locked to thread".

minux commented 8 years ago

The same runtime -cpu=1,2,4 test also occasionally hangs on the ppc64 builder.

For example, http://build.golang.org/log/8db41f218ad62ec3ba3b8117b521fda2984b0340

rsc commented 8 years ago

So, @aclements, how can that happen? :-)

A few thoughts, perhaps none of them useful.

(1) It looks to me like there are two different p's both with gcMarkWorkerMode == gcMarkWorkerFractionalMode. That seems wrong: I thought there would be only one worker dedicated to being a fraction at any given time.

It looks like the "we lost a race" here would explain having two fractional workers:

decIfPositive := func(ptr *int64) bool {
    if *ptr > 0 {
        if atomic.Xaddint64(ptr, -1) >= 0 {
            return true
        }
        // We lost a race
        atomic.Xaddint64(ptr, +1)
    }
    return false
}

I don't see any actual harm caused by having two fractional workers in the code. Eventually one will die and not be replaced (until the next race). But still it seems scary and could be avoided with a Cas if we understood how it explained real damage.
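
For comparison, a CAS-based version that cannot admit two winners for the same positive value might look like the sketch below. It is written against sync/atomic so it is self-contained; the runtime itself would use its internal atomic package:

package main

import (
	"fmt"
	"sync/atomic"
)

// decIfPositive decrements *ptr only if it is currently positive. The CAS
// loop means two concurrent callers can never both succeed against a value
// of 1, which is the race the add-then-undo version above can lose.
func decIfPositive(ptr *int64) bool {
	for {
		v := atomic.LoadInt64(ptr)
		if v <= 0 {
			return false
		}
		if atomic.CompareAndSwapInt64(ptr, v, v-1) {
			return true
		}
		// *ptr changed between the load and the CAS; retry.
	}
}

func main() {
	n := int64(1)
	fmt.Println(decIfPositive(&n), decIfPositive(&n)) // true false
}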

(2) gcphase is 0. Why are there any fractional workers in that case? The GC that kicked off the fractional workers must have finished to get to gcphase 0. How did it finish with the fractional workers still thinking they should run?

(3) I see assignments that set p.gcMarkWorkerMode but nothing that clears it. Also gcMarkWorkerDedicatedMode == 0, which is a little odd for distinguishing "not a worker" from "a worker". But maybe code only looks at that field if it knows it is a worker, in which case the two p's with fractional mode could both be stale, with neither actually being a worker.

(4) All cgocalls lock the goroutine to the thread for the duration of the call, in case of cgo callbacks. When a goroutine is locked to a thread, that thread bails out of func schedule very early. It does not even reach the code thinking about being a GC worker. It's possible that the thread ran a GC worker earlier, before being locked, but that worker is preempted if we're doing the cgo call. And then the worker is attached to the p, which will be released elsewhere while the cgo call is blocked. And the GC finished.

(5) The stack trace doesn't show us where the cgo call goroutine is blocked. It will show the entersyscall information until gp.syscallsp is cleared after the goroutine is cleared to run. But I don't see any way it can block during entersyscall, so it must be blocked in exitsyscall.

(6) System stacks would be nice. Since this seems to happen mainly on Windows, I wonder if we can adapt os1_windows.go's profileloop1/profilem into something that will give us the same thread stacks we now can get on Unix.

aclements commented 8 years ago

For 1, 2, and 3, you're right that it wouldn't be intentional for multiple fractional workers to be running at the same time, but we never clear gcMarkWorkerMode, so these are just stale values; gcMarkWorkerMode is a one-time communication from the scheduler, at wakeup, of what type of worker a worker should be. We didn't even have to lose the decIfPositive race: one fractional worker could have parked before we started the other one.
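
As a self-contained analogy (not runtime code; all names are made up), the wakeup protocol looks roughly like this: the waker writes the mode field, then readies the worker, and the field is never cleared afterwards, so a stale value only means the worker was woken in that mode at some point in the past:

package main

import "fmt"

type workerMode int

const (
	dedicatedMode workerMode = iota
	fractionalMode
	idleMode
)

// worker models a parked mark worker: mode is written by whoever wakes it,
// immediately before the wakeup, and read by the worker after it wakes.
type worker struct {
	mode workerMode
	wake chan struct{}
}

func main() {
	w := &worker{wake: make(chan struct{})}
	done := make(chan struct{})
	go func() {
		<-w.wake // parked until the "scheduler" readies us
		fmt.Println("woke as mode", w.mode)
		close(done)
	}()
	w.mode = fractionalMode // one-time message...
	close(w.wake)           // ...then wake the worker
	<-done
	// w.mode still reads fractionalMode afterwards: stale but harmless,
	// which is why two Ps can both show fractional mode in a dump.
}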

4 is interesting. I strongly suspect locked goroutines are part of this. I hadn't thought about how that skips specifically the worker scheduling code. OTOH, in https://storage.googleapis.com/go-build-log/8ecdfe49/windows-amd64-gce_e20c47a0.log the workers are parked in their usual spot and GC isn't active, so it shouldn't have scheduled any workers anyway.

rsc commented 8 years ago

The locking of goroutines concerns threads, not Ps, so the fact that it skips the worker scheduling code shouldn't really matter. The locked thread (m) will park itself and some other m will get to run. Not that this helps.

aclements commented 8 years ago

I finally managed to reproduce this outside of the build system! Last night I got 3 failures out of 187 runs on the windows-386-gce builder using the following script:

#!/bin/zsh

VM=$(gomote create windows-386-gce)
if [[ -z $VM ]]; then
    echo "VM creation failed"
    exit 0
fi
STATUS=0
pushlog=$(mktemp)
if ! gomote push $VM >& $pushlog; then
    echo "Push failed:"
    cat $pushlog
    STATUS=1
elif ! gomote run $VM go/src/all.bat; then
    echo "Run failed"
    STATUS=1
fi
rm $pushlog
gomote destroy $VM
exit $STATUS

run as parallel -j 10 ./test1 :::: <(seq 200), with the following modification to run.bat to run only the parallel runtime test:

diff --git a/src/run.bat b/src/run.bat
index 01a66bc..2f4ab00 100644
--- a/src/run.bat
+++ b/src/run.bat
@@ -37,7 +37,7 @@ call env.bat
 del env.bat
 echo.

-go tool dist test
+go tool dist test runtime:cpu124
 if errorlevel 1 goto fail
 echo.

Next steps are to narrow this down a bit and to see if the same approach works on a Linux builder where it's easier to debug things.

aclements commented 8 years ago

Reproduced in 2 out of 230 iterations across 13 windows-386-gce VMs with a loop around go tool dist test:

for /l %%x in (1, 1, 50) do (
    go tool dist test runtime:cpu124
)

aclements commented 8 years ago

Reproduced once out of 135 linux-amd64-noopt VMs with the equivalent Linux change to https://github.com/golang/go/issues/13645#issuecomment-176230977

aclements commented 8 years ago

Reproduced 2 out of 147 iterations across 6 linux-amd64-noopt VMs with the equivalent of https://github.com/golang/go/issues/13645#issuecomment-176285999

aclements commented 8 years ago

I was able to collect some core dumps from this. I took the core dumps without allowing freezetheworld and the various other things that happen in the panic handler, and, interestingly, the story they tell is rather different from the timeout tracebacks.

Here's a representative example:

G state:
       1 waiting (GC assist wait), m 0 (locked to thread)
         started at runtime.main
         created by runtime.rt0_go
       2 waiting (force gc (idle))
         started at runtime.forcegchelper
         created by runtime.init.4
      17 syscall, m 1 (locked to thread)
         started at 0x0
         created by 0x0
      18 waiting (GC sweep wait)
         started at runtime.bgsweep
         created by runtime.gcenable
      19 waiting (finalizer wait)
         started at runtime.runfinq
         created by runtime.createfing
      20 syscall, m 11
         started at runtime.timerproc
         created by runtime.addtimerLocked
     112 waiting (GC worker (idle))
         started at runtime.gcBgMarkWorker
         created by runtime.gcBgMarkStartWorkers
  207479 waiting (GC worker (idle))
         started at runtime.gcBgMarkWorker
         created by runtime.gcBgMarkStartWorkers
  207598 waiting (chan receive)
         started at testing.tRunner
         created by testing.RunTests
  207599 waiting (chan receive)
         started at testing.RunTests.func1
         created by testing.RunTests
  207600 waiting (chan receive)
         started at testing.tRunner
         created by testing.RunTests
  207601 waiting (chan receive)
         started at testing.RunTests.func1
         created by testing.RunTests

M state:
       0 PID 28287, g 1, blocked, locked to thread
       1 PID 0, g 17, locked to thread
       2 PID 28290, blocked
       3 PID 28291, blocked
       4 PID 28292, blocked
       5 PID 28293, blocked
       6 PID 28294, blocked
       7 PID 28295, blocked
       8 PID 28296, epollwait
       9 PID 28297, blocked
      10 PID 28298, blocked
      11 PID 29459, g 20, blocked
      12 PID 29460, blocked
      13 PID 29461, blocked
      14 PID 29462, blocked

P state:
       0 idle, mark worker g 112
         runq: <empty>
       1 idle, mark worker g 207479
         runq: <empty>

global runq: <empty>

All of the Gs are waiting and all of the Ms are blocked. Notably, the main G is blocked in an assist, no workers are running, and everything else is waiting, so nothing schedules a new worker and no worker satisfies the assist.

GC is in mark 2. work.full is empty, but only 2 of 4 mark root jobs have completed, so there is still GC work to be done. In this situation, since there are only two Ps, it's possible that no fractional worker would be scheduled, but I would expect idle workers to keep things going. The lack of idle workers indicates that we somehow failed to satisfy the idle worker start condition on all of the Ps simultaneously just before they all blocked. My best guess is that the workers were detached, but I haven't been able to come up with a sequence that would cause this, and as of the core file all Ps have workers.

A few things worth trying to see if they amplify the chance of this happening: Add delay between detaching and gcMarkDone to encourage the workers to simultaneously detach. Disable fractional and dedicated workers to force everything through idle workers.

For debugging, it may help for the Ps to record some of the state of the world before they go idle, like why they didn't start an idle worker.
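
One possible shape for that breadcrumb (a hedged sketch of the debugging idea only, not existing runtime code; every name here is invented) is a tiny fixed-size ring of "why I went idle" records per P that the timeout handler could dump:

package main

import "fmt"

// idleReason records why a P decided not to start an idle mark worker.
type idleReason uint8

const (
	gcNotRunning idleReason = iota
	noWorkBuffers
	workerDetached
)

var reasonNames = [...]string{"gc not running", "no work buffers", "worker detached"}

// idleLog is a small fixed-size ring buffer, cheap enough to leave enabled
// while reproducing the hang; the newest entries overwrite the oldest.
type idleLog struct {
	buf [8]idleReason
	n   int
}

func (l *idleLog) record(r idleReason) {
	l.buf[l.n%len(l.buf)] = r
	l.n++
}

func (l *idleLog) dump() {
	count := l.n
	if count > len(l.buf) {
		count = len(l.buf)
	}
	for i := 0; i < count; i++ {
		fmt.Println(reasonNames[l.buf[i]])
	}
}

func main() {
	var l idleLog
	l.record(noWorkBuffers)
	l.record(gcNotRunning)
	l.dump()
}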

aclements commented 8 years ago

The following patch exacerbates the problem with locked Gs by locking all Gs. In theory this should work, but it causes the runtime test to regularly freeze in much the same way as seen in these test failures. (It also causes TestMainGoroutineID to fail for uninteresting reasons.)

diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index 2bc3c92..cfb1172 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -2647,6 +2652,8 @@ func newproc(siz int32, fn *funcval) {
        })
 }

+var lockOSThreadFunc uintptr
+
 // Create a new g running fn with narg bytes of arguments starting
 // at argp and returning nret bytes of results.  callerpc is the
 // address of the go statement that created this.  The new g is put
@@ -2703,6 +2710,10 @@ func newproc1(fn *funcval, argp *uint8, narg int32, nret int32, callerpc uintptr
        newg.sched.pc = funcPC(goexit) + sys.PCQuantum // +PCQuantum so that previous instruction is in same function
        newg.sched.g = guintptr(unsafe.Pointer(newg))
        gostartcallfn(&newg.sched, fn)
+       if lockOSThreadFunc == 0 {
+               lockOSThreadFunc = funcPC(LockOSThread)
+       }
+       gostartcallfn(&newg.sched, (*funcval)(unsafe.Pointer(&lockOSThreadFunc)))
        newg.gopc = callerpc
        newg.startpc = fn.fn
        if isSystemGoroutine(newg) {

aclements commented 8 years ago

I've been stressing https://go-review.googlesource.com/19107 for over two hours across 10 VMs and so far none of the 1,242 runs has failed.

gopherbot commented 8 years ago

CL https://golang.org/cl/19107 mentions this issue.