golang / go

The Go programming language
https://go.dev

Building go crashes on 144 core/1+TB RAM CentOS7 machine #18551

Closed hartzell closed 7 years ago

hartzell commented 7 years ago

I happen to have access to a large system and am using it to build a variety of packages. My go builds almost always fail. Details below.

What version of Go are you using (go version)?

Bootstrapping Go from here: https://storage.googleapis.com/golang/go1.4-bootstrap-20161024.tar.gz

Go from here: https://storage.googleapis.com/golang/go1.7.4.src.tar.gz

What operating system and processor architecture are you using (go env)?

[hartzelg@rpbuchop001 go]$ go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH=""
GORACE=""
GOROOT="/isilon/Analysis/scratch/hartzelg/linuxbrew/Cellar/go/1.7.4/libexec"
GOTOOLDIR="/isilon/Analysis/scratch/hartzelg/linuxbrew/Cellar/go/1.7.4/libexec/pkg/tool/linux_amd64"
CC="/isilon/Analysis/scratch/hartzelg/linuxbrew/bin/gcc-5"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build188975734=/tmp/go-build -gno-record-gcc-switches"
CXX="/isilon/Analysis/scratch/hartzelg/linuxbrew/bin/g++-5"
CGO_ENABLED="1"

What did you do?

Building the go recipe in Spack like so:

spack install go@1.7.4

(which builds and uses a bootstrapping system from the 1.4 snapshot).

It boils down to '/usr/bin/bash' 'all.bash'.

I'm seeing this in a job running via Jenkins that simplifies to this:

git clone https://github.com/llnl/spack
cd spack
export PATH=`pwd`/bin:$PATH
ulimit -u 8192
spack install go@1.7.4

What did you expect to see?

A successful build.

What did you see instead?

A panic during testing. Full output (1500+ lines) in this gist.

Here's the first few lines.

ok      regexp  0.455s
ok      regexp/syntax   1.000s
panic: test timed out after 3m0s

goroutine 22864 [running]:
panic(0x5be600, 0xc42017a010)
    /tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/runtime/panic.go:500 +0x1a1 fp=0xc42010cf48 sp=0xc42010ceb8
testing.startAlarm.func1()
    /tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/testing/testing.go:918 +0x10b fp=0xc42010cfb0 sp=0xc42010cf48
runtime.goexit()
    /tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc42010cfb8 sp=0xc42010cfb0
created by time.goFunc
    /tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/time/sleep.go:154 +0x44

goroutine 1 [chan receive, 2 minutes, locked to thread]:
runtime.gopark(0x629c68, 0xc420016598, 0x60cadd, 0xc, 0xc420110b17, 0x3)
    /tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/runtime/proc.go:259 +0x13a fp=0xc420110b10 sp=0xc420110ae0
runtime.goparkunlock(0xc420016598, 0x60cadd, 0xc, 0xc420000117, 0x3)
    /tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/runtime/proc.go:265 +0x5e fp=0xc420110b50 sp=0xc420110b10
runtime.chanrecv(0x5bc980, 0xc420016540, 0x0, 0xc420110c01, 0x47434c)
    /tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/runtime/chan.go:496 +0x2df fp=0xc420110bd8 sp=0xc420110b50
runtime.chanrecv1(0x5bc980, 0xc420016540, 0x0)
    /tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/runtime/chan.go:378 +0x35 fp=0xc420110c10 sp=0xc420110bd8
testing.(*T).Run(0xc42001a180, 0x611a7e, 0x18, 0x62a6c8, 0xc42011fc01)
    /tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/testing/testing.go:647 +0x316 fp=0xc420110cb8 sp=0xc420110c10
testing.RunTests.func1(0xc42001a180)
    /tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/testing/testing.go:793 +0x6d fp=0xc420110d08 sp=0xc420110cb8
testing.tRunner(0xc42001a180, 0xc420110de8)
    /tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/testing/testing.go:610 +0x81 fp=0xc420110d30 sp=0xc420110d08
testing.RunTests(0x6295c0, 0x70f7c0, 0xc4, 0xc4, 0x43bf04)
    /tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/testing/testing.go:799 +0x2f5 fp=0xc420110e18 sp=0xc420110d30
testing.(*M).Run(0xc420110ef8, 0x40f769)
    /tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/testing/testing.go:743 +0x85 fp=0xc420110ea0 sp=0xc420110e18
runtime_test.TestMain(0xc420110ef8)
    /tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/runtime/crash_test.go:27 +0x2f fp=0xc420110ef0 sp=0xc420110ea0
main.main()
    runtime/_test/_testmain.go:836 +0xc6 fp=0xc420110f58 sp=0xc420110ef0
runtime.main()
    /tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/runtime/proc.go:183 +0x1f4 fp=0xc420110fb0 sp=0xc420110f58
runtime.goexit()
    /tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc420110fb8 sp=0xc420110fb0

goroutine 17 [syscall, 2 minutes, locked to thread]:
runtime.goexit()
    /tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc42010efb8 sp=0xc42010efb0

Some time later (15 minutes after the test failed), I still have a go-build-related process running:

hartzelg  94605  400  0.0   4060  1740 ?        Rl   14:33  86:51 /tmp/go-build307875912/a.exe

and Jenkins thinks that the job is still running.

minux commented 7 years ago

The test that times out is: runtime_test.TestCrashDumpsAllThreads

Do you still have the a.exe running? Is it possible to obtain a backtrace of all running threads using gdb?

bradfitz commented 7 years ago

Are you sure this is related to the machine's size? Does this build fail the same way on a smaller machine with the same software?

hartzell commented 7 years ago

Do you still have the a.exe running? Is it possible to obtain a backtrace of all running threads using gdb?

No, it is not still running. I have something else running for a bit, but I can rerun this later and, if it ends up in the same state, grab the backtrace (but see below...).

Can you give me instructions on getting the backtrace?

hartzell commented 7 years ago

Are you sure this is related to the machine's size?

Not sure, but it seems related to this box. I was originally building successfully on a Digital Ocean 16GB instance without any trouble. I anecdotally believe that I've successfully built "in-house" on other systems, but since it occasionally succeeds on the big box I'm not really sure.

It's also possible that the box has hardware issues (though I haven't seen problems with any other tools).

I can find/use a smaller box with the same config and see what it does though it'll take a day or two.

minux commented 7 years ago

Find out the pid of the stray process, run gdb -p $PID and at the gdb prompt, run the following:

set logging file /tmp/gdb.log
set logging on
set height 0
thread apply all bt
quit

And then paste the output (in /tmp/gdb.log).

You can also try running the test directly (after building with make.bash): go test runtime -run=TestCrashDumpsAllThreads

Also note that Go 1.8 will be released in a month, so please also give the latest master branch a try.

hartzell commented 7 years ago

The test that times out is: runtime_test.TestCrashDumpsAllThreads

Is it possible that whatever it's doing just takes more than the test timeout (and more than the 15 minutes that passed while I was writing up the issue), perhaps because of the size of the system?

davecheney commented 7 years ago

On non-Intel systems I'd say yes. But on Intel, when this test hangs, adding more time won't help; it's either a real bug or a flaky test.

hartzell commented 7 years ago

Between filing the issue and reading @minux's and @bradfitz's followups, I started a run with GOMAXPROCS=16 in a grasping-at-straws debugging effort and switched the working directory from an NFS mount to something local (because debugging's more fun if you change multiple things at the same time, sigh...; but seriously, I'm also trying to debug some infra issues).

I'll let it finish (because I'm Curious), swap the working dir back and see if I can catch that "leftover" process running.

bradfitz commented 7 years ago

NFS has been problematic in the past, IIRC.

hartzell commented 7 years ago

NFS has been problematic in the past, IIRC.

That's generally how I feel about it... 🙁

minux commented 7 years ago

The TestCrashDumpsAllThreads test (https://golang.org/src/runtime/crash_unix_test.go#L26) shouldn't take much time to run. If it hangs, it's usually some problem in the implementation of the crash dump (which involves some signal trickery).

From the test log, the test timed out while waiting for the process to finish the crash dump:

goroutine 22861 [syscall, 2 minutes]:
syscall.Syscall6(0xf7, 0x1, 0x1718d, 0xc420474000, 0x1000004, 0x0, 0x0, 0xc42010bb07, 0xc420037500, 0xc42014cd00)
    /tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/syscall/asm_linux_amd64.s:44 +0x5 fp=0xc42010baf0 sp=0xc42010bae8
os.(*Process).blockUntilWaitable(0xc42000d230, 0xe, 0xc42010bd1f, 0x1)
    /tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/os/wait_waitid.go:28 +0xbc fp=0xc42010bb88 sp=0xc42010baf0
os.(*Process).wait(0xc42000d230, 0x506471, 0xc42000d244, 0x3)
    /tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/os/exec_unix.go:22 +0xab fp=0xc42010bc18 sp=0xc42010bb88
os.(*Process).Wait(0xc42000d230, 0x0, 0x3, 0x0)
    /tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/os/doc.go:49 +0x2b fp=0xc42010bc48 sp=0xc42010bc18
os/exec.(*Cmd).Wait(0xc420190160, 0x6f7f00, 0xc42010bd60)
    /tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/os/exec/exec.go:434 +0x6d fp=0xc42010bcd8 sp=0xc42010bc48
runtime_test.TestCrashDumpsAllThreads(0xc42001a240)
    /tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/runtime/crash_unix_test.go:99 +0x849 fp=0xc42010bf78 sp=0xc42010bcd8
testing.tRunner(0xc42001a240, 0x62a6c8)
    /tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/testing/testing.go:610 +0x81 fp=0xc42010bfa0 sp=0xc42010bf78
runtime.goexit()
    /tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc42010bfa8 sp=0xc42010bfa0
created by testing.(*T).Run
    /tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/testing/testing.go:646 +0x2ec

Therefore the test has already successfully sent SIGQUIT to the process, but the process fails to stop, indicating something is wrong with signal handling in the runtime.
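
For reference, here is a minimal sketch of that round trip. It is an illustration only, not the code from crash_unix_test.go; the env-var trick, the busy goroutines, and the timeout value are all made up for the example. A parent re-execs itself as a child with GOTRACEBACK=crash, sends SIGQUIT, and expects Wait to return promptly; the hang above corresponds to the timeout branch.

// sigquit_sketch.go: a minimal sketch (not the actual runtime test) of the
// SIGQUIT round trip TestCrashDumpsAllThreads relies on: start a child with
// GOTRACEBACK=crash, send it SIGQUIT, and expect it to die with a full
// per-thread dump.
package main

import (
    "fmt"
    "os"
    "os/exec"
    "syscall"
    "time"
)

func main() {
    if os.Getenv("BE_CHILD") == "1" {
        // Child: burn some CPU so there is something to dump, then hang.
        for i := 0; i < 4; i++ {
            go func() {
                for {
                }
            }()
        }
        select {}
    }

    // Parent: re-exec ourselves as the child with crash-style tracebacks.
    cmd := exec.Command(os.Args[0])
    cmd.Env = append(os.Environ(), "BE_CHILD=1", "GOTRACEBACK=crash")
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr
    if err := cmd.Start(); err != nil {
        panic(err)
    }

    time.Sleep(500 * time.Millisecond) // let the child get going
    if err := cmd.Process.Signal(syscall.SIGQUIT); err != nil {
        panic(err)
    }

    // The real test hung here, in Wait, because the child never exited.
    done := make(chan error, 1)
    go func() { done <- cmd.Wait() }()
    select {
    case err := <-done:
        fmt.Println("child exited (expected):", err)
    case <-time.After(30 * time.Second):
        fmt.Println("child ignored SIGQUIT (same symptom as the hang above)")
    }
}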

hartzell commented 7 years ago

Having gotten the snark out of my system, it's worth noting that even though I've moved the Jenkins working directory, Spack (unless you configure it otherwise) runs its unbundle, build and install steps in a directory tree symlinked into /tmp, which in my case is a 15TB local filesystem.

The change will speed up some other parts of the job (recursively removing the tree takes forever in our NFS environment) but after consideration I don't think it will change the outcome of this issue.

Still, it's a change and it's worth being up front about it.

hartzell commented 7 years ago

[sorry for the delay, been down with a cold...]

Last week I fired off a build w/ GOMAXPROCS=16 before I received the feedback above.

The gist is here.

I think it's different, but I'm guessing....

Unless there's new feedback my next step will be to try to recreate the original crash and catch the errant process in gdb.

hartzell commented 7 years ago

With one caveat, I have been unable to get the build to fail for go@1.8rc1.

The caveat is that I seem to need to adjust the maximum number of user processes upward; the default seems to be 4096. Left as-is, I see this:

##### ../test/bench/go1
testing: warning: no tests to run
PASS
ok      _/tmp/hartzelg/spack-stage/spack-stage-D6YdOI/go/test/bench/go1 1.916s

##### ../test
runtime: failed to create new OS thread (have 14 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc

runtime stack:
runtime.throw(0x55a76e, 0x9)
    /tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/panic.go:596 +0x95
runtime.newosproc(0xc42023e400, 0xc42012c000)
    /tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/os_linux.go:163 +0x18c
runtime.newm(0x564320, 0x0)
    /tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/proc.go:1628 +0x137
runtime.startTheWorldWithSema()
    /tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/proc.go:1122 +0x1b6
runtime.systemstack(0xc420037300)
    /tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/asm_amd64.s:327 +0x79
runtime.mstart()
    /tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/proc.go:1132

goroutine 305 [running]:
runtime.systemstack_switch()
    /tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/asm_amd64.s:281 fp=0xc42027ec98 sp=0xc42027ec90
runtime.gcStart(0x0, 0xc42055c000)
    /tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/mgc.go:1070 +0x29d fp=0xc42027ecd0 sp=0xc42027ec98
runtime.mallocgc(0x80000, 0x52c080, 0xc420388001, 0xc42027edf0)
    /tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/malloc.go:773 +0x48f fp=0xc42027ed70 sp=0xc42027ecd0
runtime.makeslice(0x52c080, 0x7fe00, 0x7fe00, 0x0, 0x0, 0x3fe00)
    /tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/slice.go:54 +0x7b fp=0xc42027edc0 sp=0xc42027ed70
bytes.makeSlice(0x7fe00, 0x0, 0x0, 0x0)
    /tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/bytes/buffer.go:201 +0x77 fp=0xc42027ee00 sp=0xc42027edc0
bytes.(*Buffer).ReadFrom(0xc420492230, 0x5f54c0, 0xc4203c20c8, 0x2aaaaacad028, 0xc420492230, 0x1)
    /tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/bytes/buffer.go:173 +0x2ab fp=0xc42027ee70 sp=0xc42027ee00
io.copyBuffer(0x5f52c0, 0xc420492230, 0x5f54c0, 0xc4203c20c8, 0x0, 0x0, 0x0, 0x4c5fbb, 0xc420225600, 0x6c6961642f656301)
    /tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/io/io.go:384 +0x2cb fp=0xc42027eed8 sp=0xc42027ee70
io.Copy(0x5f52c0, 0xc420492230, 0x5f54c0, 0xc4203c20c8, 0xc420492230, 0x302e312d32700101, 0xc420457fc8)
    /tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/io/io.go:360 +0x68 fp=0xc42027ef38 sp=0xc42027eed8
os/exec.(*Cmd).writerDescriptor.func1(0x2d6b636170732f67, 0x6270722d656d6963)
    /tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/os/exec/exec.go:254 +0x4d fp=0xc42027ef98 sp=0xc42027ef38
os/exec.(*Cmd).Start.func1(0xc420225600, 0xc4202f6020)
    /tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/os/exec/exec.go:371 +0x27 fp=0xc42027efd0 sp=0xc42027ef98
runtime.goexit()
    /tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc42027efd8 sp=0xc42027efd0
created by os/exec.(*Cmd).Start
    /tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/os/exec/exec.go:372 +0x4e4

goroutine 1 [chan receive, 1 minutes]:
main.(*tester).runPending(0xc4201540f0, 0x0)
    /tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/cmd/dist/test.go:970 +0x31f
main.(*tester).run(0xc4201540f0)
    /tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/cmd/dist/test.go:214 +0x91a
main.cmdtest()
    /tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/cmd/dist/test.go:42 +0x2bb
main.xmain()
    /tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/cmd/dist/main.go:43 +0x233
main.main()
    /tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/cmd/dist/util.go:509 +0x283
[...]

If I ulimit -u 8192 then I am 5 for 5 successful builds for go@1.8rc1.
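
For anyone who wants to sanity-check their own box, here is a small diagnostic sketch of mine that compares the core count and GOMAXPROCS with the RLIMIT_NPROC soft limit that ulimit -u reports. It uses golang.org/x/sys/unix, which is an extra dependency and not anything all.bash itself uses.

// nproc_check.go: report cores, GOMAXPROCS, and the RLIMIT_NPROC soft limit.
package main

import (
    "fmt"
    "runtime"

    "golang.org/x/sys/unix"
)

func main() {
    var rl unix.Rlimit
    if err := unix.Getrlimit(unix.RLIMIT_NPROC, &rl); err != nil {
        panic(err)
    }
    fmt.Printf("cores: %d, GOMAXPROCS: %d\n", runtime.NumCPU(), runtime.GOMAXPROCS(0))
    fmt.Printf("RLIMIT_NPROC soft=%d hard=%d\n", rl.Cur, rl.Max)

    // Each test binary in all.bash can spin up threads proportional to the
    // core count, and several run in parallel, so on a 144-core box a 4096
    // limit is easy to exhaust (hence "failed to create new OS thread").
}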

I'm inclined to close this and move on, but would appreciate feedback about whether there's something more useful I could do here.

hartzell commented 7 years ago

FWIW, go@1.7.5 and ulimit -u 8192 also seem to work.

hartzell commented 7 years ago

I've started seeing failures again. One thing that I've noticed is that the machine is very heavily loaded (another large compute-bound job is keeping all cores busy).

I see no sign of the straggling process that I reported earlier.

Here's a gist of the test failure.

https://gist.github.com/hartzell/5d3433593da9d679107c40273c4242f9

RLH commented 7 years ago

If the Go process is set to time out after 3 minutes and the OS decides not to give the Go process any CPU for 3 minutes, I would not be surprised if the Go process times out.

hartzell commented 7 years ago

Is there a way to extend the timeout? @minux mentioned above that the test itself was quick and that if it took any appreciable time it was an indication that something was amiss.

bradfitz commented 7 years ago

It takes 16 seconds on my 2 core machine. Extending the timeout doesn't seem like the answer.

hartzell commented 7 years ago

Hmmm. Would that scale linearly with core count? Would having 1.5TB of RAM (even if not all in use) make it take longer?

((16 seconds / 2 cores) * 144 cores) * (1 minute / 60 seconds) = 19.2 minutes

hartzell commented 7 years ago

Reading through the test to look for timeouts etc. showed me a reference to GOMAXPROCS. For grins (aka grasping at straws) I just started a run with export GOMAXPROCS=8 near the top, thinking that it might limit the number of threads/goroutines/procs/... that were involved.
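
For what it's worth, here is a little Linux-only sketch of mine (the goroutine count and sleep duration are arbitrary) of why GOMAXPROCS alone may not bound the thread count that ulimit -u cares about, since goroutines blocked in raw syscalls each still pin an OS thread:

// gomaxprocs_sketch.go: GOMAXPROCS caps goroutines executing Go code at once,
// but a goroutine blocked in a syscall still occupies an OS thread, so the
// process thread count can grow well past GOMAXPROCS.
package main

import (
    "fmt"
    "runtime"
    "runtime/pprof"
    "sync"
    "syscall"
    "time"
)

func main() {
    runtime.GOMAXPROCS(8) // same knob as `export GOMAXPROCS=8`

    var wg sync.WaitGroup
    for i := 0; i < 64; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            // Block this goroutine's OS thread directly in the kernel.
            ts := syscall.Timespec{Sec: 2}
            syscall.Nanosleep(&ts, nil)
        }()
    }

    time.Sleep(time.Second) // let the goroutines get into the syscall
    fmt.Println("GOMAXPROCS:          ", runtime.GOMAXPROCS(0))
    fmt.Println("threads ever created:", pprof.Lookup("threadcreate").Count())
    wg.Wait()
}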

hartzell commented 7 years ago

It failed with GOMAXPROCS=8, but with a much shorter stack trace (which at least suggests that I turned a knob that was hooked up to something).

https://gist.github.com/hartzell/6363949312627aa1d417124e1ac7b3fc

I was back in that terminal session fairly quickly after it failed and did not see any build-related processes hanging around.

Is there any way that I can either instrument this or provoke it into telling me something useful?

hartzell commented 7 years ago

On rereading I was reminded that @minux pointed out that I can run the test "by hand". Using the build that just failed, I can't make it fail.

[hartzelg@rpbuchop001 go]$ GOROOT=`pwd` ./bin/go test runtime -run=TestCrashDumpsAllThreads
ok      runtime 0.864s
[hartzelg@rpbuchop001 go]$ GOROOT=`pwd` ./bin/go test runtime -run=TestCrashDumpsAllThreads
ok      runtime 0.882s
[hartzelg@rpbuchop001 go]$ uptime
 12:46:07 up 17 days, 23:14,  1 user,  load average: 142.63, 145.37, 146.39

And then just to work on my lunch for a while (each iteration seems to take ~5 seconds, the load is ~160, most cores around 100%):

[hartzelg@rpbuchop001 go]$ while (true)
> do
> GOROOT=`pwd` ./bin/go test runtime -run=TestCrashDumpsAllThreads
> done
ok      runtime 0.749s
ok      runtime 0.830s
ok      runtime 0.840s
ok      runtime 0.678s
ok      runtime 0.906s
ok      runtime 0.917s
ok      runtime 0.847s
ok      runtime 0.727s
ok      runtime 0.790s
ok      runtime 0.785s
ok      runtime 0.720s
ok      runtime 1.013s
ok      runtime 0.765s
ok      runtime 0.850s
ok      runtime 0.719s
ok      runtime 1.070s
ok      runtime 1.240s
ok      runtime 0.900s
ok      runtime 0.997s
ok      runtime 0.864s
ok      runtime 0.865s
ok      runtime 0.982s
ok      runtime 0.723s
ok      runtime 0.904s
ok      runtime 0.929s
ok      runtime 0.955s
ok      runtime 0.709s
ok      runtime 0.779s
ok      runtime 0.843s
ok      runtime 0.860s
ok      runtime 0.840s
ok      runtime 0.887s
ok      runtime 0.837s
ok      runtime 0.773s
ok      runtime 0.760s
ok      runtime 1.047s

Is there some way that running the build inside Spack (roughly analogous to Homebrew...) could be making this fail? They do a bit of IO redirection magic, but I don't think there's any deep voodoo.

davecheney commented 7 years ago

You should not set GOROOT if you have built from source.

hartzell commented 7 years ago

[edit to add comment about what I was trying to do...]

@davecheney -- I'm not sure how else to do it. The Python code that did the build (via Spack) is here.

I came along afterwards and did this:

[hartzelg@rpbuchop001 go]$ ./bin/go version
go: cannot find GOROOT directory: /tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.7.5-r2ql2ftjxnlu4esctfokszedfrooyf63

Actually, I was trying to run the tests as @minux suggested; version is just a stand-in....

Setting GOROOT seemed like an expedient way to use the un-installed tree full of stuff that I'd built and appears (Danger Will Robinson...) to work.

What should I have done?

davecheney commented 7 years ago

If your CI builds Go in a random directory and then moves the files after the build, it should set the variable GOROOT_FINAL to the final location.

hartzell commented 7 years ago

Sure. But the problem is that I'm poking at a build that failed, so nothing got copied anywhere and I'm trying to run the intermediate product through one of the tests by hand to see if I can figure out the heisenbug.

davecheney commented 7 years ago

Is it possible to take your build system out of the picture and build from source directly? I'm really concerned that setting GOROOT introduces the possibility of pointing the go tool at a different version of Go which has a long history of causing weird bugs.

hartzell commented 7 years ago

I can (and will) try to recreate the failure running the build by hand. That might be a useful simplification.

The build system does not use GOROOT (at least not in any way that I can suss out). The failures that happen there are almost certainly not related to GOROOT.

The build does set GOROOT_FINAL to the location where the tree will be copied if the build succeeds. That seems to be consistent with the documentation for building from source.

The only use of GOROOT, which you've latched on to, was me trying to figure out how to follow @minux's instructions on how to run the failing test by hand. I take your point to heart and believe that the results I get when I use GOROOT to run the tests within the staging directory might be misleading.

Given that I have a tree that's been built with GOROOT_FINAL set and that I am trying to debug a problem with that tree, what is my alternative to setting GOROOT?

hartzell commented 7 years ago

I've been running the build by hand.

I have set ulimit -u 8192; otherwise I see this:

##### ../test
runtime: failed to create new OS thread (have 11 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc

The build just succeeded 3 for 3. That's not as comforting as it might be; there have been periods where it has worked via Spack before.

The automated (Jenkins, Spack) build failed last night. The automated job has export GOMAXPROCS=4 because I was playing with it and left it in. The failure was the same one I've been seeing, with the smaller stack trace that results from setting GOMAXPROCS.

The machine has a long-running job that is keeping all of the cores mostly busy; it was running yesterday and last night and is running now. Load is around 140. Plenty of free RAM.

The Spack build framework does a bit of magic with filehandles (and possibly other things) to manage the tasks that it runs.

@minux mentions that the failing test (or the thing being tested) involves "signal trickery". Is it possible that the Spack framework is doing something that's getting in the way?

That seems fragile, and if it were the cause I'd expect that the Spack build would never succeed, but I'm grasping at straws, so....
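
One cheap check I could run inside the Jenkins/Spack environment is whether SIGQUIT arrives ignored, since the crash-dump machinery depends on it. The sketch below needs a newer Go (1.11+ for os/signal.Ignored) than anything in this thread, so it's only a diagnostic idea, not something all.bash does:

// sigquit_env_check.go: report whether the process inherited an ignored
// SIGQUIT from its parent (Jenkins, Spack, ssh, ...).
package main

import (
    "fmt"
    "os/signal"
    "syscall"
)

func main() {
    fmt.Println("SIGQUIT currently ignored:", signal.Ignored(syscall.SIGQUIT))
}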

hartzell commented 7 years ago

I'm still casting about for a way to get a handle on this.

As I've detailed above, the core of my test case is building go via spack, like so:

spack install go@1.8   # 1.8 has been released since I started this thread....

I run a build every night as a Jenkins job. The Jenkins master is a Docker container and the job runs in a slave executor on a real Linux system (no VM, no container). That build seems to nearly always fail.

I sometimes check out a Spack tree and run the command by hand; that seems to generally work these days (though when I filed this issue my initial test case w/ 1.7.3 failed when run by hand).

I can also cd into the Spack clone that failed e.g. last night and run spack install go@1.8 and it works consistently (since 1.8 at least).

I've moved the build from the 144-core machine onto a 24-core box. When building by hand (at least) I no longer need ulimit -u 8192. The Jenkins build still fails. Runs by hand still work.

Is there anything about runtime_test.TestCrashDumpsAllThreads that might interact badly with a Jenkins master->slave ssh session?

hartzell commented 7 years ago

I've pulled the go build out into a separate job that just does:

#!/bin/sh
set -e
spack install go@1.8

This builds a bootstrap go@1.4 and uses it to build 1.8.

It crashes. Here is a gist of an example crash. It seems to be a timeout but does not mention TestCrashDumpsAllThreads as the earlier ones did.

On the other hand, the next one (run without the -e) does mention TestCrashDumpsAllThreads; gist here.

hartzell commented 7 years ago

I've reproduced this problem on a smaller machine and w/out involving Spack.

Jenkins seems to be central to the problem. I've opened a new issue: #19203.