hartzell closed this issue 7 years ago.
The test that times out is: runtime_test.TestCrashDumpsAllThreads
Do you still have the a.exe running? Is it possible to obtain a backtrace of all running threads using gdb?
Are you sure this is related to the machine's size? Does this build fail the same way on a smaller machine with the same software?
Do you still have the a.exe running? Is it possible to obtain a backtrace of all running threads using gdb?
No, it is not still running. I have something else running for a bit, but I can rerun this later and, if it ends up in the same state, try to grab the backtrace (but see below...).
Can you give me instructions on getting the backtrace?
Are you sure this is related to the machine's size?
Not sure, but it seems related to this box. I was originally building successfully on a Digital Ocean 16GB instance without any trouble. I anecdotally believe that I've successfully built "in-house" on other systems, but since it occasionally succeeds on the big box I'm not really sure.
It's also possible that the box has hardware issues (though I haven't seen problems with any other tools).
I can find/use a smaller box with the same config and see what it does, though it'll take a day or two.
Find out the pid of the stray process, run gdb -p $PID and at the gdb prompt, run the following:
set logging file /tmp/gdb.log
set logging on
set height 0
thread apply all bt
quit
And then paste the output (in /tmp/gdb.log).
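For example, a non-interactive sketch of the same capture (assuming the stray helper is still named a.exe and that pgrep is available; adjust the process name to whatever ps shows):

# Hypothetical one-shot capture of all thread backtraces into /tmp/gdb.log.
PID=$(pgrep -f a.exe)
gdb -p "$PID" -batch -ex 'set height 0' -ex 'thread apply all bt' > /tmp/gdb.log 2>&1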
You can also try running the test directly (after building with make.bash): go test runtime -run=TestCrashDumpsAllThreads
Also note that Go 1.8 will be released in a month, so please also give the latest master branch a try.
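In case it helps, a sketch of that sequence when building straight from a source checkout (paths are illustrative, and it assumes the tree stays where it was built rather than being staged for a move elsewhere):

cd go/src
./make.bash                                   # build the toolchain without running the full test suite
export PATH="$PWD/../bin:$PATH"               # put the freshly built go tool first
go test runtime -run=TestCrashDumpsAllThreads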
The test that times out is: runtime_test.TestCrashDumpsAllThreads
Is it possible that whatever it's doing just takes more than the test timeout (and more than the 15 minutes that passed while I was writing up the issue), perhaps because of the size of the system?
On non-Intel systems I'd say yes. But on Intel, when this test hangs, adding more time won't help; it's either a real bug or a flaky test.
Between filing the issue and reading @minux's and @bradfitz's followups, I started a run with GOMAXPROCS=16 in a grasping-at-straws debugging effort, and switched the working directory from an NFS mount to something local (because debugging's more fun if you change multiple things at the same time, sigh...; but seriously, I'm also trying to debug some infrastructure issues).
I'll let it finish (because I'm curious), swap the working dir back, and see if I can catch that "leftover" process running.
NFS has been problematic in the past, IIRC.
NFS has been problematic in the past, IIRC.
That's generally how I feel about it... 🙁
The TestCrashDumpsAllThreads test (https://golang.org/src/runtime/crash_unix_test.go#L26) shouldn't take much time to run. If it hangs, it's usually some problem in the implementation of the crash dump (which involves some signal trickery).
From the test log, the test timed out while waiting for the process to finish the crash dump:
goroutine 22861 [syscall, 2 minutes]:
syscall.Syscall6(0xf7, 0x1, 0x1718d, 0xc420474000, 0x1000004, 0x0, 0x0, 0xc42010bb07, 0xc420037500, 0xc42014cd00)
/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/syscall/asm_linux_amd64.s:44 +0x5 fp=0xc42010baf0 sp=0xc42010bae8
os.(*Process).blockUntilWaitable(0xc42000d230, 0xe, 0xc42010bd1f, 0x1)
/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/os/wait_waitid.go:28 +0xbc fp=0xc42010bb88 sp=0xc42010baf0
os.(*Process).wait(0xc42000d230, 0x506471, 0xc42000d244, 0x3)
/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/os/exec_unix.go:22 +0xab fp=0xc42010bc18 sp=0xc42010bb88
os.(*Process).Wait(0xc42000d230, 0x0, 0x3, 0x0)
/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/os/doc.go:49 +0x2b fp=0xc42010bc48 sp=0xc42010bc18
os/exec.(*Cmd).Wait(0xc420190160, 0x6f7f00, 0xc42010bd60)
/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/os/exec/exec.go:434 +0x6d fp=0xc42010bcd8 sp=0xc42010bc48
runtime_test.TestCrashDumpsAllThreads(0xc42001a240)
/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/runtime/crash_unix_test.go:99 +0x849 fp=0xc42010bf78 sp=0xc42010bcd8
testing.tRunner(0xc42001a240, 0x62a6c8)
/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/testing/testing.go:610 +0x81 fp=0xc42010bfa0 sp=0xc42010bf78
runtime.goexit()
/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc42010bfa8 sp=0xc42010bfa0
created by testing.(*T).Run
/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/testing/testing.go:646 +0x2ec
Therefore the test has already successfully sent SIGQUIT to the process, but the process fails to stop, indicating that something is wrong with signal handling in the runtime.
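One quick manual check worth sketching here, on the assumption that the stray helper is still running: by default a Go program that receives SIGQUIT dumps all goroutine stacks to its stderr and exits, so if the runtime's signal handling really is wedged, the process may simply not react, which is itself a useful data point.

# Hypothetical: a.exe is the helper binary left behind by the hung test.
kill -QUIT "$(pgrep -f a.exe)"
# If no stack dump appears and the process survives, fall back to the gdb recipe above.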
Having gotten the snark out of my system, it's worth noting that even though I've moved the Jenkins working directory, Spack (unless you configure it otherwise) runs its unbundle, build, and install steps in a directory tree symlinked into /tmp, which in my case is a 15TB local filesystem.
The change will speed up some other parts of the job (recursively removing the tree takes forever in our NFS environment) but after consideration I don't think it will change the outcome of this issue.
Still, it's a change and it's worth being up front about it.
[sorry for the delay, been down with a cold...]
Last week I fired off a build w/ GOMAXPROCS=16 before I received the feedback above.
The gist is here.
I think it's different, but I'm guessing....
Unless there's new feedback my next step will be to try to recreate the original crash and catch the errant process in gdb.
With one caveat, I have been unable to get the build to fail for go@1.8rc1.
The caveat is that I seem to need to adjust the maximum number of user processes upward. The default seems to be 4096. Left as-is, I see this:
##### ../test/bench/go1
testing: warning: no tests to run
PASS
ok _/tmp/hartzelg/spack-stage/spack-stage-D6YdOI/go/test/bench/go1 1.916s
##### ../test
runtime: failed to create new OS thread (have 14 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
runtime stack:
runtime.throw(0x55a76e, 0x9)
/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/panic.go:596 +0x95
runtime.newosproc(0xc42023e400, 0xc42012c000)
/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/os_linux.go:163 +0x18c
runtime.newm(0x564320, 0x0)
/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/proc.go:1628 +0x137
runtime.startTheWorldWithSema()
/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/proc.go:1122 +0x1b6
runtime.systemstack(0xc420037300)
/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/asm_amd64.s:327 +0x79
runtime.mstart()
/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/proc.go:1132
goroutine 305 [running]:
runtime.systemstack_switch()
/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/asm_amd64.s:281 fp=0xc42027ec98 sp=0xc42027ec90
runtime.gcStart(0x0, 0xc42055c000)
/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/mgc.go:1070 +0x29d fp=0xc42027ecd0 sp=0xc42027ec98
runtime.mallocgc(0x80000, 0x52c080, 0xc420388001, 0xc42027edf0)
/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/malloc.go:773 +0x48f fp=0xc42027ed70 sp=0xc42027ecd0
runtime.makeslice(0x52c080, 0x7fe00, 0x7fe00, 0x0, 0x0, 0x3fe00)
/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/slice.go:54 +0x7b fp=0xc42027edc0 sp=0xc42027ed70
bytes.makeSlice(0x7fe00, 0x0, 0x0, 0x0)
/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/bytes/buffer.go:201 +0x77 fp=0xc42027ee00 sp=0xc42027edc0
bytes.(*Buffer).ReadFrom(0xc420492230, 0x5f54c0, 0xc4203c20c8, 0x2aaaaacad028, 0xc420492230, 0x1)
/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/bytes/buffer.go:173 +0x2ab fp=0xc42027ee70 sp=0xc42027ee00
io.copyBuffer(0x5f52c0, 0xc420492230, 0x5f54c0, 0xc4203c20c8, 0x0, 0x0, 0x0, 0x4c5fbb, 0xc420225600, 0x6c6961642f656301)
/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/io/io.go:384 +0x2cb fp=0xc42027eed8 sp=0xc42027ee70
io.Copy(0x5f52c0, 0xc420492230, 0x5f54c0, 0xc4203c20c8, 0xc420492230, 0x302e312d32700101, 0xc420457fc8)
/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/io/io.go:360 +0x68 fp=0xc42027ef38 sp=0xc42027eed8
os/exec.(*Cmd).writerDescriptor.func1(0x2d6b636170732f67, 0x6270722d656d6963)
/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/os/exec/exec.go:254 +0x4d fp=0xc42027ef98 sp=0xc42027ef38
os/exec.(*Cmd).Start.func1(0xc420225600, 0xc4202f6020)
/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/os/exec/exec.go:371 +0x27 fp=0xc42027efd0 sp=0xc42027ef98
runtime.goexit()
/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc42027efd8 sp=0xc42027efd0
created by os/exec.(*Cmd).Start
/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/os/exec/exec.go:372 +0x4e4
goroutine 1 [chan receive, 1 minutes]:
main.(*tester).runPending(0xc4201540f0, 0x0)
/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/cmd/dist/test.go:970 +0x31f
main.(*tester).run(0xc4201540f0)
/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/cmd/dist/test.go:214 +0x91a
main.cmdtest()
/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/cmd/dist/test.go:42 +0x2bb
main.xmain()
/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/cmd/dist/main.go:43 +0x233
main.main()
/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/cmd/dist/util.go:509 +0x283
[...]
If I ulimit -u 8192 then I am 5 for 5 successful builds for go@1.8rc1.
I'm inclined to close this and move on, but would appreciate feedback about whether there's something more useful to do here first.
FWIW, go@1.7.5 with ulimit -u 8192 also seems to work.
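For reference, a minimal sketch of the workaround as I'm applying it (the value 8192 is just what happened to be enough here, not a recommendation, and the build path is illustrative):

ulimit -u                 # show the current per-user process/thread soft limit (4096 here)
ulimit -u 8192            # raise the soft limit for this shell, within the hard limit
cd go/src && ./all.bash   # then run the build in the same shell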
I've started seeing failures again. One thing that I've noticed is that the machine is very heavily loaded (another large compute-bound job keeping all cores busy).
I see no sign of the straggling process that I reported earlier.
Here's a gist of the test failure.
https://gist.github.com/hartzell/5d3433593da9d679107c40273c4242f9
If the Go process is set to time out after 3 minutes and the OS decides not to give the Go process any CPU for 3 minutes, I would not be surprised if the Go process times out.
Is there a way to extend the timeout? @minux mentioned above that the test itself was quick and that if it took any appreciable time, that was an indication that something was amiss.
It takes 16 seconds on my 2 core machine. Extending the timeout doesn't seem like the answer.
Hmmm. Would that scale linearly with core count? Would having 1.5TB of RAM (even if not all in use) make it take longer?
((16 seconds / 2 cores) * 144 cores) * (1 minute / 60 seconds) = 19.2 minutes
Reading through the test to look for timeouts etc. showed me a reference to GOMAXPROCS. For grins (aka grasping at straws) I just started a run with export GOMAXPROCS=8 near the top, thinking that it might limit the number of threads/goroutines/procs/... that were involved.
It failed with GOMAXPROCS=8, but with a much shorter stack trace (which at least suggests that I turned a knob that was hooked up to something).
https://gist.github.com/hartzell/6363949312627aa1d417124e1ac7b3fc
I was back in that terminal session fairly quickly after it failed and did not see any build-related processes hanging around.
Is there any way that I can either instrument this or provoke it into telling me something useful?
On rereading I was reminded that @minux pointed out that I can run the test "by hand". Using the build that just failed, I can't make it fail.
[hartzelg@rpbuchop001 go]$ GOROOT=`pwd` ./bin/go test runtime -run=TestCrashDumpsAllThreads
ok runtime 0.864s
[hartzelg@rpbuchop001 go]$ GOROOT=`pwd` ./bin/go test runtime -run=TestCrashDumpsAllThreads
ok runtime 0.882s
[hartzelg@rpbuchop001 go]$ uptime
12:46:07 up 17 days, 23:14, 1 user, load average: 142.63, 145.37, 146.39
And then just to work on my lunch for a while (each iteration seems to take ~5 seconds, the load is ~160, most cores around 100%):
[hartzelg@rpbuchop001 go]$ while (true)
> do
> GOROOT=`pwd` ./bin/go test runtime -run=TestCrashDumpsAllThreads
> done
ok runtime 0.749s
ok runtime 0.830s
ok runtime 0.840s
ok runtime 0.678s
ok runtime 0.906s
ok runtime 0.917s
ok runtime 0.847s
ok runtime 0.727s
ok runtime 0.790s
ok runtime 0.785s
ok runtime 0.720s
ok runtime 1.013s
ok runtime 0.765s
ok runtime 0.850s
ok runtime 0.719s
ok runtime 1.070s
ok runtime 1.240s
ok runtime 0.900s
ok runtime 0.997s
ok runtime 0.864s
ok runtime 0.865s
ok runtime 0.982s
ok runtime 0.723s
ok runtime 0.904s
ok runtime 0.929s
ok runtime 0.955s
ok runtime 0.709s
ok runtime 0.779s
ok runtime 0.843s
ok runtime 0.860s
ok runtime 0.840s
ok runtime 0.887s
ok runtime 0.837s
ok runtime 0.773s
ok runtime 0.760s
ok runtime 1.047s
Is there some way that running the build inside Spack (roughly analogous to Homebrew...) could be making this fail? They do a bit of IO redirection magic, but I don't think there's any deep voodoo.
You should not set GOROOT if you have built from source.
[edit to add comment about what I was trying to do...]
@davecheney -- I'm not sure how else to do it. The Python code that did the build (via Spack) is here.
I came along afterwards and did this:
[hartzelg@rpbuchop001 go]$ ./bin/go version
go: cannot find GOROOT directory: /tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.7.5-r2ql2ftjxnlu4esctfokszedfrooyf63
Actually, I was trying to run the tests as @minux suggested; version is just a stand-in....
Setting GOROOT seemed like an expedient way to use the un-installed tree full of stuff that I'd built, and it appears (Danger Will Robinson...) to work.
What should I have done?
If your CI builds Go in a random directory and then moves the files after the build, it should set the variable GOROOT_FINAL to the final location.
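A sketch of what that looks like in a build script, with hypothetical paths (GOROOT_FINAL only records where the tree will eventually live; the copy itself is still up to the build system):

# Hypothetical paths: build in a scratch directory, bake in the final install location.
export GOROOT_FINAL=/opt/go-1.8
cd /tmp/build/go/src && ./all.bash
# After a successful build, copy the tree to /opt/go-1.8; binaries built this way
# look for their GOROOT there rather than in the scratch directory.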
Sure. But the problem is that I'm poking at a build that failed, so nothing got copied anywhere and I'm trying to run the intermediate product through one of the tests by hand to see if I can figure out the heisenbug.
Is it possible to take your build system out of the picture and build from source directly? I'm really concerned that setting GOROOT introduces the possibility of pointing the go tool at a different version of Go, which has a long history of causing weird bugs.
I can (and will) try to recreate the failure running the build by hand. That might be a useful simplification.
The build system does not use GOROOT (at least not in any way that I can suss out). The failures that happen there are almost certainly not related to GOROOT.
The build does set GOROOT_FINAL to the location where the tree will be copied if the build succeeds. That seems to be consistent with the documentation for building from source.
The only use of GOROOT, which you've latched on to, was me trying to figure out how to follow @minux's instructions on how to run the failing test by hand. I take your point to heart and believe that the results I get when I use GOROOT to run the tests within the staging directory might be misleading.
Given that I have a tree that's been built with GOROOT_FINAL set and that I am trying to debug a problem with that tree, what is my alternative to setting GOROOT?
I've been running the build by hand.
I have set ulimit -u 8192; otherwise I see this:
##### ../test
runtime: failed to create new OS thread (have 11 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
The build just succeeded 3 for 3. That's not as comforting as it might be; there have been periods where it has worked via Spack before.
The automated (Jenkins, Spack) build failed last night. The automated job has export GOMAXPROCS=4
because I was playing with it and left it in. The failure was the same one I've been seeing, with the smaller stack trace that results from setting GOMAXPROCS.
The machine has a long-running job that is keeping all of the cores mostly busy; it was running yesterday and last night and is running now. Load is around 140. Plenty of free RAM.
The Spack build framework does a bit of magic with filehandles (and possibly other things) to manage the tasks that it runs.
@minux mentions that the failing test (or the thing being tested) involves "signal trickery". Is it possible that the Spack framework is doing something that's getting in the way?
That seems fragile, and if it were the problem I'd expect that the Spack build would never succeed, but I'm grasping at straws, so....
I'm still casting about for a way to get a handle on this.
As I've detailed above, the core of my test case is building go via spack, like so:
spack install go@1.8 # 1.8 has been released since I started this thread....
I run a build every night as a Jenkins job. The Jenkins master is a Docker container and the job runs in a slave executor on a real Linux system (no VM, no container). That build seems to nearly always fail.
I sometimes checkout a Spack tree and run the command by hand, that seems to generally work these days (though when I filed this Issue my initial test case w/ 1.7.3 failed when run by hand).
I can also cd into the Spack clone that failed e.g. last night and run spack install go@1.8
and it works consistently (since 1.8 at least).
I've moved the build from the 144-core machine onto a 24-core box. When building by hand (at least) I no longer need to ulimit -u 8192. The Jenkins build still fails. Runs by hand still work.
Is there anything about runtime_test.TestCrashDumpsAllThreads
that might interact badly with a Jenkins master->slave ssh session?
I've pulled the go build out into a separate job that just does:
#!/bin/sh
set -e
spack install go@1.8
This builds a bootstrap go@1.4 and uses it to build 1.8.
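Roughly what that amounts to under the hood (sketched with illustrative paths; Spack's recipe handles the details itself):

# Hypothetical layout: GOROOT_BOOTSTRAP points at an already-built Go 1.4 tree.
export GOROOT_BOOTSTRAP=/tmp/go1.4-bootstrap
cd /tmp/go-1.8/src && ./all.bash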
It crashes. Here is a gist of an example crash. It seems to be a timeout but does not mention TestCrashDumpsAllThreads as the earlier ones did.
On the other hand, the next one (run without the -e) does mention TestCrashDumpsAllThreads; gist here.
I've reproduced this problem on a smaller machine and w/out involving Spack.
Jenkins seems to be central to the problem. I've opened a new issue: #19203.
I happen to have access to a large system and am using it to build a variety of packages. My go builds almost always fail. Details below.
What version of Go are you using (go version)?
Bootstrapping Go from here: https://storage.googleapis.com/golang/go1.4-bootstrap-20161024.tar.gz
Go from here: https://storage.googleapis.com/golang/go1.7.4.src.tar.gz
What operating system and processor architecture are you using (go env)?
What did you do?
Building the go recipe in Spack like so:
(which builds and uses a bootstrapping system from the 1.4 snapshot).
It boils down to '/usr/bin/bash' 'all.bash'.
I'm seeing this in a job running via Jenkins that simplifies to this:
What did you expect to see?
A successful build.
What did you see instead?
A panic during testing. Full output (1500+ lines) in this gist.
Here are the first few lines.
Some time later (15 minutes after the test failed), I still have a go-build-related process running:
and Jenkins thinks that the job is still running.