golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

runtime: FreeBSD memory corruption involving fork system call #15658

Closed: derekmarcotte closed this issue 5 years ago

derekmarcotte commented 8 years ago

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?

go version go1.6.2 freebsd/amd64

  2. What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="freebsd"
GOOS="freebsd"
GOPATH=""
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/freebsd_amd64"
GO15VENDOREXPERIMENT="1"
CC="cc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0"
CXX="clang++"
  3. What did you do?

package main

/* stdlib includes */
import (
        "fmt"
        "os/exec"
)

func run(done chan struct{}) {
        cmd := exec.Command("true")
        if err := cmd.Start(); err != nil {
                goto finished
        }

        cmd.Wait()

finished:
        done <- struct{}{}
        return
}

func main() {
        fmt.Println("Starting a bunch of goroutines...")

        // 8 & 16 are arbitrary
        done := make(chan struct{}, 16)

        for i := 0; i < 8; i++ {
                go run(done)
        }

        for {
                select {
                case <-done:
                        go run(done)
                }
        }
}
  4. What did you expect to see?

I expect this strange program to spawn instances of /bin/true in parallel, until I stop it.

  5. What did you see instead?

Various types of panics caused by what looks to be corruption within the finalizer lists, which I am assuming is the result of race conditions. These panics can happen as quickly as 2 minutes in, or take much longer. 10 minutes seems a good round number.

Occasionally addspecial gets stuck in an infinite loop holding the lock, and the process wedges. This is illustrated in log 1462933614, with x.next pointing to x. This appears to be corruption of that data structure. I have seen processes in this state run for 22 hours.
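
For illustration only, here is a minimal, self-contained sketch (not the runtime's actual code; the type below is hypothetical and greatly simplified) of why a list walk never reaches the end once a node's next pointer refers back to the node itself. The walk is bounded here so the example terminates instead of wedging:

package main

import "fmt"

// special mimics a singly linked list node such as the runtime's specials
// list entries (hypothetical; not the real runtime type).
type special struct {
	next *special
}

func main() {
	x := &special{}
	x.next = x // the corrupted state seen in log 1462933614: x.next points to x

	// A walk like the one in addspecial would never reach nil here and would
	// spin while holding the lock; we cap the iterations so this demo exits.
	steps := 0
	for n := x; n != nil && steps < 10; n = n.next {
		steps++
	}
	fmt.Printf("walked %d steps without reaching the end of the list\n", steps)
}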

I understand there is some trepidation expressed in issue #11485 around the locking of the data structures involved.

Here are some sample messages:

1462926841-SetFinalizer-ex1.txt 1462926969-SetFinalizer-ex2.txt 1462933295-nonempty-check-fails.txt 1462933614-wedged.txt

This was run on an 8-core processor, and on a 4-core, 8-thread processor with ECC RAM, with similar results on both.

Additionally, while this example is extreme, it also represents the core functionality of a project I've been working on part-time for many months. I'm happy to provide any further assistance diagnosing this issue - I'm very invested!

derekmarcotte commented 8 years ago

Here's another panic experienced in mallocgc by the same sample code:

1463046620-malloc-panic.txt

bradfitz commented 8 years ago

@derekmarcotte, can you also reproduce this at master? (which will become Go 1.7)

And do you only see it on FreeBSD, or other operating systems as well?

/cc @aclements @ianlancetaylor

aclements commented 8 years ago

@RLH, weren't you seeing "finalizer already set" failures a little while ago? Did you track that down? Could it have been related?

aclements commented 8 years ago

On closer inspection of the failures you posted (thanks for collecting several, BTW), this smells like memory corruption. Just guessing, there are a few likely culprits. It may be the finalizer code, but I actually think that's less likely. More likely is that it's the fork/exec code: that code is really subtle, mucks with the address space, and contains system-specific parts (which would explain why it's showing up on FreeBSD, but I haven't been able to reproduce it on Linux yet).

@derekmarcotte, can you try commenting out the runtime.SetFinalizer call in newProcess in os/exec.go (your test doesn't need that finalizer) and see if you can still reproduce it? If you can, that will rule out finalizers.

bradfitz commented 8 years ago

Note that FreeBSD runs via gomote, if this is that easily reproducible. I haven't yet tried.

derekmarcotte commented 8 years ago

Just got a golang/go dev environment set up on my machine (was from FreeBSD packages). Will report back soon.

derekmarcotte commented 8 years ago

Here's the heads of a bunch of logs with the epoch at the start of the process, so you can see the interval. I suspected a race rather than memory corruption because, by and large, it is the "finalizer already set" error that crashes the process. I thought maybe the GC was marking these objects as free (or otherwise touching them) before SetFinalizer had a chance to set their value.

I didn't include too many of them in my initial report, because I thought they were largely redundant.

@bradfitz: these logs are against master:

1463168442
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set

runtime stack:
runtime.throw(0x4c366d, 0x2b)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0xc420059f48 sp=0xc420059f30
runtime.SetFinalizer.func2()
        /home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0xc420059f80 sp=0xc420059f48
runtime.systemstack(0xc420019500)
        /home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0xc420059f88 sp=0xc420059f80

1463170469
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set

runtime stack:
runtime.throw(0x4c366d, 0x2b)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0x7fffffffea80 sp=0x7fffffffea68
runtime.SetFinalizer.func2()
        /home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0x7fffffffeab8 sp=0x7fffffffea80
runtime.systemstack(0x52ad00)
        /home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0x7fffffffeac0 sp=0x7fffffffeab8

1463170494
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set

runtime stack:
runtime.throw(0x4c366d, 0x2b)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0xc420067f48 sp=0xc420067f30
runtime.SetFinalizer.func2()
        /home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0xc420067f80 sp=0xc420067f48
runtime.systemstack(0xc420019500)
        /home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0xc420067f88 sp=0xc420067f80

1463171133
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set

runtime stack:
runtime.throw(0x4c366d, 0x2b)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0xc4202cbf48 sp=0xc4202cbf30
runtime.SetFinalizer.func2()
        /home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0xc4202cbf80 sp=0xc4202cbf48
runtime.systemstack(0xc42001c000)
        /home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0xc4202cbf88 sp=0xc4202cbf80

@aclements: I will try your patch next. One variable at a time.

RLH commented 8 years ago

The "fatal error: runtime.SetFinalizer: finalizer already set" bug I was seeing a few days ago were on the 1.8 dev.garbage branch and the result of a TOC write barrier not marking an object as being published, TOC doing a rollback, and reallocating a new object over an old one. Both objects had finalizers and the bug was tripped.

None of the TOC code is in 1.7 or, for that matter, on tip. I can force myself to imagine that the 1.7 allocCache code could cause a similar situation if the allocCache and bitmaps were not coherent, but if that were the case I would expect lots of failures all over the place and across all platforms, not just FreeBSD.

derekmarcotte commented 8 years ago

@aclements: I've run it with your patch, although I haven't been able to babysit it too much.

The first time, all threads were idle after a number of hours (i.e. 0% cpu across the board). Connecting gdb to that process gave me trouble, and I couldn't get any logging out of it.

This morning, I was able to connect to a different process that looks a lot like 1462933614-wedged.txt. I've attached a log from gdb there:

1463270694.txt

Will keep trying to come up with more info.

derekmarcotte commented 8 years ago

@aclements: Here are some more logs from a binary built with the patch:

1463315708-finalizer-already-set.txt 1463349807-finalizer-already-set.txt 1463352601-workbuf-not-empty.txt 1463362849-workbuf-empty.txt 1463378745-wedged-gdb.txt

Please let me know if I can be of further assistance.

aclements commented 8 years ago

Thanks for the logs! I hadn't realized there were two finalizers involved here. Could you also comment out the SetFinalizer in NewFile in os/file_unix.go and see if it's still reproducible? (Your test also doesn't need that finalizer.)

aclements commented 8 years ago

I suspected a race rather than memory corruption because, by and large, it is the "finalizer already set" error that crashes the process.

I didn't mean to say that it isn't necessarily a race. It's actually quite likely a race, but it's resulting in corruption of internal runtime structures, which suggests that the race is happening on freed memory. The "workbuf is empty" failure mode especially points at memory corruption, which is why my initial guess is that the finalizers (and the specials queue in general) may be victims rather than perpetrators. It's also easy to get finalizers out of the picture, while it's harder to get fork/exec out of the picture without completely changing the program. :)

derekmarcotte commented 8 years ago

Thanks @aclements !

One crash, ~4 hours into running, since removing the SetFinalizer in NewFile.

1463429459.txt

I have a second process running for almost 11 hours, with 4 of the threads wedged, but it is still doing work.

derekmarcotte commented 8 years ago

After ~11 hours + 5 minutes, the process panic'd:

1463445934-invalid-p-state.txt

aclements commented 8 years ago

Thanks for the new logs. I suspect that's actually not the same failure, which suggests that the original problem is in fact related to finalizers.

derekmarcotte commented 8 years ago

My pleasure, thanks for the feedback. Are there any next steps for me? (I'm likely to poke around this bottom issue, although I'm neither a go runtime guy, nor a FreeBSD systems-level guy - yet. Would like to be as helpful as possible.)

Thanks again!

derekmarcotte commented 8 years ago

I'm going to post a few more here. This'll be my last batch, unless requested. :smile: 1463584079 is a new message.

==> 1463485780 <==
Starting a bunch of goroutines...

fatal error: workbuf is not empty

runtime stack:
runtime.throw(0x4bff02, 0x14)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b
runtime.(*workbuf).checkempty(0x8007dae00)
        /home/derek/go/src/github.com/golang/go/src/runtime/mgcwork.go:301 +0x3f
runtime.getempty(0x8007dae00)

==> 1463584079 <==
Starting a bunch of goroutines...
fatal error: MHeap_AllocLocked - MSpan not free

runtime stack:
runtime.throw(0x4c2528, 0x22)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b
runtime.(*mheap).allocSpanLocked(0x52dbe0, 0x1, 0xc4200ae1a0)
        /home/derek/go/src/github.com/golang/go/src/runtime/mheap.go:637 +0x498
runtime.(*mheap).alloc_m(0x52dbe0, 0x1, 0x12, 0xc420447fe0)
        /home/derek/go/src/github.com/golang/go/src/runtime/mheap.go:510 +0xd6

==> 1463603516 <==
Starting a bunch of goroutines...
acquirep: p->m=842351485952(10) p->status=1
acquirep: p->m=842350813184(4) p->status=1
fatal error: acquirep: invalid p state
fatal error: acquirep: invalid p state

runtime stack:
runtime.throw(0x4c0ab3, 0x19)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b
runtime.acquirep1(0xc420015500)

==> 1463642257 <==
Starting a bunch of goroutines...
acquirep: p->m=0(0) p->status=2
fatal error: acquirep: invalid p state
acquirep: p->m=0(0) p->status=3
fatal error: acquirep: invalid p state

runtime stack:
runtime.throw(0x4c0ab3, 0x19)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b
runtime.acquirep1(0xc42001c000)

aclements commented 8 years ago

Thanks! I assume these last few failures are also with master and with the two SetFinalizer calls commented out?

derekmarcotte commented 8 years ago

That's correct! Thanks. Anyone else able to reproduce?

aclements commented 8 years ago

@derekmarcotte, what version of FreeBSD and how many CPUs are you running with? I haven't had any luck yet reproducing on FreeBSD 10.1 with 2 CPUs.

derekmarcotte commented 8 years ago

@aclements: both machines were 4 core, 8 thread CPUs:

Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz (32GB/ECC) - none of the included logs are from this machine, but I could not keep my process (from my project, not the listing in this issue) running on it for very long

AMD FX(tm)-8350 Eight-Core Processor (8GB) - this is my dev box where all the logs are included

The Xeon is running 10.3-RELEASE, and the AMD was running 10.1-RELEASE at the time of the logs (has since been upgraded to 10.3-RELEASE).

I suspect these hosts can chew through many more invocations in the same amount of time than a 2-core machine could, and additionally increase the probability of contention/collisions in any given instant.

The Xeon has since moved to production, so I don't have that hardware at my disposal for the time being, although I might be able to arrange something if it's required.

I can get dmesg/kldstat output for the Xeon and the AMD machines if helpful (I would rather post it out of band).

aclements commented 8 years ago

@derekmarcotte, thanks for the extra details (I see now you already gave the CPU configuration; sorry I missed that).

Two more experiments to try:

  1. See if you can reproduce with GOMAXPROCS=2 (either 1.6.2 or master is fine).
  2. Try the same test program, but invoking an absolute path to a command that doesn't exist (and ignoring the error). This should get the exec out of the picture.
derekmarcotte commented 8 years ago

@aclements, Thanks for your suggestions. I'm currently exploring a different option. I was using gb for building my project (sorry! forgot to mention), and additionally for this test case.

I certainly didn't expect wildly differing binaries in a project with no external dependencies, as gb uses the go compiler internally. I've got more research to do here to account for this, so I apologize for that.

I've built using go directly and am in the process of testing. So far it has been running for 12 hours, without problem (with the SetFinalizers disabled). I have had previous test runs last this long, so I'm not ready to call it a success just yet. I'll be out of town for the next few days, so I can leave it running for a while and see where it ends up.

I think this is a very promising lead, based on the objdump of the two artefacts. It might be interesting to ask about the build tool in the issue report template, given the build-tool ecosystem that is currently out there (or the fact that there is an ecosystem at all).

derekmarcotte commented 8 years ago

@aclements, rebuilding gb from source using the go built from master, and then rebuilding the test with the new gb, creates nearly identical binaries (minus file-path TEXT entries), which is to be expected.

Perhaps there's something to this. Will keep you posted.

josharian commented 8 years ago

cc @davecheney

derekmarcotte commented 8 years ago

@aclements, the go-only build of the binary did eventually crash... somewhere around 24 hours (~153M goroutines later), so I don't think it's gb-related. I'm going to re-build with the following, per your suggestion:

func run(done chan struct{}) {
        cmd := exec.Command("doesnotexist")
        cmd.Wait()

        done <- struct{}{}
        return
}
derekmarcotte commented 8 years ago

@aclements, after the above change, it ran for 4 days without a panic.

aclements commented 8 years ago

@derekmarcotte, argh, I'm sorry. That wasn't a useful test. :( That eliminates the fork, the exec, and the finalizer, so there isn't really anything left. It needs to be something like:

func run(done chan struct{}) {
    cmd := exec.Command("/doesnotexist")
    cmd.Start()
    cmd.Wait()

    done <- struct{}{}
    return
}

It's important that the path be absolute. Otherwise, Go will try to resolve the path itself, fail at that point, and not do anything for the subsequent method calls. It's also important that the cmd.Start() still happen (even though it will fail). Again, without that, the Wait just returns nil immediately.
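
As a quick way to see the difference, here is a sketch (exact error strings vary by Go version and platform, and neither command name is expected to exist):

package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Relative name with no path separator: Go resolves it via PATH when the
	// Cmd is created, Start returns that lookup error without ever forking,
	// and Wait then has nothing to wait for.
	rel := exec.Command("doesnotexist")
	fmt.Println("relative Start:", rel.Start())
	fmt.Println("relative Wait: ", rel.Wait())

	// Absolute path: Start really forks, the exec fails in the child, and the
	// fork/exec error comes back from Start. This is the code path the test
	// needs to exercise.
	abs := exec.Command("/doesnotexist")
	fmt.Println("absolute Start:", abs.Start())
	fmt.Println("absolute Wait: ", abs.Wait())
}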

I'm sorry I didn't catch that earlier.

derekmarcotte commented 8 years ago

@aclements, at least we know that Go can spin for 4 days without crashing on my dev machine - so not a total waste.

I've recompiled with the above, and it crashed after ~32 hours. Log attached:

doesnotexists.txt

aclements commented 8 years ago

@derekmarcotte, thanks! Was that with go or gb? With or without the SetFinalizer calls?

aclements commented 8 years ago

Also, 1.6.2 or master?

derekmarcotte commented 8 years ago

@aclements great questions. I was thinking we ought to recap all the variables at this point anyway.

Trying to narrow things down as much as possible. So, this is with go build using master (as of 7af2ce3), with this diff applied:

diff --git a/src/os/exec.go b/src/os/exec.go
index 239fd92..6a8eed5 100644
--- a/src/os/exec.go
+++ b/src/os/exec.go
@@ -5,7 +5,6 @@
 package os

 import (
-       "runtime"
        "sync/atomic"
        "syscall"
 )
@@ -19,7 +18,6 @@ type Process struct {

 func newProcess(pid int, handle uintptr) *Process {
        p := &Process{Pid: pid, handle: handle}
-       runtime.SetFinalizer(p, (*Process).Release)
        return p
 }

diff --git a/src/os/file_unix.go b/src/os/file_unix.go
index 9b64f21..a997e9e 100644
--- a/src/os/file_unix.go
+++ b/src/os/file_unix.go
@@ -54,7 +54,6 @@ func NewFile(fd uintptr, name string) *File {
                return nil
        }
        f := &File{&file{fd: fdi, name: name}}
-       runtime.SetFinalizer(f.file, (*file).close)
        return f
 }

with source:

package main

/* stdlib includes */
import (
    "fmt"
    "os/exec"
)

func run(done chan struct{}) {
    cmd := exec.Command("/doesnotexist")
    cmd.Start()
    cmd.Wait()

    done <- struct{}{}
    return
}

func main() {
    fmt.Println("Starting a bunch of goroutines...")

    // 8 & 16 are arbitrary
    done := make(chan struct{}, 16)

    for i := 0; i < 8; i++ {
        go run(done)
    }

    for {
        select {
        case <-done:
            go run(done)
        }
    }
}
ianlancetaylor commented 8 years ago

It sounds like this is FreeBSD-specific, so marking as such. Please correct me if I am wrong.

derekmarcotte commented 8 years ago

@ianlancetaylor : thanks for this. I'm just building a linux box dedicated to testing tonight. I haven't yet been able to test on linux. I'll update the thread as I find more information out.

kaey commented 8 years ago

Looks like I'm having the same issue with github.com/influxdb/telegraf. Panic log attached.

acquirep: p->m=859531515904(14) p->status=1
fatal error: acquirep: invalid p state
acquirep: p->m=859535287296(8) p->status=1
fatal error: acquirep: invalid p state
% uname -a
FreeBSD 9.3-RELEASE-p5 FreeBSD 9.3-RELEASE-p5 #0: Mon Nov  3 22:38:58 UTC 2014     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64
% cat /var/run/dmesg.boot | grep -i cpu
CPU: Intel(R) Xeon(R) CPU E3-1270 V2 @ 3.50GHz (3492.14-MHz K8-class CPU)
FreeBSD/SMP: Multiprocessor System Detected: 8 CPUs
go version go1.6.2 freebsd/amd64

I'll try to rebuild with go tip.

panic.txt

stevenh commented 8 years ago

@derekmarcotte are your test machines metal or VMs?

I've just kicked off a test on 10.2-RELEASE box here to see if I can repro locally.

stevenh commented 8 years ago

It took many hours but I did get a panic in the end.

This is from go 1.6.2 (no patches) on 10.2-RELEASE amd64 (metal) no Virtualisation

./test 
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set

runtime stack:
runtime.SetFinalizer.func2()
        /usr/local/go/src/runtime/mfinal.go:372 +0x6f

goroutine 23221055 [running]:
runtime.systemstack_switch()
        /usr/local/go/src/runtime/asm_amd64.s:245 fp=0xc8202eca58 sp=0xc8202eca50
runtime.SetFinalizer(0x4f2f00, 0xc8204a19e0, 0x4d68a0, 0x54f550)
        /usr/local/go/src/runtime/mfinal.go:374 +0x4b5 fp=0xc8202ecbd8 sp=0xc8202eca58
os.NewFile(0x3, 0x51efd0, 0x9, 0x323200000000)
        /usr/local/go/src/os/file_unix.go:57 +0xfc fp=0xc8202ecc30 sp=0xc8202ecbd8
os.OpenFile(0x51efd0, 0x9, 0x0, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/os/file_unix.go:123 +0x1bd fp=0xc8202ecca8 sp=0xc8202ecc30
os.Open(0x51efd0, 0x9, 0xdeaddeaddeaddead, 0x0, 0x0)
        /usr/local/go/src/os/file.go:244 +0x48 fp=0xc8202ecce8 sp=0xc8202ecca8
os/exec.(*Cmd).stdin(0xc8203eac80, 0x0, 0x0, 0x0)
        /usr/local/go/src/os/exec/exec.go:171 +0x6e fp=0xc8202ecd80 sp=0xc8202ecce8
os/exec.(*Cmd).Start(0xc8203eac80, 0x0, 0x0)
        /usr/local/go/src/os/exec/exec.go:316 +0x2f4 fp=0xc8202ecf68 sp=0xc8202ecd80
main.run(0xc82006c060)
        /data/go/src/github.com/multiplay/go/apps/test/main.go:11 +0x50 fp=0xc8202ecfa8 sp=0xc8202ecf68
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1998 +0x1 fp=0xc8202ecfb0 sp=0xc8202ecfa8
created by main.main
        /data/go/src/github.com/multiplay/go/apps/test/main.go:31 +0x19b
kaey commented 8 years ago

are your test machines metal or VMs?

For me, telegraf crashes on both vmware VMs and baremetal machines.

derekmarcotte commented 8 years ago

@stevenh: So it's random, and I think the probability of crashing is a function of the number of CPUs being used; while I say 8 is arbitrary, it's also the number of CPUs I had on hand. If you have more, by all means, set it higher.

Try it a few times, and you'll see. I just stuck it in a loop like while true; do date; ./ex-from-comment-225006179; done, and then you can see them over a day period.

Anecdotally, I've been able to run a longer-lived process with GOMAXPROCS=2, but will be doing more thorough testing.

Most of those dumps are from a jail running on bare metal. The Xeon was straight on the jail host (although there were jails running).

derekmarcotte commented 8 years ago

@stevenh: Oh, and the one that uses /doesnotexist crashes after 32 hours, but the one that uses true crashes much sooner (because it appears there are multiple issues going on).

stevenh commented 8 years ago

With GOMAXPROCS=1 and "true" it's been running for 20+ hours so far, so it does indeed seem to be affected by the number of procs available.

derekmarcotte commented 8 years ago

I've been running /doesnotexist with stock go 1.6.2 for just shy of a week with GOMAXPROCS="2". Interestingly, there are 11 OS threads, which is surprising to me, but that may be expected.

derekmarcotte commented 8 years ago

I've been able to cause the panic much more quickly by being more aggressive with the garbage collector. I've been using this script to test:

while true; do 
  echo "==== NEW RUN $(date) ===="
  echo
  time sh -c 'export GOGC=5; export GODEBUG=gctrace=2,schedtrace=100; ./15658-doesnotexit'
  echo
  echo
done >> ../logs/15658-doesnotexit-logs-gogc5-gtrace2 2>> ../logs/15658-doesnotexit-logs-gogc5-gtrace2
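
The GOGC=5 part can also be forced from inside the test program itself via runtime/debug, which may be convenient when the environment is awkward to control; a sketch (assuming the goroutine-spawning code from the earlier listing):

package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Equivalent to running with GOGC=5: trigger a collection once the heap
	// has grown 5% past the live data left by the previous cycle.
	old := debug.SetGCPercent(5)
	fmt.Println("previous GOGC value:", old)

	// ... start the reproducer goroutines here, as in the earlier listing ...
}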

Here's a sample of the crashing interval:

==== NEW RUN Tue Aug 30 16:36:11 EDT 2016 ====
==== NEW RUN Tue Aug 30 16:37:18 EDT 2016 ====
==== NEW RUN Tue Aug 30 17:13:02 EDT 2016 ====
==== NEW RUN Tue Aug 30 17:28:30 EDT 2016 ====
==== NEW RUN Tue Aug 30 17:54:22 EDT 2016 ====
==== NEW RUN Tue Aug 30 18:04:29 EDT 2016 ====
==== NEW RUN Tue Aug 30 18:36:36 EDT 2016 ====
==== NEW RUN Tue Aug 30 18:57:48 EDT 2016 ====
==== NEW RUN Tue Aug 30 19:09:12 EDT 2016 ====
==== NEW RUN Tue Aug 30 19:37:33 EDT 2016 ====
==== NEW RUN Tue Aug 30 19:42:37 EDT 2016 ====
==== NEW RUN Tue Aug 30 19:52:31 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:03:22 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:07:05 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:10:31 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:25:37 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:31:00 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:31:28 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:34:30 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:38:16 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:40:09 EDT 2016 ====
==== NEW RUN Tue Aug 30 20:56:57 EDT 2016 ====
==== NEW RUN Tue Aug 30 21:31:54 EDT 2016 ====
==== NEW RUN Tue Aug 30 21:39:05 EDT 2016 ====
==== NEW RUN Tue Aug 30 21:51:06 EDT 2016 ====
==== NEW RUN Tue Aug 30 22:28:44 EDT 2016 ====
==== NEW RUN Tue Aug 30 22:32:35 EDT 2016 ====
==== NEW RUN Tue Aug 30 22:53:50 EDT 2016 ====
==== NEW RUN Tue Aug 30 22:55:44 EDT 2016 ====
==== NEW RUN Tue Aug 30 22:57:44 EDT 2016 ====
==== NEW RUN Tue Aug 30 22:58:08 EDT 2016 ====
==== NEW RUN Tue Aug 30 22:59:58 EDT 2016 ====
==== NEW RUN Tue Aug 30 23:52:15 EDT 2016 ====
==== NEW RUN Wed Aug 31 00:28:37 EDT 2016 ====
==== NEW RUN Wed Aug 31 00:46:25 EDT 2016 ====
==== NEW RUN Wed Aug 31 00:53:03 EDT 2016 ====
==== NEW RUN Wed Aug 31 00:57:34 EDT 2016 ====
==== NEW RUN Wed Aug 31 01:06:02 EDT 2016 ====
==== NEW RUN Wed Aug 31 01:13:20 EDT 2016 ====
==== NEW RUN Wed Aug 31 01:21:55 EDT 2016 ====
==== NEW RUN Wed Aug 31 02:17:49 EDT 2016 ====
==== NEW RUN Wed Aug 31 02:55:35 EDT 2016 ====
==== NEW RUN Wed Aug 31 03:43:36 EDT 2016 ====
==== NEW RUN Wed Aug 31 03:47:24 EDT 2016 ====
==== NEW RUN Wed Aug 31 03:48:14 EDT 2016 ====
==== NEW RUN Wed Aug 31 03:50:32 EDT 2016 ====
==== NEW RUN Wed Aug 31 03:59:05 EDT 2016 ====
==== NEW RUN Wed Aug 31 04:43:24 EDT 2016 ====
==== NEW RUN Wed Aug 31 04:55:07 EDT 2016 ====
==== NEW RUN Wed Aug 31 05:04:41 EDT 2016 ====
==== NEW RUN Wed Aug 31 05:34:44 EDT 2016 ====
==== NEW RUN Wed Aug 31 05:50:45 EDT 2016 ====
==== NEW RUN Wed Aug 31 05:53:55 EDT 2016 ====
==== NEW RUN Wed Aug 31 05:58:08 EDT 2016 ====
==== NEW RUN Wed Aug 31 06:19:07 EDT 2016 ====
==== NEW RUN Wed Aug 31 06:34:13 EDT 2016 ====
==== NEW RUN Wed Aug 31 06:41:30 EDT 2016 ====
==== NEW RUN Wed Aug 31 07:52:08 EDT 2016 ====
==== NEW RUN Wed Aug 31 07:54:32 EDT 2016 ====
==== NEW RUN Wed Aug 31 07:56:02 EDT 2016 ====
==== NEW RUN Wed Aug 31 07:59:04 EDT 2016 ====
==== NEW RUN Wed Aug 31 07:59:48 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:01:46 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:06:30 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:23:03 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:29:57 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:41:58 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:43:14 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:45:00 EDT 2016 ====
==== NEW RUN Wed Aug 31 08:59:22 EDT 2016 ====
==== NEW RUN Wed Aug 31 09:24:11 EDT 2016 ====
==== NEW RUN Wed Aug 31 10:01:22 EDT 2016 ====
==== NEW RUN Wed Aug 31 10:05:45 EDT 2016 ====
==== NEW RUN Wed Aug 31 10:09:24 EDT 2016 ====
==== NEW RUN Wed Aug 31 10:12:51 EDT 2016 ====
==== NEW RUN Wed Aug 31 10:23:35 EDT 2016 ====
^^^^ process was idle/wedged, sent kill manually
==== NEW RUN Thu Sep 1 06:47:27 EDT 2016 ====
==== NEW RUN Thu Sep 1 06:49:27 EDT 2016 ====

$ go version
go version go1.7 freebsd/amd64

$ go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="freebsd"
GOOS="freebsd"
GOPATH="/home/derek/dev/gopath"
GORACE=""
GOROOT="/home/derek/go"
GOTOOLDIR="/home/derek/go/pkg/tool/freebsd_amd64"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -gno-record-gcc-switches"
CXX="clang++"
CGO_ENABLED="1"
reezer commented 8 years ago

I am trying to reproduce this on DragonFly BSD. So far I was able to reproduce the idle state:

localhost% uname -a
DragonFly localhost.localdomain 4.6-RELEASE DragonFly v4.6.0-RELEASE #0: Mon Aug  1 12:46:25 EDT 2016     root@www.shiningsilence.com:/usr/obj/build/sys/X86_64_GENERIC  x86_64
localhost% go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="dragonfly"
GOOS="dragonfly"
GOPATH=""
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/dragonfly_amd64"
CC="cc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build276771230=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
localhost% go version
go version go1.7 dragonfly/amd64

SCHED 783590ms: gomaxprocs=2 idleprocs=0 threads=8 spinningthreads=0 idlethreads=5 runqueue=8 [0 1]

I have been running this test inside Virtualbox, with 2 virtual CPUs.

reezer commented 8 years ago

On DragonFly it's actually causing a kernel panic, so I created a bug report there. It includes the dump.

https://bugs.dragonflybsd.org/issues/2949

rsc commented 8 years ago

Cannot reproduce on the freebsd-amd64-gce101 gomote running FreeBSD 10.1. The DragonFly panic suggests that there may be a similar kernel problem in FreeBSD. Unlikely we are going to solve this.

derekmarcotte commented 8 years ago

Thanks very much for taking a look. I'm very grateful for your time.

Are you able to elaborate more fully on what you mean by "Unlikely we are going to solve this"? Are you referring to the issue in general, or to the DragonFly kernel crash? Is the "we" the Go team?

I believe the implication of a kernel problem in FreeBSD, because of a panic in DragonFly, simply doesn't follow. I can give a lot more detail about this, but I fear it's redundant, unless requested.

For what it's worth, this issue is reliably reproducible enough that I've been able to bisect down to a commit that makes sense (https://github.com/golang/go/commit/e6d8bfe218a5f387d7aceddcaee5067a59181838), in my first go. Additionally, the commits leading up to the implicated commit in the bisect are documentation-only, and don't reproduce the behaviour. I have very high confidence in the bisect.

I haven't published this work yet, as I've been asking some new questions based on the bisect. I was hoping to be able to speak intelligently about this when I reported results, but I feel the pressure has now increased. I've been making progress slowly, but I am making progress (this isn't my daytime gig). I will post a repo later today with the work and the bisect log, so that others can reference it.

Here are my thoughts:

New lines of thought are:

I plan to investigate these questions, but again, this isn't my daytime gig.

As I mentioned earlier, I'm very happy to work with others on this.

I'd be very interested in more details on the environment you were unable to reproduce in. Specifically, how many CPUs? Is the gomote a virtualized environment, or do you have direct access to hardware somewhere? Perhaps we can compare build artefacts, i.e. does your artefact crash in my environment, and vice versa?

Thanks again for taking a look.

bradfitz commented 8 years ago

@derekmarcotte, thanks for investigating. gomote is our interface to freshly-created FreeBSD VMs on GCE.

See https://github.com/golang/build/blob/master/env/freebsd-amd64/make.bash and https://github.com/golang/build/blob/master/dashboard/builders.go#L81

They're both n1-highcpu-4, which is https://cloud.google.com/compute/docs/machine-types#standard_machine_types ... "Standard 4 CPU machine type with 4 virtual CPUs and 15 GB of memory. ... For the n1 series of machine types, a virtual CPU is implemented as a single hardware hyper-thread on a 2.6 GHz Intel Xeon E5 (Sandy Bridge), 2.5 GHz Intel Xeon E5 v2 (Ivy Bridge), 2.3 GHz Intel Xeon E5 v3 (Haswell), or 2.2 GHz Intel Xeon E5 v4 (Broadwell)"

So it's at least 2 cores.

ianlancetaylor commented 8 years ago

@derekmarcotte Thanks for continuing to investigate. I'm fairly convinced that this is a kernel problem, and your identification of that CL as the cause just makes that seems more likely to me. I have never been able to recreate the problem myself. I suspect that it requires real hardware, not a VM.

reezer commented 8 years ago

@rsc Actually the kernel bug causing the panic existed only for a small amount of time on DragonFly. It most likely was introduced with DragonFly 4.6.0. Using 4.6.1 I fail to reproduce it. So I am sorry for laying a false trail there.

I will keep it running though, to make sure nothing happens, but I already tested it far longer than originally.

Also as a side note, since the topic of VMs came up: All my tests have been done using Virtualbox.