golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
124.03k stars 17.67k forks source link

runtime: FreeBSD memory corruption involving fork system call #15658

Closed derekmarcotte closed 5 years ago

derekmarcotte commented 8 years ago

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?

go version go1.6.2 freebsd/amd64

  1. What operating system and processor architecture are you using (go env)?
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="freebsd"
GOOS="freebsd"
GOPATH=""
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/freebsd_amd64"
GO15VENDOREXPERIMENT="1"
CC="cc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0"
CXX="clang++"
  1. What did you do?
package main

/* stdlib includes */
import (
        "fmt"
        "os/exec"
)

func run(done chan struct{}) {
        cmd := exec.Command("true")
        if err := cmd.Start(); err != nil {
                goto finished
        }

        cmd.Wait()

finished:
        done <- struct{}{}
        return
}

func main() {
        fmt.Println("Starting a bunch of goroutines...")

        // 8 & 16 are arbitrary
        done := make(chan struct{}, 16)

        for i := 0; i < 8; i++ {
                go run(done)
        }

        for {
                select {
                case <-done:
                        go run(done)
                }
        }
}
  1. What did you expect to see?

I expect this strange program to spawn instances of /bin/true in parallel, until I stop it.

  1. What did you see instead?

Various types of panics caused by what looks to be corruption within the finalizer lists, caused by what I am assuming is based on race conditions. These panics can happen as quickly as 2 minutes, or much longer. 10 minutes seems a good round number.

Occasionally addspecial gets stuck in an infinite loop holding the lock, and the process wedges. This is illustrated in log 1462933614, with x.next pointing to x. This appears to be corruption of that data structure. I have seen processes in this state run for 22 hours.

I understand there is some trepidation expressed in issue #11485 around the locking of the data structures involved.

Here are some sample messages:

1462926841-SetFinalizer-ex1.txt 1462926969-SetFinalizer-ex2.txt 1462933295-nonempty-check-fails.txt 1462933614-wedged.txt

This was run on an 8-core processor, and a 4-core 8-thread processor with ECC RAM, similar results.

Additionally, while this example is an extreme, it also represents the core functionality of a project I've been working on part-time for many months. I'm happy to provide any further assistance diagnosing this issue - I'm very invested!

derekmarcotte commented 8 years ago

Here's the repo, as promised: https://github.com/derekmarcotte/goruntime-testing

@ianlancetaylor: Thanks for checking it out. Would you be interested in running 15658-doesnotexist.go to completion? Should be on the order of 8-12 hours, maybe more? If I could suggest pulling the command line from the bisect-15658, to get the environment a bit more hostile. Check the logs directory for the bisect.

If there were a kernel problem, I would expect others to have problems with say, inetd (because it is a glorified fork/exec machine), in the wild. This would be a pretty hot code path for many production FreeBSD uses. Additionally, when looking at whether this is a kernel problem, the kernel data structures themselves would become corrupt (and panic like DragonFly), but I see no evidence of this. My main testing machine has a current uptime of 105 days, with probably hundreds of runtime panics, yet no other services have faltered, and everything behaves flawlessly.

stevenh commented 8 years ago

I've been running this now on 1.7.1 for 12h+ now on an 24 core E5-2640 running 8.2-RELEASE no hang or panic:

package main

/* stdlib includes */
import (
        "fmt"
        "os/exec"
        "runtime"
)

func run(done chan struct{}) {
        cmd := exec.Command("true")
        cmd.Start()
        cmd.Wait()

        done <- struct{}{}
        return
}

func main() {
        fmt.Println("Starting a bunch of goroutines...")
        fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))

        // 8 & 16 are arbitrary
        done := make(chan struct{}, 16)

        for i := 0; i < 8; i++ {
                go run(done)
        }

        for range done {
                go run(done)
        }
}
rsc commented 8 years ago

It's not uncommon for Go programs to behave significantly different from C programs, to the extent that they expose problems that have been latent in kernels for many years. I'm not saying I'm 100% sure that's going on here, but it's definitely possible, especially given that fork/exec seems to be required.

reezer commented 7 years ago

Using stevenh's code I received this seemingly new panic.

OS: FreeBSD 11.0 CPU: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz (2394.61-MHz K8-class CPU)

GOGC=5 ./test
Starting a bunch of goroutines...
GOMAXPROCS: 8
runtime: failed MSpanList_Remove 0x800528320 0x8005c7d20 0x53b670 0x53b660
fatal error: MSpanList_Remove

runtime stack:
runtime.throw(0x4c973e, 0x10)
    /usr/local/go/src/runtime/panic.go:566 +0x95
runtime.(*mSpanList).remove(0x53b660, 0x800528320)
    /usr/local/go/src/runtime/mheap.go:1001 +0x19d
runtime.(*mcentral).cacheSpan(0x53b650, 0xc42001b228)
    /usr/local/go/src/runtime/mcentral.go:55 +0x3d0
runtime.(*mcache).refill(0x800524e10, 0xc40000001e, 0xc4200b82e0)
    /usr/local/go/src/runtime/mcache.go:121 +0xae
runtime.(*mcache).nextFree.func1()
    /usr/local/go/src/runtime/malloc.go:505 +0x33
runtime.systemstack(0xc42007cee0)
    /usr/local/go/src/runtime/asm_amd64.s:298 +0x79
runtime.mstart()
    /usr/local/go/src/runtime/proc.go:1079

goroutine 643553 [running]:
runtime.systemstack_switch()
    /usr/local/go/src/runtime/asm_amd64.s:252 fp=0xc4201e68e0 sp=0xc4201e68d8
runtime.(*mcache).nextFree(0x800524e10, 0x1e, 0x0, 0x40f354, 0x80064429b)
    /usr/local/go/src/runtime/malloc.go:506 +0xb2 fp=0xc4201e6938 sp=0xc4201e68e0
runtime.mallocgc(0x380, 0x4a8020, 0x1, 0xc4200b82e0)
    /usr/local/go/src/runtime/malloc.go:658 +0x809 fp=0xc4201e69d8 sp=0xc4201e6938
runtime.makeslice(0x4a8020, 0x338, 0x338, 0xc4200b82e0, 0x1f, 0x40e996)
    /usr/local/go/src/runtime/slice.go:57 +0x7b fp=0xc4201e6a30 sp=0xc4201e69d8
syscall.ByteSliceFromString(0xc420010000, 0x337, 0xc4201064d8, 0xc4200b82e0, 0x1f, 0x0, 0x0)
    /usr/local/go/src/syscall/syscall.go:53 +0x64 fp=0xc4201e6a90 sp=0xc4201e6a30
syscall.BytePtrFromString(0xc420010000, 0x337, 0xc4200b82e0, 0x0, 0x0)
    /usr/local/go/src/syscall/syscall.go:69 +0x39 fp=0xc4201e6ad8 sp=0xc4201e6a90
syscall.SlicePtrFromStrings(0xc42000c840, 0x15, 0x15, 0xc4200881f0, 0x2, 0x2, 0x0, 0x0)
    /usr/local/go/src/syscall/exec_unix.go:87 +0x9a fp=0xc4201e6b48 sp=0xc4201e6ad8
syscall.forkExec(0xc420088120, 0xd, 0xc420088010, 0x1, 0x1, 0xc4201e6d60, 0xc420162018, 0x40e996, 0xc4201e6cc8)
    /usr/local/go/src/syscall/exec_unix.go:158 +0x16e fp=0xc4201e6c78 sp=0xc4201e6b48
syscall.StartProcess(0xc420088120, 0xd, 0xc420088010, 0x1, 0x1, 0xc4201e6d60, 0x2, 0x4, 0x44ab20, 0xc4200f0280)
    /usr/local/go/src/syscall/exec_unix.go:240 +0x64 fp=0xc4201e6cd0 sp=0xc4201e6c78
os.startProcess(0xc420088120, 0xd, 0xc420088010, 0x1, 0x1, 0xc4201e6f08, 0xc42000c840, 0x15, 0x15)
    /usr/local/go/src/os/exec_posix.go:45 +0x1a3 fp=0xc4201e6db8 sp=0xc4201e6cd0
os.StartProcess(0xc420088120, 0xd, 0xc420088010, 0x1, 0x1, 0xc4201e6f08, 0x0, 0x0, 0xd)
    /usr/local/go/src/os/doc.go:28 +0x64 fp=0xc4201e6e10 sp=0xc4201e6db8
os/exec.(*Cmd).Start(0xc42000c2c0, 0x4, 0x0)
    /usr/local/go/src/os/exec/exec.go:358 +0x3c9 fp=0xc4201e6f60 sp=0xc4201e6e10
main.run(0xc420072060)
    /home/reezer/test.go:12 +0x68 fp=0xc4201e6fa8 sp=0xc4201e6f60
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc4201e6fb0 sp=0xc4201e6fa8
created by main.main
    /home/reezer/test.go:31 +0x24d

goroutine 1 [runnable]:
main.main()
    /home/reezer/test.go:30 +0x222

goroutine 643497 [runnable]:
sync.(*RWMutex).RUnlock(0x550950)
    /usr/local/go/src/sync/rwmutex.go:55
syscall.Environ(0xc4200fc000, 0x15, 0x15)
    /usr/local/go/src/syscall/env_unix.go:148 +0x210
os.Environ(0x0, 0x0, 0x0)
    /usr/local/go/src/os/env.go:116 +0x22
os/exec.(*Cmd).envv(0xc420076160, 0xc420126080, 0x0, 0x1)
    /usr/local/go/src/os/exec/exec.go:171 +0x38
os/exec.(*Cmd).Start(0xc420076160, 0x4, 0x0)
    /usr/local/go/src/os/exec/exec.go:356 +0x2d9
main.run(0xc420072060)
    /home/reezer/test.go:12 +0x68
created by main.main
    /home/reezer/test.go:31 +0x24d

goroutine 643540 [runnable]:
syscall.Syscall(0x3, 0x1c, 0xc4201fabc8, 0x8, 0x0, 0x8, 0x0)
    /usr/local/go/src/syscall/asm_freebsd_amd64.s:21 +0x5
syscall.readlen(0x1c, 0xc4201fabc8, 0x8, 0x2, 0xc42010e000, 0x16)
    /usr/local/go/src/syscall/zsyscall_freebsd_amd64.go:1303 +0x56
syscall.forkExec(0xc4201462f0, 0xd, 0xc4200740c0, 0x1, 0x1, 0xc4201fad60, 0xc4200ea020, 0x40e996, 0xc4201facc8)
    /usr/local/go/src/syscall/exec_unix.go:202 +0x39f
syscall.StartProcess(0xc4201462f0, 0xd, 0xc4200740c0, 0x1, 0x1, 0xc4201fad60, 0x2, 0x4, 0x44ab20, 0xc4200f0140)
    /usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc4201462f0, 0xd, 0xc4200740c0, 0x1, 0x1, 0xc4201faf08, 0xc420076420, 0x15, 0x15)
    /usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc4201462f0, 0xd, 0xc4200740c0, 0x1, 0x1, 0xc4201faf08, 0x0, 0x0, 0xd)
    /usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc42000c580, 0x4, 0x0)
    /usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc420072060)
    /home/reezer/test.go:12 +0x68
created by main.main
    /home/reezer/test.go:31 +0x24d

goroutine 643472 [runnable]:
syscall.Syscall6(0x214, 0x0, 0xeb54, 0x0, 0x18, 0x0, 0x0, 0xeb54, 0x0, 0x0)
    /usr/local/go/src/syscall/asm_freebsd_amd64.s:44 +0x5
os.(*Process).blockUntilWaitable(0xc420166090, 0xc420014060, 0xc420014060, 0x4)
    /usr/local/go/src/os/wait_wait6.go:29 +0x6b
os.(*Process).wait(0xc420166090, 0x0, 0xc4201ea160, 0x0)
    /usr/local/go/src/os/exec_unix.go:22 +0xab
os.(*Process).Wait(0xc420166090, 0xc4201e8f08, 0x0, 0x0)
    /usr/local/go/src/os/doc.go:49 +0x2b
os/exec.(*Cmd).Wait(0xc4201ea160, 0x0, 0x0)
    /usr/local/go/src/os/exec/exec.go:434 +0x6d
main.run(0xc420072060)
    /home/reezer/test.go:13 +0x76
created by main.main
    /home/reezer/test.go:31 +0x24d

goroutine 643498 [semacquire]:
sync.runtime_Semacquire(0x550934)
    /usr/local/go/src/runtime/sema.go:47 +0x30
sync.(*Mutex).Lock(0x550930)
    /usr/local/go/src/sync/mutex.go:85 +0xd0
sync.(*RWMutex).Lock(0x550930)
    /usr/local/go/src/sync/rwmutex.go:86 +0x31
syscall.forkExec(0xc420084270, 0xd, 0xc420146000, 0x1, 0x1, 0xc4201f7d60, 0xc420080030, 0x40e996, 0xc4201f7cc8)
    /usr/local/go/src/syscall/exec_unix.go:185 +0x25e
syscall.StartProcess(0xc420084270, 0xd, 0xc420146000, 0x1, 0x1, 0xc4201f7d60, 0x2, 0x4, 0x44ab20, 0xc420108100)
    /usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc420084270, 0xd, 0xc420146000, 0x1, 0x1, 0xc4201f7f08, 0xc420132000, 0x15, 0x15)
    /usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc420084270, 0xd, 0xc420146000, 0x1, 0x1, 0xc4201f7f08, 0x0, 0x0, 0xd)
    /usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc420044420, 0x4, 0x0)
    /usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc420072060)
    /home/reezer/test.go:12 +0x68
created by main.main
    /home/reezer/test.go:31 +0x24d

goroutine 643440 [runnable]:
syscall.Syscall(0x3, 0xc, 0xc4201e7bc8, 0x8, 0x0, 0x8, 0x0)
    /usr/local/go/src/syscall/asm_freebsd_amd64.s:21 +0x5
syscall.readlen(0xc, 0xc4201e7bc8, 0x8, 0x2, 0xc420106160, 0x16)
    /usr/local/go/src/syscall/zsyscall_freebsd_amd64.go:1303 +0x56
syscall.forkExec(0xc42007e170, 0xd, 0xc42007e000, 0x1, 0x1, 0xc4201e7d60, 0xc420080018, 0x40e996, 0xc4201e7cc8)
    /usr/local/go/src/syscall/exec_unix.go:202 +0x39f
syscall.StartProcess(0xc42007e170, 0xd, 0xc42007e000, 0x1, 0x1, 0xc4201e7d60, 0x2, 0x4, 0x44ab20, 0xc4200f0140)
    /usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc42007e170, 0xd, 0xc42007e000, 0x1, 0x1, 0xc4201e7f08, 0xc420132420, 0x15, 0x15)
    /usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc42007e170, 0xd, 0xc42007e000, 0x1, 0x1, 0xc4201e7f08, 0x0, 0x0, 0xd)
    /usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc42000c420, 0x4, 0x0)
    /usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc420072060)
    /home/reezer/test.go:12 +0x68
created by main.main
    /home/reezer/test.go:31 +0x24d

goroutine 643499 [runnable]:
syscall.fcntl(0x4, 0x2, 0x1, 0x0, 0x0, 0xc4c0000000)
    /usr/local/go/src/syscall/zsyscall_freebsd_amd64.go:255
syscall.forkExecPipe(0xc4201fbc18, 0x2, 0x2, 0xc42018e8f0, 0x16)
    /usr/local/go/src/syscall/exec_bsd.go:264 +0x87
syscall.forkExec(0xc4201460c0, 0xd, 0xc420146010, 0x1, 0x1, 0xc4201fbd60, 0xc420148040, 0x40e996, 0xc4201fbcc8)
    /usr/local/go/src/syscall/exec_unix.go:188 +0x281
syscall.StartProcess(0xc4201460c0, 0xd, 0xc420146010, 0x1, 0x1, 0xc4201fbd60, 0x2, 0x4, 0x44ab20, 0xc4200f00c0)
    /usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc4201460c0, 0xd, 0xc420146010, 0x1, 0x1, 0xc4201fbf08, 0xc4200446e0, 0x15, 0x15)
    /usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc4201460c0, 0xd, 0xc420146010, 0x1, 0x1, 0xc4201fbf08, 0x0, 0x0, 0xd)
    /usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc420044580, 0x4, 0x0)
    /usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc420072060)
    /home/reezer/test.go:12 +0x68
created by main.main
    /home/reezer/test.go:31 +0x24d
ianlancetaylor commented 7 years ago

@reezer Which version of Go?

reezer commented 7 years ago

go1.7.1 from the FreeBSD package

bradfitz commented 7 years ago

Go 1.7.1 has two notable bugs you could be hitting. See https://github.com/golang/go/issues?q=milestone%3AGo1.7.3

Can you try Go 1.7.3 or tip instead?

reezer commented 7 years ago

I will still test it longer, but so far it appears to be fixed.

Again, I am sorry for the noise. It was a test system I quickly set up and should have taken greater care to make sure it runs the latest version.

reezer commented 7 years ago

Okay, it takes a while on go1.7.3 freebsd/amd64 (FreeBSD 11.0), but I again receive one of the original errors. For people who want to try to reproduce it. GOGC=5 makes a big difference in the time it takes until the error appears.

cat out.log
Starting a bunch of goroutines...
GOMAXPROCS: 8
fatal error: workbuf is not empty
fatal error: workbuf is not empty
fatal error: workbuf is not empty
fatal error: workbuf is not empty
fatal error: workbuf is not empty

runtime stack:
runtime.throw(0x4ca671, 0x14)
    /usr/local/go/src/runtime/panic.go:566 +0x95
runtime.(*workbuf).checkempty(0x800585000)
    /usr/local/go/src/runtime/mgcwork.go:301 +0x4e
runtime.getempty(0x80052a260)
    /usr/local/go/src/runtime/mgcwork.go:313 +0x96
runtime.handoff(0x800585000, 0xc41fff49b0)
    /usr/local/go/src/runtime/mgcwork.go:419 +0x26
runtime.(*gcWork).balance(0xc420019c28)
    /usr/local/go/src/runtime/mgcwork.go:263 +0x54
runtime.gcDrainN(0xc420019c28, 0x10000, 0xc4fffffffe)
    /usr/local/go/src/runtime/mgcmark.go:1042 +0x195
runtime.gcAssistAlloc.func1()
    /usr/local/go/src/runtime/mgcmark.go:462 +0xe5
runtime.systemstack(0xc42017c380)
    /usr/local/go/src/runtime/asm_amd64.s:298 +0x79
runtime.mstart()
    /usr/local/go/src/runtime/proc.go:1079

goroutine 681244 [running]:
runtime.systemstack_switch()
    /usr/local/go/src/runtime/asm_amd64.s:252 fp=0xc42014b8b0 sp=0xc42014b8a8
runtime.gcAssistAlloc(0xc420129520)
    /usr/local/go/src/runtime/mgcmark.go:499 +0x194 fp=0xc42014b938 sp=0xc42014b8b0
runtime.mallocgc(0x7, 0x4a80a0, 0x1, 0xc42000e220)
    /usr/local/go/src/runtime/malloc.go:563 +0x9a4 fp=0xc42014b9d8 sp=0xc42014b938
runtime.makeslice(0x4a80a0, 0x7, 0x7, 0xc42000e220, 0x16, 0x40e996)
    /usr/local/go/src/runtime/slice.go:57 +0x7b fp=0xc42014ba30 sp=0xc42014b9d8
syscall.ByteSliceFromString(0xc42000a010, 0x6, 0xc4200a01a0, 0xc42000e220, 0x16, 0x0, 0x0)
    /usr/local/go/src/syscall/syscall.go:53 +0x64 fp=0xc42014ba90 sp=0xc42014ba30
syscall.BytePtrFromString(0xc42000a010, 0x6, 0xc42000e220, 0x0, 0x0)
    /usr/local/go/src/syscall/syscall.go:69 +0x39 fp=0xc42014bad8 sp=0xc42014ba90
syscall.SlicePtrFromStrings(0xc42012c300, 0x18, 0x18, 0xc42013c050, 0x2, 0x2, 0x0, 0x0)
    /usr/local/go/src/syscall/exec_unix.go:87 +0x9a fp=0xc42014bb48 sp=0xc42014bad8
syscall.forkExec(0xc4200ec150, 0xd, 0xc4200ec050, 0x1, 0x1, 0xc42014bd60, 0xc42010a008, 0x40e996, 0xc42014bcc8)
    /usr/local/go/src/syscall/exec_unix.go:158 +0x16e fp=0xc42014bc78 sp=0xc42014bb48
syscall.StartProcess(0xc4200ec150, 0xd, 0xc4200ec050, 0x1, 0x1, 0xc42014bd60, 0x2, 0x4, 0x44ab30, 0xc4201ce280)
    /usr/local/go/src/syscall/exec_unix.go:240 +0x64 fp=0xc42014bcd0 sp=0xc42014bc78
os.startProcess(0xc4200ec150, 0xd, 0xc4200ec050, 0x1, 0x1, 0xc42014bf08, 0xc42012c300, 0x18, 0x18)
    /usr/local/go/src/os/exec_posix.go:45 +0x1a3 fp=0xc42014bdb8 sp=0xc42014bcd0
os.StartProcess(0xc4200ec150, 0xd, 0xc4200ec050, 0x1, 0x1, 0xc42014bf08, 0x0, 0x0, 0xd)
    /usr/local/go/src/os/doc.go:28 +0x64 fp=0xc42014be10 sp=0xc42014bdb8
os/exec.(*Cmd).Start(0xc4201962c0, 0x4, 0x0)
    /usr/local/go/src/os/exec/exec.go:358 +0x3c9 fp=0xc42014bf60 sp=0xc42014be10
main.run(0xc420072060)
    /home/reezer/test.go:12 +0x68 fp=0xc42014bfa8 sp=0xc42014bf60
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc42014bfb0 sp=0xc42014bfa8
created by main.main
    /home/reezer/test.go:31 +0x24d

goroutine 1 [chan receive]:
main.main()
    /home/reezer/test.go:30 +0x222

goroutine 681182 [semacquire]:
syscall.forkAndExecInChild(0xc42012e150, 0xc42012e160, 0x2, 0x2, 0xc420194000, 0x19, 0x19, 0x0, 0x0, 0xc4201c8d60, ...)
    /usr/local/go/src/syscall/exec_bsd.go:54 +0x5e
syscall.forkExec(0xc42012e0c0, 0xd, 0xc42013c100, 0x1, 0x1, 0xc4201c8d60, 0xc42013a010, 0x40e996, 0xc4201c8cc8)
    /usr/local/go/src/syscall/exec_unix.go:193 +0x334
syscall.StartProcess(0xc42012e0c0, 0xd, 0xc42013c100, 0x1, 0x1, 0xc4201c8d60, 0x2, 0x4, 0x44ab30, 0xc420134000)
    /usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc42012e0c0, 0xd, 0xc42013c100, 0x1, 0x1, 0xc4201c8f08, 0xc420076180, 0x18, 0x18)
    /usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc42012e0c0, 0xd, 0xc42013c100, 0x1, 0x1, 0xc4201c8f08, 0x0, 0x0, 0xd)
    /usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc4201b2420, 0x4, 0x0)
    /usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc420072060)
    /home/reezer/test.go:12 +0x68
created by main.main
    /home/reezer/test.go:31 +0x24d

goroutine 681184 [semacquire]:
os/exec.(*Cmd).Start(0xc4201d2000, 0x4, 0x0)
    /usr/local/go/src/os/exec/exec.go:367 +0x466
main.run(0xc420072060)
    /home/reezer/test.go:12 +0x68
created by main.main
    /home/reezer/test.go:31 +0x24d

goroutine 681213 [semacquire]:
os.(*Process).wait(0xc42010e150, 0x0, 0xc4201b22c0, 0x0)
    /usr/local/go/src/os/exec_unix.go:48 +0x17a
os.(*Process).Wait(0xc42010e150, 0xc4201adf08, 0x0, 0x0)
    /usr/local/go/src/os/doc.go:49 +0x2b
os/exec.(*Cmd).Wait(0xc4201b22c0, 0x0, 0x0)
    /usr/local/go/src/os/exec/exec.go:434 +0x6d
main.run(0xc420072060)
    /home/reezer/test.go:13 +0x76
created by main.main
    /home/reezer/test.go:31 +0x24d

goroutine 681183 [semacquire]:
os.newProcess(0x14bf2, 0x0, 0xc4200ec010)
    /usr/local/go/src/os/exec.go:23 +0x31
os.startProcess(0xc42000a4d0, 0xd, 0xc4200ec010, 0x1, 0x1, 0xc4201abf08, 0xc42000c780, 0x18, 0x18)
    /usr/local/go/src/os/exec_posix.go:49 +0x2c0
os.StartProcess(0xc42000a4d0, 0xd, 0xc4200ec010, 0x1, 0x1, 0xc4201abf08, 0x0, 0x0, 0xd)
    /usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc420196160, 0x4, 0x0)
    /usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc420072060)
    /home/reezer/test.go:12 +0x68
created by main.main
    /home/reezer/test.go:31 +0x24d

goroutine 681161 [running]:
    goroutine running on other thread; stack unavailable
created by main.main
    /home/reezer/test.go:31 +0x24d

goroutine 681214 [semacquire]:
syscall.ByteSliceFromString(0xc42000a010, 0x6, 0xc4200bc080, 0x16, 0x16, 0x0, 0x0)
    /usr/local/go/src/syscall/syscall.go:53 +0x64
syscall.BytePtrFromString(0xc42000a010, 0x6, 0xc4200bc080, 0x0, 0x0)
    /usr/local/go/src/syscall/syscall.go:69 +0x39
syscall.SlicePtrFromStrings(0xc420180180, 0x18, 0x18, 0xc420198290, 0x2, 0x2, 0x0, 0x0)
    /usr/local/go/src/syscall/exec_unix.go:87 +0x9a
syscall.forkExec(0xc4201981e0, 0xd, 0xc420198130, 0x1, 0x1, 0xc4201c7d60, 0xc4201b8068, 0x40e996, 0xc4201c7cc8)
    /usr/local/go/src/syscall/exec_unix.go:158 +0x16e
syscall.StartProcess(0xc4201981e0, 0xd, 0xc420198130, 0x1, 0x1, 0xc4201c7d60, 0x2, 0x4, 0x44ab30, 0xc42004e4c0)
    /usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc4201981e0, 0xd, 0xc420198130, 0x1, 0x1, 0xc4201c7f08, 0xc420180180, 0x18, 0x18)
    /usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc4201981e0, 0xd, 0xc420198130, 0x1, 0x1, 0xc4201c7f08, 0x0, 0x0, 0xd)
    /usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc4201b2160, 0x4, 0x0)
    /usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc420072060)
    /home/reezer/test.go:12 +0x68
created by main.main
    /home/reezer/test.go:31 +0x24d

goroutine 681162 [running]:
    goroutine running on other thread; stack unavailable
created by main.main
    /home/reezer/test.go:31 +0x24d

runtime stack:
runtime.throw(0x4ca671, 0x14)
    /usr/local/go/src/runtime/panic.go:566 +0x95
runtime.(*workbuf).checkempty(0x800585000)
    /usr/local/go/src/runtime/mgcwork.go:301 +0x4e
runtime.getempty(0x0)
    /usr/local/go/src/runtime/mgcwork.go:313 +0x96
runtime.(*gcWork).init(0xc420017228)
    /usr/local/go/src/runtime/mgcwork.go:91 +0x54
runtime.(*gcWork).put(0xc420017228, 0xc42017d380)
    /usr/local/go/src/runtime/mgcwork.go:102 +0xb9
runtime.greyobject(0xc42017d380, 0x0, 0x0, 0xc41fff4163, 0xc400000000, 0x800765860, 0xc420017228, 0xc)
    /usr/local/go/src/runtime/mgcmark.go:1248 +0x12d
runtime.shade(0xc42017d380)
    /usr/local/go/src/runtime/mgcmark.go:1185 +0xb3
runtime.gcmarkwb_m(0xc4200944a0, 0xc42017d380)
    /usr/local/go/src/runtime/mbarrier.go:105 +0x5b
runtime.writebarrierptr_nostore1.func1()
    /usr/local/go/src/runtime/mbarrier.go:131 +0x64
runtime.systemstack(0xc42003de70)
    /usr/local/go/src/runtime/asm_amd64.s:314 +0xab
runtime.writebarrierptr_nostore1(0xc4200944a0, 0xc42017d380)
    /usr/local/go/src/runtime/mbarrier.go:132 +0xb6
runtime.writebarrierptr(0xc4200944a0, 0xc42017d380)
    /usr/local/go/src/runtime/mbarrier.go:154 +0x4d
runtime.execute(0xc42017d380, 0x0)
    /usr/local/go/src/runtime/proc.go:1815 +0x160
runtime.schedule()
    /usr/local/go/src/runtime/proc.go:2137 +0x127
runtime.park_m(0xc420129040)
    /usr/local/go/src/runtime/proc.go:2183 +0x123
runtime.mcall(0xc4200001a0)
    /usr/local/go/src/runtime/asm_amd64.s:240 +0x5b

runtime stack:
runtime.throw(0x4ca671, 0x14)
    /usr/local/go/src/runtime/panic.go:566 +0x95
runtime.(*workbuf).checkempty(0x800585000)
    /usr/local/go/src/runtime/mgcwork.go:301 +0x4e
runtime.getempty(0x0)
    /usr/local/go/src/runtime/mgcwork.go:313 +0x96
runtime.(*gcWork).init(0xc420018728)
    /usr/local/go/src/runtime/mgcwork.go:91 +0x54
runtime.(*gcWork).tryGet(0xc420018728, 0x5364d0)
    /usr/local/go/src/runtime/mgcwork.go:144 +0xd9
runtime.gcDrainN(0xc420018728, 0x10000, 0xfffffffd)
    /usr/local/go/src/runtime/mgcmark.go:1052 +0x145
runtime.gcAssistAlloc.func1()
    /usr/local/go/src/runtime/mgcmark.go:462 +0xe5
runtime.systemstack(0xc420018a00)
    /usr/local/go/src/runtime/asm_amd64.s:298 +0x79
runtime.mstart()
    /usr/local/go/src/runtime/proc.go:1079

runtime stack:
runtime.throw(0x4ca671, 0x14)
    /usr/local/go/src/runtime/panic.go:566 +0x95
runtime.(*workbuf).checkempty(0x800585000)
    /usr/local/go/src/runtime/mgcwork.go:301 +0x4e
runtime.getempty(0x0)
    /usr/local/go/src/runtime/mgcwork.go:313 +0x96
runtime.(*gcWork).init(0xc42001b228)
    /usr/local/go/src/runtime/mgcwork.go:91 +0x54
runtime.(*gcWork).put(0xc42001b228, 0xc420153860)
    /usr/local/go/src/runtime/mgcwork.go:102 +0xb9
runtime.greyobject(0xc420153860, 0x0, 0x0, 0xc41fff563c, 0xc400000000, 0x80052b2a0, 0xc42001b228, 0xf)
    /usr/local/go/src/runtime/mgcmark.go:1248 +0x12d
runtime.shade(0xc420153860)
    /usr/local/go/src/runtime/mgcmark.go:1185 +0xb3
runtime.gcmarkwb_m(0xc4200950a0, 0xc420153860)
    /usr/local/go/src/runtime/mbarrier.go:105 +0x5b
runtime.writebarrierptr_nostore1.func1()
    /usr/local/go/src/runtime/mbarrier.go:131 +0x64
runtime.systemstack(0xc4200ade70)
    /usr/local/go/src/runtime/asm_amd64.s:314 +0xab
runtime.writebarrierptr_nostore1(0xc4200950a0, 0xc420153860)
    /usr/local/go/src/runtime/mbarrier.go:132 +0xb6
runtime.writebarrierptr(0xc4200950a0, 0xc420153860)
    /usr/local/go/src/runtime/mbarrier.go:154 +0x4d
runtime.execute(0xc420153860, 0x0)
    /usr/local/go/src/runtime/proc.go:1815 +0x160
runtime.schedule()
    /usr/local/go/src/runtime/proc.go:2137 +0x127
runtime.park_m(0xc420001380)
    /usr/local/go/src/runtime/proc.go:2183 +0x123
runtime.mcall(0x0)
    /usr/local/go/src/runtime/asm_amd64.s:240 +0x5b

runtime stack:
runtime.throw(0x4ca671, 0x14)
    /usr/local/go/src/runtime/panic.go:566 +0x95
runtime.(*workbuf).checkempty(0x800585000)
    /usr/local/go/src/runtime/mgcwork.go:301 +0x4e
runtime.getempty(0x0)
    /usr/local/go/src/runtime/mgcwork.go:313 +0x96
runtime.(*gcWork).init(0xc42001c728)
    /usr/local/go/src/runtime/mgcwork.go:91 +0x54
runtime.(*gcWork).put(0xc42001c728, 0xc42017d040)
    /usr/local/go/src/runtime/mgcwork.go:102 +0xb9
runtime.greyobject(0xc42017d040, 0x0, 0x0, 0xc41fff417d, 0xc400000000, 0x800765860, 0xc42001c728, 0xa)
    /usr/local/go/src/runtime/mgcmark.go:1248 +0x12d
runtime.shade(0xc42017d040)
    /usr/local/go/src/runtime/mgcmark.go:1185 +0xb3
runtime.gcmarkwb_m(0xc42002c8a0, 0xc42017d040)
    /usr/local/go/src/runtime/mbarrier.go:105 +0x5b
runtime.writebarrierptr_nostore1.func1()
    /usr/local/go/src/runtime/mbarrier.go:131 +0x64
runtime.systemstack(0xc420039e70)
    /usr/local/go/src/runtime/asm_amd64.s:314 +0xab
runtime.writebarrierptr_nostore1(0xc42002c8a0, 0xc42017d040)
    /usr/local/go/src/runtime/mbarrier.go:132 +0xb6
runtime.writebarrierptr(0xc42002c8a0, 0xc42017d040)
    /usr/local/go/src/runtime/mbarrier.go:154 +0x4d
runtime.execute(0xc42017d040, 0x0)
    /usr/local/go/src/runtime/proc.go:1815 +0x160
runtime.schedule()
    /usr/local/go/src/runtime/proc.go:2137 +0x127
runtime.park_m(0xc420045040)
    /usr/local/go/src/runtime/proc.go:2183 +0x123
runtime.mcall(0x0)
    /usr/local/go/src/runtime/asm_amd64.s:240 +0x5b
stevenh commented 7 years ago

@reezer how long did it take to reproduce for you with GOGC=5?

reezer commented 7 years ago

It varied quite a bit, sometimes minutes, but up to about two hours. It does really needs the forking. What also speeds it up (very slightly) is using an absolute path for the binary. Go otherwise will iterate over all of the the PATH directories, trying to execute true, so you have a few unnecessary syscalls. Of course you should also make sure that you really have GOMAXPROCS set to something > 1.

See also the timings that derekmarcotte posted. Mine have always been higher, but not that much.

It looks that this bug doesn't appear in virtual environments. I agree that there is quite a bit of evidence that it's a bug somewhere outside of Go (ie. kernel). Another hint here is that it apparently takes longest the first time you run this. However take this statement with a grain of salt, cause I didn't really sample this enough yet.

stevenh commented 7 years ago

Thanks, reproduced now :)

stevenh commented 7 years ago

Here's a possibly interesting one:-


GOMAXPROCS: 24
fatal error: all goroutines are asleep - deadlock!

goroutine 1 [chan receive]:
main.main()
        /data/go/src/github.com/multiplay/go/apps/test/main.go:29 +0x222

goroutine 3560607 [semacquire]:
syscall.ByteSliceFromString(0x4c8a13, 0x9, 0x41aaf7, 0x800574199, 0x800574120, 0x4d, 0x4d)
        /usr/local/go/src/syscall/syscall.go:53 +0x64
syscall.BytePtrFromString(0x4c8a13, 0x9, 0x10, 0x4b0f60, 0x4aedc0)
        /usr/local/go/src/syscall/syscall.go:69 +0x39
syscall.Open(0x4c8a13, 0x9, 0x100001, 0xc400000000, 0x40d8de, 0xc420326000, 0x8)
        /usr/local/go/src/syscall/zsyscall_freebsd_amd64.go:820 +0x3f
os.OpenFile(0x4c8a13, 0x9, 0x1, 0x800000000, 0x0, 0x0, 0xc42029ee00)
        /usr/local/go/src/os/file_unix.go:97 +0xa7
os/exec.(*Cmd).writerDescriptor(0xc420258000, 0x0, 0x0, 0x1, 0xc42026a4d0, 0xc420347e48)
        /usr/local/go/src/os/exec/exec.go:233 +0x569
os/exec.(*Cmd).stdout(0xc420258000, 0x0, 0x0, 0x0)
        /usr/local/go/src/os/exec/exec.go:221 +0x3d
os/exec.(*Cmd).Start(0xc420258000, 0x4, 0x0)
        /usr/local/go/src/os/exec/exec.go:342 +0xfd
main.run(0xc42007e060)
        /data/go/src/github.com/multiplay/go/apps/test/main.go:11 +0x68
created by main.main
        /data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d

goroutine 3560699 [semacquire]:
os/exec.(*Cmd).Start(0xc42031a160, 0x4, 0x0)
        /usr/local/go/src/os/exec/exec.go:348 +0x7ac
main.run(0xc42007e060)
        /data/go/src/github.com/multiplay/go/apps/test/main.go:11 +0x68
created by main.main
        /data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d

goroutine 3560559 [semacquire]:
os.newProcess(0xfc37, 0x0, 0xc4202ce450)
        /usr/local/go/src/os/exec.go:23 +0x31
os.startProcess(0xc4202ce500, 0xd, 0xc4202ce450, 0x1, 0x1, 0xc42030cf08, 0xc4201843c0, 0x1d, 0x1d)
        /usr/local/go/src/os/exec_posix.go:49 +0x2c0
os.StartProcess(0xc4202ce500, 0xd, 0xc4202ce450, 0x1, 0x1, 0xc42030cf08, 0x0, 0x0, 0xd)
        /usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc420240000, 0x4, 0x0)
        /usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc42007e060)
        /data/go/src/github.com/multiplay/go/apps/test/main.go:11 +0x68
created by main.main
        /data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d

goroutine 3560649 [semacquire]:
os/exec.(*Cmd).writerDescriptor(0xc4202c4000, 0x0, 0x0, 0x1, 0xc4201188f0, 0xc42003b648)
        /usr/local/go/src/os/exec/exec.go:237 +0x665
os/exec.(*Cmd).stdout(0xc4202c4000, 0xc4202a6030, 0x0, 0x0)
        /usr/local/go/src/os/exec/exec.go:221 +0x3d
os/exec.(*Cmd).Start(0xc4202c4000, 0x4, 0x0)
        /usr/local/go/src/os/exec/exec.go:342 +0xfd
main.run(0xc42007e060)
        /data/go/src/github.com/multiplay/go/apps/test/main.go:11 +0x68
created by main.main
        /data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d

goroutine 3560772 [semacquire]:
syscall.ByteSliceFromString(0xc420012000, 0x2d, 0xc4201d0040, 0x11, 0x11, 0x0, 0x0)
        /usr/local/go/src/syscall/syscall.go:53 +0x64
syscall.BytePtrFromString(0xc420012000, 0x2d, 0xc4201d0040, 0x0, 0x0)
        /usr/local/go/src/syscall/syscall.go:69 +0x39
syscall.SlicePtrFromStrings(0xc4202565a0, 0x1d, 0x1d, 0xc4202e4390, 0x2, 0x2, 0x0, 0x0)
        /usr/local/go/src/syscall/exec_unix.go:87 +0x9a
syscall.forkExec(0xc4202e4300, 0xd, 0xc4202e40b0, 0x1, 0x1, 0xc4202a1d60, 0xc420278028, 0x40e996, 0xc4202a1cc8)
        /usr/local/go/src/syscall/exec_unix.go:158 +0x16e
syscall.StartProcess(0xc4202e4300, 0xd, 0xc4202e40b0, 0x1, 0x1, 0xc4202a1d60, 0x2, 0x4, 0x44ab30, 0xc42017a180)
        /usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc4202e4300, 0xd, 0xc4202e40b0, 0x1, 0x1, 0xc4202a1f08, 0xc4202565a0, 0x1d, 0x1d)
        /usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc4202e4300, 0xd, 0xc4202e40b0, 0x1, 0x1, 0xc4202a1f08, 0x0, 0x0, 0xd)
        /usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc42024a000, 0x4, 0x0)
        /usr/local/go/src/os/exec/exec.go:358 +0x3c9
main.run(0xc42007e060)
        /data/go/src/github.com/multiplay/go/apps/test/main.go:11 +0x68
created by main.main
        /data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d

goroutine 3560475 [semacquire]:
os.(*Process).wait(0xc42028c000, 0x0, 0xc42031a000, 0x0)
        /usr/local/go/src/os/exec_unix.go:48 +0x17a
os.(*Process).Wait(0xc42028c000, 0xc420310f08, 0x0, 0x0)
        /usr/local/go/src/os/doc.go:49 +0x2b
os/exec.(*Cmd).Wait(0xc42031a000, 0x0, 0x0)
        /usr/local/go/src/os/exec/exec.go:434 +0x6d
main.run(0xc42007e060)
        /data/go/src/github.com/multiplay/go/apps/test/main.go:12 +0x76
created by main.main
        /data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d

goroutine 3560701 [semacquire]:
syscall.ByteSliceFromString(0xc4203321e0, 0xa, 0x4bb0c0, 0xc420332101, 0x100000000000a, 0x38, 0xb0)
        /usr/local/go/src/syscall/syscall.go:53 +0x64
syscall.BytePtrFromString(0xc4203321e0, 0xa, 0x8005260d0, 0xc420159d40, 0xa)
        /usr/local/go/src/syscall/syscall.go:69 +0x39
syscall.Stat(0xc4203321e0, 0xa, 0xc420328038, 0xc4203321e0, 0xa)
        /usr/local/go/src/syscall/zsyscall_freebsd_amd64.go:1130 +0x3f
os.Stat(0xc4203321e0, 0xa, 0x2, 0x2, 0xc4203321e0, 0xa)
        /usr/local/go/src/os/file_unix.go:169 +0x7d
os/exec.findExecutable(0xc4203321e0, 0xa, 0x2, 0xc4203321e0)
        /usr/local/go/src/os/exec/lp_unix.go:20 +0x39
os/exec.LookPath(0x4c8239, 0x4, 0x4c8239, 0x4, 0x1, 0x0)
        /usr/local/go/src/os/exec/lp_unix.go:53 +0x188
os/exec.Command(0x4c8239, 0x4, 0x0, 0x0, 0x0, 0xc420232420)
        /usr/local/go/src/os/exec/exec.go:135 +0x1ff
main.run(0xc42007e060)
        /data/go/src/github.com/multiplay/go/apps/test/main.go:10 +0x55
created by main.main
        /data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d

goroutine 3560650 [semacquire]:
os.NewFile(0x12, 0x4c8a13, 0x9, 0x0)
        /usr/local/go/src/os/file_unix.go:56 +0x3f
os.OpenFile(0x4c8a13, 0x9, 0x1, 0xc400000000, 0xc420374d70, 0x40eb85, 0xc420254230)
        /usr/local/go/src/os/file_unix.go:123 +0xff
os/exec.(*Cmd).writerDescriptor(0xc420178000, 0x0, 0x0, 0xc420254238, 0xc420268020, 0xc420268020)
        /usr/local/go/src/os/exec/exec.go:233 +0x569
os/exec.(*Cmd).stderr(0xc420178000, 0xc420268020, 0x1, 0x1)
        /usr/local/go/src/os/exec/exec.go:228 +0x4e
os/exec.(*Cmd).Start(0xc420178000, 0x4, 0x0)
        /usr/local/go/src/os/exec/exec.go:342 +0xfd
main.run(0xc42007e060)
        /data/go/src/github.com/multiplay/go/apps/test/main.go:11 +0x68
created by main.main
        /data/go/src/github.com/multiplay/go/apps/test/main.go:30 +0x24d
     4006.16 real      3508.55 user     29207.58 sys```
stevenh commented 7 years ago

Here's the selection of errors it triggered:

fatal error: all goroutines are asleep - deadlock!

fatal error: workbuf is empty

runtime: nelems=256 nfree=233 nalloc=23 previous allocCount=18 nfreed=65531
fatal error: sweep increased allocation count

runtime: failed MSpanList_Remove 0x800698500 0x800b46d40 0x53adb0 0x53ada0
fatal error: MSpanList_Remove

This is results from 1.7.3 on FreeBSD 11.0

stevenh commented 7 years ago

Had it hang on a run last night here's a full trace with GOTRACEBACK=crash so includes the runtime routines. I also have the crash dump if its of use: https://gist.github.com/stevenh/4b6d5bde0b81503c1a4bca461a5322ae

derekmarcotte commented 7 years ago

Just wanted to give a quick status-update:

As has been mentioned the fork appears to be critical. In testing, I've removed all the Pipes plumbing in the parent, and exited the child immediately after fork (using SYS_EXIT), for my /doesnotexist example. This reliably crashes the process, with similar crashes. Removing the fork, Everything Is Fine. This implies that the exec syscall is not implicated.

What is surprising to me, is the nature of the call. i.e. - it is surprising that a child might affect a parent process, but if the syscall itself is to blame, well that is also surprising.

I can make this work available if anyone is interested.

I feel that my next step is to circumvent the regular runtime path that /doesnotexist takes, and write a second test using, say the RawSyscall interface alone.

stevenh commented 7 years ago

I'd add that it also seems to need GC as with GOGC=100 it always takes many hours here, with GOGC=2-5 then its usually between 20mins and 1 hour and I've never had it happen with GOGC=off.

I'm going to leave a GOGC=off test running over the weekend to try and confirm that.

derekmarcotte commented 7 years ago

Perhaps the next step is to stop garbage collection as part of ForkLock? Ima play in here for a bit.

stevenh commented 7 years ago

I thought the same so I'm running the forks with GC disabled using GOGC=off and then manually calling GC via debug.FreeOSMemory().

No results yet, but here's the code for reference.

This adds additional synchronisation due to the use of constant goroutines, which might also have an impact but we'll have to see. It will be an interesting data point either way.

package main

import (
        "fmt"
        "os/exec"
        "runtime"
        "runtime/debug"
        "time"
)

var (
        gcPeriod     = time.Second * 10
        forkRoutines = 16
)

func run(work, done chan struct{}) {
        for range work {
                cmd := exec.Command("/usr/bin/true")
                cmd.Start()
                cmd.Wait()

                done <- struct{}{}
        }
}

func main() {
        fmt.Printf("Starting %v forking goroutines...\n", forkRoutines)
        fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))

        done := make(chan struct{}, forkRoutines)
        work := make(chan struct{}, forkRoutines)

        for i := 0; i < forkRoutines; i++ {
                work <- struct{}{}
                go run(work, done)
        }

        for {
                start := time.Now()
                forks := 0
                active := forkRoutines
        forking:
                for range done {
                        forks++
                        if time.Since(start) > gcPeriod {
                                active--
                                if active == 0 {
                                        break forking
                                }
                        } else {
                                work <- struct{}{}
                        }
                }

                fmt.Println("forks:", forks)
                debug.FreeOSMemory()

                for i := 0; i < forkRoutines; i++ {
                        work <- struct{}{}
                }
        }
}
stevenh commented 7 years ago

Here's an alternative version which is as close to the known bad version as I think we can get without editing the runtime:

package main

import (
        "fmt"
        "os/exec"
        "runtime"
        "time"
)

var (
        gcPeriod     = time.Second * 10
        forkRoutines = 16
)

func run(done chan struct{}) {
        cmd := exec.Command("/usr/bin/true")
        cmd.Start()
        cmd.Wait()

        done <- struct{}{}
}

func main() {
        fmt.Printf("Starting %v forking goroutines...\n", forkRoutines)
        fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))

        done := make(chan struct{}, forkRoutines*2)

        for i := 0; i < forkRoutines; i++ {
                go run(done)
        }

        for {
                start := time.Now()
                active := forkRoutines
        forking:
                for range done {
                        if time.Since(start) > gcPeriod {
                                active--
                                if active == 0 {
                                        break forking
                                }
                        } else {
                                go run(done)
                        }
                }

                runtime.GC()

                for i := 0; i < forkRoutines; i++ {
                        go run(done)
                }
        }
}

I've been running these now here for a few hours with no errors and I usually see an error within 20m - 1 hour with GOGC=5

This seems to indicate the issue maybe centered around GC kicking in while there is a fork in progress.

stevenh commented 7 years ago

Both sets of code have now been running for 12 hours with no errors, so looking more like this is indeed to do with the GC interacting with active forks causing corruption.

bradfitz commented 7 years ago

@stevenh,

and then manually calling GC via debug.FreeOSMemory()

That doesn't do what you think it does. That's not how you manually call a GC. That would be runtime.GC(), but calling runtime.GC() can be super slow. (it does a full GC while blocking the process)

Just disable & reenable with:

   old := debug.SetGCPercent(0) // now disabled.
   // forklock..fork..exec...unlock
   debug.SetGCPercent(old) // now GC is re-enabled.
stevenh commented 7 years ago

I did actually check that and saw debug.FreeOSMemory() is freeOSMemory which in turn is:

func runtime_debug_freeOSMemory() {
        gcStart(gcForceBlockMode, false)
        systemstack(func() { mheap_.scavenge(-1, ^uint64(0), 0) })
}

Where as runtime.GC() is:

        gcStart(gcForceBlockMode, false)
}

So it looks like FreeOSMemory does do a full GC + trying to get the OS to cleanup. This is reinforced by the docs for FreeOSMemory:

FreeOSMemory forces a garbage collection followed by an attempt to return as much memory to the operating system as possible. (Even if this is not called, the runtime gradually returns memory to the operating system in a background task.)

I've also confirmed that memory usage is maintained which is not the case with just GOGC=off.

Am I really mistaken?

I could use

old := debug.SetGCPercent(0)
...
// Force GC
debug.SetGCPercent(old)
debug.SetGCPercent(0)

Its also worth noting that SetGCPercent directly calls runtime.GC() so would still result in a stop the world GC?

stevenh commented 7 years ago

Also worth noting @bradfitz that while the first used debug.FreeOSMemory() the second uses runtime.GC() as I was interested in the difference.

bradfitz commented 7 years ago

I guess I missed or don't understand why you want to force a GC. Or why you'd want to return memory to the operating system.

I thought the whole point was to stop the GC so it's not running during a fork. That's much lighter weight than running a GC.

stevenh commented 7 years ago

The idea of the tests is to confirm that we can fork without error if the GC was disabled.

As the original test used multiple routines doing the forking, to force the issue quicker, I wanted to maintain that. Given this it seemed easier to just do a stop the world GC than to try and coordinate ensuring that GC was stopped across all the routines as they fork.

The first version used debug.FreeOSMemory() in case the freeing the memory to the OS was a contributing factor to the issue.

The second version I switched to runtime.GC() to see if removing this fact and allowing background free made any difference.

I did consider using debug.SetGCPercent(X) but after looking at it and noticing it actually calls runtime.GC() anyway, I concluded it would be almost identical to calling runtime.GC() while all the forking routines where blocked which was as mentioned previously was easier to coordinate.

Both the test versions have now been running for 24hours so it seems highly likely that the GC while forking is the cause of this issue.

I'll leave them running for now, but I need a way to confirm for sure and also to narrow down the exact piece of code which is causing the issue.

Any ideas gratefully received.

ianlancetaylor commented 7 years ago

Given that we believe that this only occurs on FreeBSD, and only occurs when running on real hardware, I would guess that the problem is not the GC as such, but rather than fact that another thread in the parent process is doing a lot of reads and writes to memory pages that are also being read and written by the child process.

stevenh commented 7 years ago

Hmm why do you say it only occurs on real hardware @ianlancetaylor?

Reading back through the thread I see @kaey mention it telegraph crashes for him on vmware too, so interested to know if there's another piece of the puzzle there?

I'm wondering if PCID may be at play here...

minux commented 7 years ago

If it's PCID related, does "vm.pmap.pcid_enabled=0" help?

stevenh commented 7 years ago

That's exactly what I just tried unfortunately not :(

ianlancetaylor commented 7 years ago

@stevenh I said "real hardware" based on @reezer 's comment in https://github.com/golang/go/issues/15658#issuecomment-264165583. My apologies if I misunderstood.

stevenh commented 7 years ago

In an attempt to produce a minimal version I've tried creating a C version but currently no joy.

Capturing here for reference.

#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/resource.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <stdlib.h>
#include <errno.h>
#include <pthread.h>

#define MEMSIZE 10240
#define RANDMAX 4
#define THREADS 16

#define handle_error_en(en, msg) \
        do { errno = en; perror(msg); exit(EXIT_FAILURE); } while (0)

#define handle_error(msg) \
        do { perror(msg); exit(EXIT_FAILURE); } while (0)

struct thread_info {
        pthread_t thread_id;
        int thread_num;
};

void
check(int val, int *i, int *mem)
{
        int j;

        if (*i != val) {
                printf("i: %d\n", *i);
                exit(EXIT_FAILURE);
        }

        for (j = 0; j < MEMSIZE; j++) {
                if (mem[j] != val) {
                        printf("mem: %d = %d\n", j, mem[j]);
                        exit(EXIT_FAILURE);
                }
        }
}

void
randomise(int val, int *i, int *mem)
{
        struct timespec ts;

        ts.tv_sec = 0;
        srandom((unsigned long)val);

        for (; val < RANDMAX; val+=2) {
                *i = val;
                for (int j = 0; j < MEMSIZE; j++) {
                        mem[j] = val;
                }

                check(val, i, mem);
        }
}

void
dofork()
{
        pid_t pid;
        int i;
        int mem[MEMSIZE];

        i = 0;
        for (int j = 0; j < MEMSIZE; j++) {
                mem[j] = 0;
        }

        pid = fork();
        switch (pid) {
        case -1:
                handle_error("fork");
        case 0: {
                // child
                randomise(1, &i, mem);
                exit(execv("/usr/bin/true", NULL));
        }
        default: {
                int status;
                struct rusage ru;
                // parent
                randomise(2, &i, mem);

                // wait6 followed by wait4 as that's what golang does        
                wait6(P_PID, pid, 0, WNOWAIT|WEXITED, 0, 0);
                wait4(pid, &status, 0, &ru);
        }}
}

static void *
thread_start(void *arg)
{
        struct thread_info *tinfo = arg;
        char info[1024], *ret;

        sprintf(info, "thread %d\n", tinfo->thread_num);

        ret = strdup(info);
        if (ret == NULL) {
                handle_error("strdup");
        }

        while (1==1) {
                dofork();
        }

        return ret;
}

int
main(int argc, char**argv)
{
        int s, tnum;
        struct thread_info *tinfo;
        pthread_attr_t attr;
        void *res;

        s = pthread_attr_init(&attr);
        if (s != 0) {
                handle_error_en(s, "pthread_attr_init");
        }

        tinfo = calloc(THREADS, sizeof(struct thread_info));
        if (tinfo == NULL) {
                handle_error("calloc");
        }

        for (tnum = 0; tnum < THREADS; tnum++) {
                tinfo[tnum].thread_num = tnum + 1;
                s = pthread_create(&tinfo[tnum].thread_id, &attr, &thread_start, &tinfo[tnum]);
                if (s != 0) {
                        handle_error_en(s, "pthread_create");
                }
        }

        s = pthread_attr_destroy(&attr);
        if (s != 0) {
                handle_error_en(s, "pthread_attr_destroy");
        }

        for (tnum = 0; tnum < THREADS; tnum++) {
                s = pthread_join(tinfo[tnum].thread_id, &res);
                if (s != 0) {
                        handle_error_en(s, "pthread_join");
                }
                printf("joined with thread %d; returned value was %s\n", tinfo[tnum].thread_num, (char *)res);
                free(res);
        }

        free(tinfo);
        exit(EXIT_SUCCESS);
}
stevenh commented 7 years ago

I've been looking through the FreeBSD specific code in this area and I noticed that newosproc uses raw sigprocmask.

If its possible that newosproc can be called from multiple threads at the same time then its possible that signals which are meant to be blocked during thr_new could be re-enabled unexpectedly causing undefined behaviour.

This is is why man SIGPROCMASK(2) states: In threaded applications, pthread_sigmask(3) must be used instead of sigprocmask().

This may be unrelated but...

ianlancetaylor commented 7 years ago

I looked at the FreeBSD library sources, and it looks like pthread_sigmask just calls sigprocmask after updating some user space information. There is no separate pthread_sigmask system call. So I think the Go code is doing the right thing in making the sigprocmask system call.

stevenh commented 7 years ago

libthr maintains internal state to ensure that signals don't get unblocked when they shouldn't (keeps a blocked counter and only unblocks when the counter drops to zero). This seems a bit quirky to me at first glance so I'd have to spend more time looking to see if there is any other things it does to ensure correct handling of signals to threaded apps it provides.

It struck me that as go can be run different routines on a given kernel thread, then that when combined with the direct use of sigprocmask it may be possible for a signal blocked from one could then be unintentionally unblocked by another without additional protection in the runtime.

I'm not saying this is possible or that this could be the issue, but just highlighting something I noticed while investigating.

stevenh commented 7 years ago

Something else that I've just noticed is that when we get the failed MSpanList_Remove panic it seems to always due to span.list != list with list off by 10 every time, which doesn't seem random.

Does this ring any bells?

derekmarcotte commented 7 years ago

Some progress: I was able to add a temporary method to the runtime to test only the fork portion of the program. This work is available here:

https://github.com/derekmarcotte/go/commit/9a3667bc261085793f7386064699eb3f020d1a4b

Additionally, a test case that exercises that method is available here:

https://raw.githubusercontent.com/derekmarcotte/goruntime-testing/master/src/15658-forkonly.go

This test case has wedged once (with 3 threads in 100%, and the rest in idle), and crashed once, out of two attempts:

acquirep: p->m=842351068160(8) p->status=1 fatal error: acquirep: invalid p state

This is definitely some interaction with the fork system call. I'll next disable garbage collection and see where it goes. Will post results.

EDIT: Whoop, thought I based my branch off of master. Have rebased and will be testing with GOGC=5, and then GOGC=off. New links to come in a few days.

derekmarcotte commented 7 years ago

Just wanted to add, using the 15658-forkonly.go merged against master, of three tests:

I'll be posting updates to my branch later today.

Interesting to note: all of the panics now seem to be in acquirep.

stevenh commented 7 years ago

Interesting, do you have the full trace from the GOGC=off case?

I don't suppose you have the output from procstat -k -k do you?

Given how simple that is now, it really smells like this could either be signals or something more fundamental like _umtx_op interation.

derekmarcotte commented 7 years ago

@stevenh:

Trace: https://github.com/derekmarcotte/goruntime-testing/blob/master/logs/15658-forkonly-log-head-2

No procstat, but I can try again.

derekmarcotte commented 7 years ago

Here's the test function integrated with head: https://github.com/derekmarcotte/go/commit/1deb3ec41080ce69dc4e980716fcf07db34825ac

stevenh commented 7 years ago

Been running your new code with ktrace since Wed Dec 14 10:06:02 UTC 2016 no crashes yet :(

stevenh commented 7 years ago

Still no crashes with your test case under ktrace after 7 days @derekmarcotte :(

I'll leave it till tomorrow then try to see if trigger it without ktrace.

stevenh commented 7 years ago

Still unable to reproduce the issue after 8 days running with GOGC=OFF and no ktrace.

Are you sure that managed to reproduce the issue with GOGC=OFF @derekmarcotte?

derekmarcotte commented 7 years ago

@stevenh: FWIW, I have not been able to reproduce the crashes with GOGC=OFF, on an Intel Xeon (with ECC RAM), only on my AMD FX-8350. I'm going to (start to) verify GOGC=5 on the Xeon today. If you have an AMD (+FreeBSD 10.3 amd64) machine at your disposal, that would be interesting.

stevenh commented 7 years ago

I'm afraid I don't have access to any AMD HW. I'm currently running with GOGC=2 + ktrace on a E5-2640 since the 1st no crashes yet but it does seem to take quite a while with ktrace enabled.

derekmarcotte commented 7 years ago

Here's a GOGC=5, wedged, with with thread at 100%, and the rest at 0%:

procstat -k -k 62812
PID TID COMM TDNAME KSTACK
62812 100817 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_timedwait_sig+0x10 _sleep+0x238 umtxq_sleep+0x125 do_wait+0x387 umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb 62812 100928 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_wait+0x387 umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb 62812 100937 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_wait+0x387 umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb 62812 101090 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_wait+0x387 umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb 62812 102891 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_wait+0x387 umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb 62812 102896 15658-forkonly-+5b6 15658-forkonly-+
62812 102897 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_wait+0x387
umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb 62812 102898 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_wait+0x387 umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb 62812 102899 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_timedwait_sig+0x10 _sleep+0x238 umtxq_sleep+0x125 do_wait+0x387 umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb 62812 102900 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_wait+0x387 umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb 62812 103024 15658-forkonly-+5b6 15658-forkonly-+ mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_wait+0x387 umtx_op_wait_uint_private+0x83 amd64_syscall+0x357 Xfast_syscall+0xfb

I'm guessing they are blocked on the ForkLock, but I don't know how to interpret this output. I ran ktrace for a bit (~30s), and got the following (first number is counts):

1716125 CALL clock_gettime(0x4,0xc420063e88) 44278 CALL clock_gettime(0x4,0xc4201adee0) 2 CALL clock_gettime(0x4,0xc420039eb0) 1 CALL clock_gettime(0x4,0xc420039f10) 1 CALL clock_gettime(0,0xc420039ef0) 1760407 RET clock_gettime 0 858063 CALL sched_yield 858062 RET sched_yield 0 22139 CALL _umtx_op(0x4f62c8,UMTX_OP_WAIT_UINT_PRIVATE,0,0x18,0xc4201ade78) 1 CALL _umtx_op(0x4f62d8,UMTX_OP_WAIT_UINT_PRIVATE,0,0x18,0xc420039e48) 22140 RET _umtx_op -1 errno 60 Operation timed out
1 CALL nanosleep(0xc420039f18,0) 1 RET nanosleep 0https://github.com/golang/go/issues/17168

Interesting to note: I'm running ntp on this machine. I think there was another issue that depending on a MONOTONIC clock on FreeBSD. Are you running NTP?

For those that cannot reproduce, what if you run NTP?

stevenh commented 7 years ago

Interesting find, I am running NTP here.

It doesn't seem like this could be the cause of the corruption given the description but I may be wrong.

stevenh commented 7 years ago

I've been unable to get anything other than the single thread stuck @ 100% CPU while exercising https://github.com/derekmarcotte/go/commit/1deb3ec41080ce69dc4e980716fcf07db34825ac

I'm going to try running with 1.7.4 + https://github.com/golang/go/commit/360f2e43b78a3ea119ea8dce9649f7c1227d793b to see if this behaviour changes.

stevenh commented 7 years ago

Scratch that as https://github.com/golang/go/commit/360f2e43b78a3ea119ea8dce9649f7c1227d793b was already included in 1.7 by https://github.com/golang/go/commit/a2f37b7fe130e1a5da78b605f4191547e9868944