golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

runtime: some runtime throws print all goroutine stack dumps by default #46995

Open mvdan opened 3 years ago

mvdan commented 3 years ago

I'm going to be using go version devel go1.17-d19a53338f Wed Jun 30 02:00:49 2021 +0000 linux/amd64 as an example here, but this also reproduces on 1.16.5.

From https://golang.org/pkg/runtime/, emphasis mine:

The GOTRACEBACK variable controls the amount of output generated when a Go program fails due to an unrecovered panic or an unexpected runtime condition. By default, a failure prints a stack trace for the current goroutine, eliding functions internal to the run-time system, and then exits with exit code 2.

So it seems like, by default, a panic should only print the stack trace for the goroutine that panicked, and nothing else.
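
For reference, the documented GOTRACEBACK levels are none, single (the default), all, system, and crash, and the level can also be raised from code via runtime/debug.SetTraceback. A minimal sketch of deliberately opting into full goroutine dumps on every crash, assuming that's actually what you want, would be:

package main

import "runtime/debug"

func main() {
    // Equivalent to running with GOTRACEBACK=all: an unrecovered panic now
    // prints every goroutine, not just the one that crashed. SetTraceback
    // cannot lower the level below the GOTRACEBACK environment variable;
    // it can only raise it.
    debug.SetTraceback("all")

    panic("boom")
}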

However, that doesn't seem correct in some cases: https://play.golang.org/p/zWw2TUS4YzZ
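
The playground source isn't inlined here, but a minimal sketch of the kind of program that hits this (an approximation, not necessarily the exact code behind the link: two goroutines writing to the same map with no synchronization, plus a couple of sleeping goroutines) looks roughly like:

package main

import "time"

func main() {
    m := map[int64]int{}

    go func() { time.Sleep(10 * time.Second) }()
    go func() { time.Sleep(10 * time.Second) }()

    // Two unsynchronized writers to the same map; the runtime's detector
    // usually turns this into "fatal error: concurrent map writes".
    go func() {
        for {
            m[1] = 1
        }
    }()
    go func() {
        for {
            m[2] = 2
        }
    }()

    time.Sleep(time.Second)
}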

$ go run -trimpath f.go
fatal error: concurrent map writes

goroutine 7 [running]:
runtime.throw({0x467980, 0x0})
    runtime/panic.go:1198 +0x71 fp=0xc00003cf78 sp=0xc00003cf48 pc=0x42cb91
runtime.mapassign_fast64(0x45f100, 0xc000072060, 0x2)
    runtime/map_fast64.go:176 +0x2b4 fp=0xc00003cfb0 sp=0xc00003cf78 pc=0x40dcb4
main.main.func3()
    ./f.go:12 +0x2e fp=0xc00003cfe0 sp=0xc00003cfb0 pc=0x45814e
runtime.goexit()
    runtime/asm_amd64.s:1581 +0x1 fp=0xc00003cfe8 sp=0xc00003cfe0 pc=0x453dc1
created by main.main
    ./f.go:10 +0x85

goroutine 1 [sleep]:
time.Sleep(0x3b9aca00)
    runtime/time.go:193 +0x12e
main.main()
    ./f.go:21 +0xd1

goroutine 5 [sleep]:
time.Sleep(0xdf8475800)
    runtime/time.go:193 +0x12e
main.main.func1()
    ./f.go:6 +0x25
created by main.main
    ./f.go:6 +0x26

goroutine 6 [sleep]:
time.Sleep(0xdf8475800)
    runtime/time.go:193 +0x12e
main.main.func2()
    ./f.go:7 +0x25
created by main.main
    ./f.go:7 +0x34

goroutine 8 [runnable]:
main.main.func4()
    ./f.go:17 +0x2e
created by main.main
    ./f.go:15 +0xc7
exit status 2

As per the docs, I would expect to just see one goroutine stack dump - the "running" one that hit the panic via runtime.mapassign_fast64, but I get all of them.

This seems to be on purpose in the code:

https://github.com/golang/go/blob/d19a53338fa6272b4fe9c39d66812a79e1464cd2/src/runtime/panic.go#L1385-L1399

This is problematic for two reasons:

A: It's confusing. This had me puzzled when looking at a Go process's crash output alongside the runtime docs.

B: It can lead to large amounts of panic output, depending on what kind of panic one runs into. At work, someone was very surprised that a panic dumped ~50MiB of goroutine stacks, while most other panics dumped just one stack trace. That actually caused infrastructure issues: our log ingestion pipeline did not cope well with the sudden burst of output, and we had never run into the problem before because most panics don't behave that way.

I see three possible resolutions to this problem:

1) The code is working as intended. We should update the docs to clarify the nuance that the default does print all goroutine stack traces in some cases, and what those cases are.

2) We should amend the implementation to match the docs. Concurrent map writes should result in a panic that only prints one stack trace, i.e. the "running" goroutine from above.

3) Something in between, such as "print the stack traces for all goroutines that are involved in writes to the map in question". I can imagine that the current behavior exists for debugging purposes: if you get concurrent map writes, you might want to know what those two or more goroutines were doing, not just one of them. However, you presumably don't need all goroutines for that; just the ones involved in the race.

cc @aclements @randall77 @mknyszek @prattmic as per https://dev.golang.org/owners

mvdan commented 3 years ago

For the sake of completeness, here's an example of "only print the current goroutine by default" working as documented:

$ cat f.go
package main

import "time"

func main() {
    go func() { time.Sleep(time.Minute) }()
    go func() { time.Sleep(time.Minute) }()

    go func() {
        time.Sleep(time.Second)
        panic("foo")
    }()

    time.Sleep(time.Minute)
}
$ go run -trimpath f.go
panic: foo

goroutine 7 [running]:
main.main.func3()
    ./f.go:11 +0x31
created by main.main
    ./f.go:9 +0x45
exit status 2

randall77 commented 3 years ago

I think the distinction here is the difference between panic and throw. A panic prints only the goroutine in question. A throw prints all of them. Normally throw is only used when the runtime discovers a bug in itself, in which case printing everything is fine, in fact helpful, in that it could be useful for a bug report on the runtime itself. When the runtime discovers a bug in the user program, like the concurrent map write detector, we also use throw. Maybe we shouldn't, or maybe we should use a variant that prints only the current goroutine.
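
To make the difference concrete in user-visible terms (a rough illustration, not tied to this issue's examples): a panic can be intercepted by recover, whereas a throw such as the concurrent map write check cannot be recovered and always terminates the process.

package main

import "fmt"

func main() {
    defer func() {
        // This catches an ordinary panic. A runtime throw, e.g.
        // "fatal error: concurrent map writes", never reaches recover and
        // terminates the process regardless of deferred handlers.
        if r := recover(); r != nil {
            fmt.Println("recovered:", r)
        }
    }()
    panic("foo")
}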

mvdan commented 3 years ago

I see; I hadn't caught on to the distinction between panic and throw. Note that the docs simply talk about "an unrecovered panic or an unexpected runtime condition", which I would assume covers both. So if we want to keep that distinction, maybe that's also something to clarify in the godoc.

Regardless, I agree that concurrent map writes shouldn't dump all goroutines - they aren't a bug in the runtime package itself. I'd think that any such runtime failures due to bugs in the user's code should cause something akin to the current panic, not the current throw.

prattmic commented 3 years ago

This has come up in the past in the context of differentiating throws due to "user bugs" (unlock of unlocked mutex, concurrent map write, etc) and "runtime bugs" (most throws) when analyzing crashes. Perhaps we should have a userThrow for these fatal user bugs, which could also follow the documented GOTRACEBACK behavior more closely.

aclements commented 3 years ago

I agree that we should differentiate runtime-internal fatal errors from user-caused fatal errors, but isn't "concurrent map writes" actually one of the few cases where it is useful to see the other goroutines so you have a chance of catching the other goroutine that's writing to the map?

randall77 commented 3 years ago

I think it is unlikely to actually catch the particular write itself. By the time the runtime stops that goroutine it has finished the write and moved on. (write-write conflicts might actually catch both offenders, but for read-write conflicts we only catch the read side.) It's quite possible, however, to catch the other goroutine not too long after the write, which might still be useful (e.g. still in the same parent function, or it's doing a bunch of writes in a loop, etc.).
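
As a small illustration of that asymmetry (a sketch, not the detector internals): with a lone writer and a lone reader, it's the reading goroutine that notices the in-progress write and raises the throw.

package main

func main() {
    m := map[int]int{}

    go func() {
        for i := 0; ; i++ {
            m[0] = i // the lone writer
        }
    }()

    for {
        // The read side sees the write in progress and is the one that
        // throws: "fatal error: concurrent map read and map write".
        if m[0] < 0 {
            return
        }
    }
}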

mvdan commented 3 years ago

isn't "concurrent map writes" actually one of the few cases where it is useful to see the other goroutines so you have a chance of catching the other goroutine that's writing to the map?

Perhaps, but note that dumping all goroutines is an entirely different matter. See my mention above of a problem we ran into in production: all panics we had seen up to that point were under a kilobyte, as they only contained one stack trace. Then one day we hit a concurrent map write, and the tens of megabytes dumped all at once caused the logging system to get stuck.

You can say that's something we should fix, but it's also extremely surprising that a tiny minority of user-caused panics dump all goroutines :)

I wouldn't oppose if this map panic printed the stack traces of all goroutines involved in the data race, which I assume would almost never be more than a handful. If that's not possible, then I think it should just use the running goroutine.

sbunce commented 2 years ago

(user experience report)

Encountered this issue at my work yesterday. We had a service crash in production due to concurrent map writes. Our debugging was slowed down a bit because our process printed stack traces for hundreds of thousands of goroutines. Our log viewer ("kubectl logs") only displayed the tail end of the stack traces because of their length, which made it hard to understand the cause until we figured out how to retrieve all of the logs.