mvdan opened this issue 3 years ago
For the sake of completeness, here's an example of "only print the current goroutine by default" working as documented:
$ cat f.go
package main

import "time"

func main() {
	go func() { time.Sleep(time.Minute) }()
	go func() { time.Sleep(time.Minute) }()

	go func() {
		time.Sleep(time.Second)
		panic("foo")
	}()

	time.Sleep(time.Minute)
}
$ go run -trimpath f.go
panic: foo

goroutine 7 [running]:
main.main.func3()
	./f.go:11 +0x31
created by main.main
	./f.go:9 +0x45
exit status 2
I think the distinction here is the difference between panic and throw. A panic prints only the goroutine in question. A throw prints all of them.
Normally throw is only used when the runtime discovers a bug in itself, in which case printing everything is fine, in fact helpful, in that it could be useful for a bug report on the runtime itself.
When the runtime discovers a bug in the user program, like the concurrent map write detector, we also use throw. Maybe we shouldn't, or maybe we should use a variant that prints only the current goroutine.
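For contrast, here is a minimal sketch of my own (not from the thread) of the two paths. The panic is recoverable and, unrecovered, prints only the panicking goroutine by default; the throw, illustrated here with the "unlock of unlocked mutex" user bug mentioned later in the thread, is reported as a fatal error, cannot be recovered, and on the Go versions discussed in this issue dumps the stacks of all goroutines even with the default GOTRACEBACK setting:

package main

import (
	"fmt"
	"sync"
)

func main() {
	// A panic can be recovered, and an unrecovered one prints only the
	// panicking goroutine's stack by default.
	func() {
		defer func() {
			if r := recover(); r != nil {
				fmt.Println("recovered:", r)
			}
		}()
		panic("foo")
	}()

	// A throw is reported as "fatal error: ..." and cannot be recovered;
	// this deferred recover has no effect on it.
	defer func() { recover() }()

	// Unlocking an unlocked mutex is one of the "user bug" throws mentioned
	// later in this thread; it crashes the program with
	// "fatal error: sync: unlock of unlocked mutex".
	var mu sync.Mutex
	mu.Unlock()
}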
I see; I hadn't caught on to the distinction between panic and throw. Note that the docs simply talk about "an unrecovered panic or an unexpected runtime condition", which I would assume would cover both. So if we want to keep that distinction, maybe that's also something to clarify in the godoc.
Regardless, I agree that concurrent map writes shouldn't dump all goroutines - they aren't a bug in the runtime package itself. I'd think that any such runtime failures due to bugs in the user's code should cause something akin to the current panic, not the current throw.
This has come up in the past in the context of differentiating throws due to "user bugs" (unlock of unlocked mutex, concurrent map write, etc.) and "runtime bugs" (most throws) when analyzing crashes. Perhaps we should have a userThrow for these fatal user bugs, which could also more closely match the proper GOTRACEBACK setting.
I agree that we should differentiate runtime-internal fatal errors from user-caused fatal errors, but isn't "concurrent map writes" actually one of the few cases where it is useful to see the other goroutines so you have a chance of catching the other goroutine that's writing to the map?
I think it is unlikely to actually catch the particular write itself. By the time the runtime stops that goroutine it has finished the write and moved on. (write-write conflicts might actually catch both offenders, but for read-write conflicts we only catch the read side.) It's quite possible, however, to catch the other goroutine not too long after the write, which might still be useful (e.g. still in the same parent function, or it's doing a bunch of writes in a loop, etc.).
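A small sketch of my own, assuming the detection behavior described in the previous comment: with a single writer and a single reader, only the reading goroutine trips the check, so its stack is the one reported as the crashing goroutine.

package main

import "time"

func main() {
	m := map[int]int{}

	// Writer: mutates the map in a loop. With only one writer, the write
	// side never detects the conflict.
	go func() {
		for i := 0; ; i++ {
			m[i%8] = i
		}
	}()

	// Reader: per the comment above, this is the side the runtime catches,
	// typically crashing with "fatal error: concurrent map read and map write"
	// thrown from this goroutine.
	go func() {
		for {
			_ = m[0]
		}
	}()

	time.Sleep(time.Minute)
}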
isn't "concurrent map writes" actually one of the few cases where it is useful to see the other goroutines so you have a chance of catching the other goroutine that's writing to the map?
Perhaps, but note that dumping all goroutines is an entirely different matter. See my mention above of a problem we ran into in production: all panics we had seen up to that point were under a kilobyte, as they only contained one stack trace. Then one day we hit a concurrent map write, and the tens of megabytes it suddenly dumped caused the logging system to get stuck.
You can say that's something we should fix, but it's also extremely surprising that a tiny minority of user-caused panics dump all goroutines :)
I wouldn't object if this map panic printed the stack traces of all goroutines involved in the data race, which I assume would almost never be more than a handful. If that's not possible, then I think it should print just the running goroutine.
(user experience report)
Encountered this issue at my work yesterday. We had a service crash in production due to concurrent map writes. Our debugging got slowed down a bit because our process printed stack traces for hundreds of thousands of goroutines. Our log viewer "kubectl logs
I'm going to be using

go version devel go1.17-d19a53338f Wed Jun 30 02:00:49 2021 +0000 linux/amd64

as an example here, but this also reproduces on 1.16.5. From https://golang.org/pkg/runtime/, emphasis mine:

> The GOTRACEBACK variable controls the amount of output generated when a Go program fails due to an unrecovered panic or an unexpected runtime condition. By default, a failure prints a stack trace for the current goroutine, eliding functions internal to the run-time system, and then exits with exit code 2.
So it seems like, by default, a panic should only print the stack trace for the goroutine that panicked, and nothing else.
However, that doesn't seem correct in some cases: https://play.golang.org/p/zWw2TUS4YzZ
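The playground program isn't reproduced here, but something along the following lines (a hypothetical reconstruction, not necessarily the exact playground code) shows the behavior: a handful of idle bystander goroutines plus two goroutines writing to the same int64-keyed map, so the write goes through runtime.mapassign_fast64.

package main

import "time"

func main() {
	// Bystander goroutines that have nothing to do with the map; with the
	// current behavior, their stacks still end up in the crash output.
	for i := 0; i < 10; i++ {
		go func() { time.Sleep(time.Minute) }()
	}

	m := map[int64]int64{}

	// Concurrent writer.
	go func() {
		for i := int64(0); ; i++ {
			m[i] = i
		}
	}()

	// Second writer, on the main goroutine; the crash is reported as
	// "fatal error: concurrent map writes" from one of the writers.
	for i := int64(0); ; i++ {
		m[i] = i
	}
}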
As per the docs, I would expect to see just one goroutine stack dump: the "running" one that hit the panic via runtime.mapassign_fast64. But I get all of them. This seems to be on purpose in the code:
https://github.com/golang/go/blob/d19a53338fa6272b4fe9c39d66812a79e1464cd2/src/runtime/panic.go#L1385-L1399
This is problematic for two reasons:
A: It's confusing. This had me puzzled when looking at a Go process's crash output alongside the runtime docs.
B: It can lead to large amounts of panic output, depending on what kind of panic one runs into. At work, someone was very surprised that a panic dumped ~50MiB of goroutine stacks, while most other panics dumped just one goroutine stack. That actually caused infrastructure issues, because it appears our log ingestion didn't like that sudden dump of output very much, and we had never run into this before, given that most panics don't behave this way.
I see three possible resolutions to this problem:
1) The code is working as intended. We should update the docs to clarify the nuance that the default does print all goroutine stack traces in some cases, and what those cases are.
2) We should amend the implementation to match the docs. Concurrent map writes should result in a panic that only prints one stack trace, i.e. the "running" goroutine from above.
3) Something in between, such as "print the stack traces for all goroutines that are involved in writes to the map in question". I can imagine that the current behavior exists for debugging purposes: if you get concurrent map writes, you might want to know what those two or more goroutines were doing, not just one of them. However, you presumably don't need all goroutines for that; just the ones involved in the race.
cc @aclements @randall77 @mknyszek @prattmic as per https://dev.golang.org/owners