Open gopherbot opened 1 year ago
Found new dashboard test flakes for:
#!watchflakes
post <- repo == "exp" && section == "go test -short ./..." && test == ""
Runtime heap corruption in cmd/go
during go test
. Looks like a runtime bug to me?
@prattmic notes that this is a failure on Darwin and there's a fork/exec goroutine here. Sounds like an existing issue possibly (#59995)?
We should probably aggregate such issues into a single issue.
This failure mode looks substantially different from the other Darwin cluster described in https://github.com/golang/go/issues/54461#issuecomment-1591922415 (which includes #59995).
That cluster of failures is characterized by a goroutine stuck copying a pipe connected to a dead child process: either stdout
, stderr
, or (more rarely) the one created by syscall.forkExecPipe
, but there is typically no sign of memory corruption. In contrast, the log attached to this issue shows memory corruption, and has a syscall.Wait4
frame on the stack that suggests that the child process is probably still alive.
So I think this issue is probably independent of #59995.
Agreed. I changed my mind after posting the last message and looking at the other failures again, but then I had a meeting. :P
I was going back through issues assigned to me (oops, I have a backlog) and I noticed that #58715 is also memory corruption, but none of those failures include forkExec. Just mentioning it here since it might be related.
https://github.com/golang/go/issues/55167#issuecomment-1595404481 looks like plausibly the same kind of corruption; updated the watchflakes
pattern to catch it more specifically.
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && goarch == "amd64" && `fatal error: found pointer to free object`
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && goarch == "amd64" && (`fatal error: found pointer to free object` || `fatal error: found bad pointer in Go heap`)
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && goarch == "amd64" && (`fatal error: found pointer to free object` || `fatal error: found bad pointer in Go heap` || `fatal error: workbuf is not empty`)
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && goarch == "amd64" && (`fatal error: found pointer to free object` || `fatal error: found bad pointer in Go heap` || `fatal error: workbuf is not empty`)
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && goarch == "amd64" && (`fatal error: found pointer to free object` || `fatal error: found bad pointer in Go heap` || `fatal error: workbuf is not empty` || `schedule: holding locks` || `fatal error: bad summary data`)
https://github.com/golang/go/issues/62165 may be related.
Actually, it almost certainly is the same memory corruption. Folding it in.
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && goarch == "amd64" && (`fatal error: found pointer to free object` || `fatal error: found bad pointer in Go heap` || `fatal error: workbuf is not empty` || `schedule: holding locks` || `fatal error: bad summary data` || (`panic: runtime error: invalid memory address or nil pointer dereference` && log ~ `^\s*runtime\.sigpanic.*\n\s+.*\n\s*runtime\.sweepone`))
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && goarch == "amd64" && (
`fatal error: found pointer to free object` ||
`fatal error: found bad pointer in Go heap` ||
`fatal error: workbuf is not empty` ||
`schedule: holding locks` ||
`fatal error: bad summary data` ||
`fatal: morestack on gsignal` ||
(`panic: runtime error: invalid memory address or nil pointer dereference` && log ~ `^\s*runtime\.sigpanic.*\n\s+.*\n\s*runtime\.sweepone`))
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && goarch == "amd64" && (
`fatal error: found pointer to free object` ||
`fatal error: found bad pointer in Go heap` ||
`fatal error: workbuf is not empty` ||
`schedule: holding locks` ||
`fatal error: bad summary data` ||
`fatal: morestack on gsignal` ||
(`panic: runtime error: invalid memory address or nil pointer dereference` && log ~ `^\s*runtime\.sigpanic.*\n\s+.*\n\s*runtime\.sweepone`))
Possibly related, seen in https://storage.googleapis.com/go-build-log/08b2f1f7/darwin-amd64-13_5b47b2a9.log (from CL 538678):
fatal error: workbuf is empty
(notably the opposite of "not empty")
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && goarch == "amd64" && (
`fatal error: found pointer to free object` ||
`fatal error: found bad pointer in Go heap` ||
`fatal error: workbuf is not empty` ||
`schedule: holding locks` ||
`fatal error: bad summary data` ||
`fatal: morestack on gsignal` ||
(`panic: runtime error: invalid memory address or nil pointer dereference` && log ~ `^\s*runtime\.sigpanic.*\n\s+.*\n\s*runtime\.sweepone`))
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && goarch == "amd64" && (
log ~ `fatal error: found pointer to free object` ||
log ~ `fatal error: found bad pointer in Go heap` ||
log ~ `fatal error: workbuf is not empty` ||
log ~ `schedule: holding locks` ||
log ~ `fatal error: bad summary data` ||
log ~ `fatal: morestack on gsignal` ||
(log ~ `panic: runtime error: invalid memory address or nil pointer dereference` && log ~ `^\s*runtime\.sigpanic.*\n\s+.*\n\s*runtime\.sweepone`))
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && goarch == "amd64" && (
log ~ `fatal error: found pointer to free object` ||
log ~ `fatal error: found bad pointer in Go heap` ||
log ~ `fatal error: workbuf is not empty` ||
log ~ `schedule: holding locks` ||
log ~ `fatal error: bad summary data` ||
log ~ `fatal: morestack on gsignal` ||
(log ~ `panic: runtime error: invalid memory address or nil pointer dereference` && log ~ `^\s*runtime\.sigpanic.*\n\s+.*\n\s*runtime\.sweepone`))
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && goarch == "amd64" && (
log ~ `fatal error: found pointer to free object` ||
log ~ `fatal error: found bad pointer in Go heap` ||
log ~ `fatal error: workbuf is not empty` ||
log ~ `schedule: holding locks` ||
log ~ `fatal error: bad summary data` ||
log ~ `fatal: morestack on gsignal` ||
(log ~ `panic: runtime error: invalid memory address or nil pointer dereference` && log ~ `^\s*runtime\.sigpanic.*\n\s+.*\n\s*runtime\.sweepone`))
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && goarch == "amd64" && (
log ~ `fatal error: found pointer to free object` ||
log ~ `fatal error: found bad pointer in Go heap` ||
log ~ `fatal error: workbuf is not empty` ||
log ~ `schedule: holding locks` ||
log ~ `fatal error: bad summary data` ||
log ~ `fatal: morestack on gsignal` ||
(log ~ `panic: runtime error: invalid memory address or nil pointer dereference` && log ~ `^\s*runtime\.sigpanic.*\n\s+.*\n\s*runtime\.sweepone`) ||
(log ~ `SIGSEGV: segmentation violation` && log ~ `^\s*runtime\.spanOfUnchecked\(`))
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && goarch == "amd64" && (
log ~ `fatal error: found pointer to free object` ||
log ~ `fatal error: found bad pointer in Go heap` ||
log ~ `fatal error: workbuf is not empty` ||
log ~ `schedule: holding locks` ||
log ~ `fatal error: bad summary data` ||
log ~ `fatal: morestack on gsignal` ||
(log ~ `panic: runtime error: invalid memory address or nil pointer dereference` && log ~ `^\s*runtime\.sigpanic.*\n\s+.*\n\s*runtime\.sweepone`) ||
(log ~ `SIGSEGV: segmentation violation` && log ~ `^\s*runtime\.spanOfUnchecked\(`))
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && goarch == "amd64" && (
log ~ `fatal error: found pointer to free object` ||
log ~ `fatal error: found bad pointer in Go heap` ||
log ~ `fatal error: workbuf is not empty` ||
log ~ `schedule: holding locks` ||
log ~ `fatal error: bad summary data` ||
log ~ `fatal: morestack on gsignal` ||
(log ~ `panic: runtime error: invalid memory address or nil pointer dereference` && log ~ `^\s*runtime\.sigpanic.*\n\s+.*\n\s*runtime\.sweepone`) ||
(log ~ `SIGSEGV: segmentation violation` && log ~ `^\s*runtime\.spanOfUnchecked\(`))
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && goarch == "amd64" && (
log ~ `fatal error: found pointer to free object` ||
log ~ `fatal error: found bad pointer in Go heap` ||
log ~ `fatal error: workbuf is not empty` ||
log ~ `schedule: holding locks` ||
log ~ `fatal error: bad summary data` ||
log ~ `fatal: morestack on gsignal` ||
(log ~ `panic: runtime error: invalid memory address or nil pointer dereference` && log ~ `^\s*runtime\.sigpanic.*\n\s+.*\n\s*runtime\.sweepone`) ||
(log ~ `SIGSEGV: segmentation violation` && log ~ `^\s*runtime\.spanOfUnchecked\(`))
I wonder if this might be due to an uninitialized pointer field in a syscall result. That could explain the bad pointers, at least, but I guess it doesn't explain the other failure modes.
darwin/amd64
is supposed to be a first class port.
@golang/runtime, should this be marked as a release-blocker for Go 1.22?
Possibly related?
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && goarch == "amd64" && (
log ~ `fatal error: found pointer to free object` ||
log ~ `fatal error: found bad pointer in Go heap` ||
log ~ `fatal error: workbuf is not empty` ||
log ~ `schedule: holding locks` ||
log ~ `fatal error: bad summary data` ||
log ~ `fatal: morestack on gsignal` ||
log ~ `runtime\.\(\*mheap\)\.allocNeedsZero` ||
(log ~ `panic: runtime error: invalid memory address or nil pointer dereference` && log ~ `^\s*runtime\.sigpanic.*\n\s+.*\n\s*runtime\.sweepone`) ||
(log ~ `SIGSEGV: segmentation violation` && log ~ `^\s*runtime\.spanOfUnchecked\(`))
darwin/amd64
is supposed to be a first class port.@golang/runtime, should this be marked as a release-blocker for Go 1.22?
I may be misunderstanding the policy, but I think technically it's not a release-blocker since we released Go 1.21 with the same issue, so it's not new to this release. (The fact that we released Go 1.21 without figuring out what it is is on it's own something that we should maybe revisit, but I think that's where we are today.) I don't want to be the sole judge on this so I'll add it to the next weekly meeting and/or revisit this in triage.
@cherrymui spent a good bit of time investigating this and had a lot of difficulty reproducing. (Mentioning that here because I don't think that was recorded in this issue.) The crash rate is also fairly low, so it's difficult to continue to prioritize this against other things that need fixing, especially since the amount of time this bug could take to resolve is unknown and likely high. Again, not trying to make a decision, just trying to rationalize how we got here.
The other aspect to this is that it doesn't seem to have impacted downstream users much, if at all. It's possible that users are just ignoring it, but IIRC other Darwin issues that appeared on the builders have been reported in the past, such as the CPU profile related one.
This makes me wonder if maybe it's our postsubmit Mac machines specifically that are the problem. It also occurs to me that I haven't seen the same kinds of failures on the LUCI infrastructure. Those builders have been up long enough that I would've expected to see at least one such failure by now. They have a few failures from other unique sources, but I haven't seen anything like this yet.
I will say that it's possible the failure got missed on the LUCI builders because watchflakes
doesn't run against the LUCI infrastructure yet. At the very least I can do an ad-hoc-but-systematic analysis of all our LUCI darwin builds and see if anything crops up. If not, then that's an interesting data point.
Previously, the darwin
failures in #49692 were resolved by some combination of OS updates and perhaps better ventilation of the builder machines.
Obviously we don't have much control over the ventilation of Amazon's EC2 Mac Minis, though. It's unfortunate that so few Macs support ECC memory. 😒
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && goarch == "amd64" && (
log ~ `fatal error: found pointer to free object` ||
log ~ `fatal error: found bad pointer in Go heap` ||
log ~ `fatal error: workbuf is not empty` ||
log ~ `fatal error: workbuf is empty` ||
log ~ `schedule: holding locks` ||
log ~ `fatal error: bad summary data` ||
log ~ `fatal: morestack on gsignal` ||
log ~ `runtime\.\(\*mheap\)\.allocNeedsZero` ||
(log ~ `panic: runtime error: invalid memory address or nil pointer dereference` && log ~ `^\s*runtime\.sigpanic.*\n\s+.*\n\s*runtime\.sweepone`) ||
(log ~ `SIGSEGV: segmentation violation` && log ~ `^\s*runtime\.spanOfUnchecked\(`))
On the main repo, main branch, watchflakes doesn't find any failure matching this for most recent builds in 45 days. I could run it for longer time period and see.
Update: also no match on LUCI builders in a longer period of time.
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && goarch == "amd64" && (
log ~ `fatal error: found pointer to free object` ||
log ~ `fatal error: found bad pointer in Go heap` ||
log ~ `fatal error: workbuf is not empty` ||
log ~ `fatal error: workbuf is empty` ||
log ~ `schedule: holding locks` ||
log ~ `fatal error: bad summary data` ||
log ~ `fatal: morestack on gsignal` ||
log ~ `runtime\.\(\*mheap\)\.allocNeedsZero` ||
(log ~ `panic: runtime error: invalid memory address or nil pointer dereference` && log ~ `^\s*runtime\.sigpanic.*\n\s+.*\n\s*runtime\.sweepone`) ||
(log ~ `SIGSEGV: segmentation violation` && log ~ `^\s*runtime\.spanOfUnchecked\(`))
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && goarch == "amd64" && (
log ~ `fatal error: found pointer to free object` ||
log ~ `fatal error: found bad pointer in Go heap` ||
log ~ `fatal error: workbuf is not empty` ||
log ~ `fatal error: workbuf is empty` ||
log ~ `schedule: holding locks` ||
log ~ `fatal error: bad summary data` ||
log ~ `fatal: morestack on gsignal` ||
log ~ `runtime\.\(\*mheap\)\.allocNeedsZero` ||
(log ~ `panic: runtime error: invalid memory address or nil pointer dereference` && log ~ `^\s*runtime\.sigpanic.*\n\s+.*\n\s*runtime\.sweepone`) ||
(log ~ `SIGSEGV: segmentation violation` && log ~ `^\s*runtime\.spanOfUnchecked\(`))
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && goarch == "amd64" && (
log ~ `fatal error: found pointer to free object` ||
log ~ `fatal error: found bad pointer in Go heap` ||
log ~ `fatal error: workbuf is not empty` ||
log ~ `fatal error: workbuf is empty` ||
log ~ `schedule: holding locks` ||
log ~ `fatal error: bad summary data` ||
log ~ `fatal: morestack on gsignal` ||
log ~ `runtime\.\(\*mheap\)\.allocNeedsZero` ||
(log ~ `panic: runtime error: invalid memory address or nil pointer dereference` && log ~ `^\s*runtime\.sigpanic.*\n\s+.*\n\s*runtime\.sweepone`) ||
(log ~ `SIGSEGV: segmentation violation` && log ~ `^\s*runtime\.spanOfUnchecked\(`))
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && goarch == "amd64" && (
log ~ `fatal error: found pointer to free object` ||
log ~ `fatal error: found bad pointer in Go heap` ||
log ~ `fatal error: workbuf is not empty` ||
log ~ `fatal error: workbuf is empty` ||
log ~ `schedule: holding locks` ||
log ~ `fatal error: bad summary data` ||
log ~ `fatal: morestack on gsignal` ||
log ~ `runtime\.\(\*mheap\)\.allocNeedsZero` ||
(log ~ `panic: runtime error: invalid memory address or nil pointer dereference` && log ~ `^\s*runtime\.sigpanic.*\n\s+.*\n\s*runtime\.sweepone`) ||
(log ~ `panic: runtime error: invalid memory address or nil pointer dereference` && pkg == "cmd/compile/internal/ssa") ||
(log ~ `SIGSEGV: segmentation violation` && log ~ `^\s*runtime\.spanOfUnchecked\(`))
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && goarch == "amd64" && (
log ~ `fatal error: found pointer to free object` ||
log ~ `fatal error: found bad pointer in Go heap` ||
log ~ `fatal error: workbuf is not empty` ||
log ~ `fatal error: workbuf is empty` ||
log ~ `schedule: holding locks` ||
log ~ `fatal error: bad summary data` ||
log ~ `fatal: morestack on gsignal` ||
log ~ `runtime\.\(\*mheap\)\.allocNeedsZero` ||
(log ~ `panic: runtime error: invalid memory address or nil pointer dereference` && log ~ `^\s*runtime\.sigpanic.*\n\s+.*\n\s*runtime\.sweepone`) ||
(log ~ `panic: runtime error: invalid memory address or nil pointer dereference` && pkg == "cmd/compile/internal/ssa") ||
(log ~ `SIGSEGV: segmentation violation` && log ~ `^\s*runtime\.spanOfUnchecked\(`))
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && goarch == "amd64" && (
log ~ `fatal error: found pointer to free object` ||
log ~ `fatal error: found bad pointer in Go heap` ||
log ~ `fatal error: workbuf is not empty` ||
log ~ `fatal error: workbuf is empty` ||
log ~ `schedule: holding locks` ||
log ~ `fatal error: bad summary data` ||
log ~ `fatal: morestack on gsignal` ||
log ~ `runtime\.\(\*mheap\)\.allocNeedsZero` ||
(log ~ `panic: runtime error: invalid memory address or nil pointer dereference` && log ~ `^\s*runtime\.sigpanic.*\n\s+.*\n\s*runtime\.sweepone`) ||
(log ~ `panic: runtime error: invalid memory address or nil pointer dereference` && pkg == "cmd/compile/internal/ssa") ||
(log ~ `SIGSEGV: segmentation violation` && log ~ `^\s*runtime\.spanOfUnchecked\(`))
Issue created automatically to collect these failures.
Example (log):
— watchflakes