Open bcmills opened 8 months ago
In triage, we think maybe this and the darwin memory corruption issues would be a good target for our friction fixit week time. We think the next step is to attempt to reproduce the issue in C and see if we need to escalate this as a kernel bug or something.
I randomly stumbled across #25696 today.
It isn't clear to me whether that is related, since it is using net
pipes instead of os
pipes. π€
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && log ~ `os/exec\.\(\*Cmd\)\.awaitGoroutines` && log ~ `internal/poll\.\(\*pollDesc\)\.waitRead`
β watchflakes
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && log ~ `os/exec\.\(\*Cmd\)\.awaitGoroutines` && log ~ `internal/poll\.\(\*pollDesc\)\.waitRead`
β watchflakes
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && log ~ `os/exec\.\(\*Cmd\)\.awaitGoroutines` && log ~ `internal/poll\.\(\*pollDesc\)\.waitRead`
β watchflakes
I think this is caused by an incomplete fix for #24164. We avoid adding FDs of type S_IFIFO
to the poller if they are marked as kindOpenFile
, but the FDs opened by os.Pipe
are marked as kindPipe
instead (even though they are also of type S_IFIFO
β see https://go.dev/play/p/6kSh4pJj7ef).
This is probably also related to the difference in unix.Select
behavior recently reported in https://groups.google.com/g/golang-nuts/c/wzh-nwH9dH8.
I'm not sure I agree with your conclusion.
Despite the similarity in inter-process communication, Pipes and FIFOs are fundamentally different: Pipes are created in memory, and their existence is transient while FIFOs persist as special files in the file system until they are removed. And kqueue
works according to the manual when it comes to Pipes:
When the last writer disconnects, the filter will set EV_EOF in flags. This may be cleared by passing in EV_CLEAR, at which point the filter will resume waiting ing for data to become available before returning.
The filter will set EV_EOF when the reader disconnects, and for the fifo case, this may be cleared by use of EV_CLEAR. Note that this filter is not supported for vnodes.
, which is why TestPipeEOF has been succeeding.
The only problem is FIFO, kqueue
doesn't seem to comply with its manuals when working with FIFOs, I assume that has something to do with the fact that FIFOs persist as special files in the file system. Thus, I'm inclined to blame this flake on macOS kernel.
@bcmills
Huh. Looking a bit more closely:
t.Fatal
instead of allowing itself to time out, it's difficult to tell.)So it seems plausible that either this is a composite of more than one bug, or perhaps a general race condition in the poller (not specific to pipes).
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && log ~ `os/exec\.\(\*Cmd\)\.awaitGoroutines` && log ~ `internal/poll\.\(\*pollDesc\)\.waitRead`
β watchflakes
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && log ~ `os/exec\.\(\*Cmd\)\.awaitGoroutines` && log ~ `internal/poll\.\(\*pollDesc\)\.waitRead`
β watchflakes
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && log ~ `os/exec\.\(\*Cmd\)\.awaitGoroutines` && log ~ `internal/poll\.\(\*pollDesc\)\.waitRead`
β watchflakes
I think this is caused by an incomplete fix for #24164. We avoid adding FDs of type
S_IFIFO
to the poller if they are marked askindOpenFile
, but the FDs opened byos.Pipe
are marked askindPipe
instead (even though they are also of typeS_IFIFO
β see https://go.dev/play/p/6kSh4pJj7ef).This is probably also related to the difference in
unix.Select
behavior recently reported in https://groups.google.com/g/golang-nuts/c/wzh-nwH9dH8.
Although a new patch that stops pipes being added into netpoll
like #24164 and the rationale behind it can't explain the bug here on the basis of my comment, but before we figure out what really happened with pipes in kqueue
, something like that can be a temporary solution that's a good chance to make these test flakes go away becausekqueue
doesn't look like to be stable when it comes to pipes.
I can draft a CL, in which we knock off pipes from netpoll
on macOS and leave a TODO
comment if we eventually decide to go for it.
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && log ~ `os/exec\.\(\*Cmd\)\.awaitGoroutines` && log ~ `internal/poll\.\(\*pollDesc\)\.waitRead`
β watchflakes
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && log ~ `os/exec\.\(\*Cmd\)\.awaitGoroutines` && log ~ `internal/poll\.\(\*pollDesc\)\.waitRead`
β watchflakes
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && log ~ `os/exec\.\(\*Cmd\)\.awaitGoroutines` && log ~ `internal/poll\.\(\*pollDesc\)\.waitRead`
β watchflakes
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && log ~ `os/exec\.\(\*Cmd\)\.awaitGoroutines` && log ~ `internal/poll\.\(\*pollDesc\)\.waitRead` &&
!(builder ~ `(gotip|go1\.\d\d)-` && date < "2024-01-18")
β watchflakes
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && log ~ `os/exec\.\(\*Cmd\)\.awaitGoroutines` && log ~ `internal/poll\.\(\*pollDesc\)\.waitRead` &&
!(builder ~ `(gotip|go1\.\d\d)-` && date < "2024-01-18")
β watchflakes
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && log ~ `os/exec\.\(\*Cmd\)\.awaitGoroutines` && log ~ `internal/poll\.\(\*pollDesc\)\.waitRead` &&
!(builder ~ `(gotip|go1\.\d\d)-` && date < "2024-01-18")
β watchflakes
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && log ~ `os/exec\.\(\*Cmd\)\.awaitGoroutines` && log ~ `internal/poll\.\(\*pollDesc\)\.waitRead` &&
!(builder ~ `(gotip|go1\.\d\d)-` && date < "2024-01-18")
β watchflakes
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && log ~ `os/exec\.\(\*Cmd\)\.awaitGoroutines` && log ~ `internal/poll\.\(\*pollDesc\)\.waitRead` &&
!(builder ~ `(gotip|go1\.\d\d)-` && date < "2024-01-18")
β watchflakes
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && log ~ `os/exec\.\(\*Cmd\)\.awaitGoroutines` && log ~ `internal/poll\.\(\*pollDesc\)\.waitRead` &&
!(builder ~ `(gotip|go1\.\d\d)-` && date < "2024-01-18")
β watchflakes
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && log ~ `os/exec\.\(\*Cmd\)\.awaitGoroutines` && log ~ `internal/poll\.\(\*pollDesc\)\.waitRead` &&
!(builder ~ `(gotip|go1\.\d\d)-` && date < "2024-01-18")
β watchflakes
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && log ~ `os/exec\.\(\*Cmd\)\.awaitGoroutines` && log ~ `internal/poll\.\(\*pollDesc\)\.waitRead` &&
!(builder ~ `(gotip|go1\.\d\d)-` && date < "2024-01-18")
β watchflakes
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && log ~ `os/exec\.\(\*Cmd\)\.awaitGoroutines` && log ~ `internal/poll\.\(\*pollDesc\)\.waitRead` &&
!(builder ~ `(gotip|go1\.\d\d)-` && date < "2024-01-18")
β watchflakes
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && log ~ `os/exec\.\(\*Cmd\)\.awaitGoroutines` && log ~ `internal/poll\.\(\*pollDesc\)\.waitRead` &&
!(builder ~ `(gotip|go1\.\d\d)-` && date < "2024-01-18")
β watchflakes
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && log ~ `os/exec\.\(\*Cmd\)\.awaitGoroutines` && log ~ `internal/poll\.\(\*pollDesc\)\.waitRead` &&
!(builder ~ `(gotip|go1\.\d\d)-` && date < "2024-01-18")
β watchflakes
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && log ~ `os/exec\.\(\*Cmd\)\.awaitGoroutines` && log ~ `internal/poll\.\(\*pollDesc\)\.waitRead` &&
!(builder ~ `(gotip|go1\.\d\d)-` && date < "2024-01-18")
β watchflakes
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && log ~ `os/exec\.\(\*Cmd\)\.awaitGoroutines` && log ~ `internal/poll\.\(\*pollDesc\)\.waitRead` &&
!(builder ~ `(gotip|go1\.\d\d)-` && date < "2024-01-18")
β watchflakes
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "darwin" && log ~ `os/exec\.\(\*Cmd\)\.awaitGoroutines` && log ~ `internal/poll\.\(\*pollDesc\)\.waitRead` &&
!(builder ~ `(gotip|go1\.\d\d)-` && date < "2024-01-18")
β watchflakes
We have been tracking a long-running bug on macOS (
darwin
) in https://github.com/golang/go/issues/54461#issuecomment-1591922415. That issue title doesn't accurately capture the widespread nature of the problem, nor its defining symptom or suspected root cause; so, I am filing a new issue for it.On macOS, we see a pattern of test failures with the following characteristics:
read
operation on a pipe.Many of these bugs have been worked around by explicitly canceling (or adding a timeout to) the stuck read.
The issues with this symptom have included:
60099
54461
61073
61128
A few other issues may be related but don't (or didn't) include goroutine dumps for the stuck process: they may or may not be due to this failure mode.
56231
57369
59657
61595
61779
63258
63731
64110
64269
64700
64959
Given the symptoms, it seems likely to me that this is either a bug in
internal/poll
, or a bug in the macOS platform that we have somehow started to trigger.Note that #61779 was bisected specifically to https://go.dev/cl/420334.