Open rsc opened 2 years ago
I've seen this a lot too: exact same status, with the failed test passing on rerun. I see two places where this occurs in the source, one marked with a TODO. Other discussions mention this could be a bug as well.
TODO in source links to #48127 cmd/go: -keepfuzzing needs renaming, does not exist
@julieqiu @toothrot should this be tagged fuzz?
(issue still present exactly as described as of 1.20.x)
I have a repro for this issue, as described by @rsc.
In my specific case it usually fails within seconds, during the gathering baseline coverage phase.
In my case it looks like a race condition on the coordinator/worker pipe:
workerClient gets an EOF error when the decoder tries to read from wc.fuzzOut (at worker.go::callLocked L1158).
The error reaches coordinate L186 as err (not printed).
The message that is reported instead is: fuzzing process hung or terminated unexpectedly: exit status 2
@katiehockman @rolandshoemaker @jayconrod @bcmills : I see you authored most of the fuzzing code. Any chance you want to take a look? I don't want to share my repro here, but I'm happy to share privately.
@mprimi, note that of all the folks you've tagged only @rolandshoemaker and I are still on the Go team. 😅
(@golang/fuzzing is the right entity to tag for this sort of issue.)
@mprimi, I'm not sure I quite follow. What triggers the suspected race condition? (Is it caused by a worker that finds a crashing input, and finishes crashing before the coordinator has read that input?)
@bcmills
The original issue (by @rsc) describes a case where the fuzzer misbehaves and terminates with fuzzing process hung or terminated unexpectedly: exit status 2.
It leaves behind a seed, but upon re-running the same seed, the test passes. Something funny is going on.
This is NOT a case of:
The issue described seems like a bug in the fuzzer itself where something goes wrong at the worker level, but it's not clear what.
I posted a response here because:
go test -v -fuzz=Decode image/gif
, no longer reproduces the issue) EOF.
If anyone from @golang/fuzzing is willing to take a look, I can share a (GitHub) link to a test that reproduces the issue reliably. (Even though it's public code, I'll share that link privately; I don't want to link to my repro from this issue.)
Any resolution or update on this? It's blocking my usage of Go fuzzing.
EDIT: Using -parallel=1 seemed to prevent the crashes, but that's significantly slower.
I ran the test case through delve and set a breakpoint where the crash seems to originate.
1) Compile fuzz test into executable with go test -c -o test -fuzz MyFuzzTest -gcflags=all="-N -l" ./mytests/...
2) Run dlv exec ./test -- -test.v -test.fuzz MyFuzzTest -test.run MyFuzzTest -test.fuzzcachedir ./fuzz/cache
3) Enter b /opt/homebrew/Cellar/go@1.20/1.20.11/libexec/src/internal/fuzz/worker.go:186 to set the breakpoint
Frame 1: /opt/homebrew/Cellar/go@1.20/1.20.11/libexec/src/internal/fuzz/worker.go:186 (PC: 102cc84c4)
181: return err
182: }
183: // Unexpected termination. Set error message and fall through.
184: // We'll restart the worker on the next iteration.
185: // Don't attempt to minimize this since it crashed the worker.
=> 186: resp.Err = fmt.Sprintf("fuzzing process hung or terminated unexpectedly: %v", w.waitErr)
187: canMinimize = false
188: }
189: result := fuzzResult{
190: limit: input.limit,
191: count: resp.Count,
(dlv) print w.waitErr
error(*os/exec.ExitError) *{
ProcessState: *os.ProcessState {
pid: 50169,
status: 512,
rusage: *(*syscall.Rusage)(0x14000740000),},
Stderr: []uint8 len: 0, cap: 0, nil,}
(dlv) print err
error(*errors.errorString) *{s: "EOF"}
EDIT: OS info: Darwin arm64
I wonder if that issue is somehow architecture-specific? I hit it almost every minute on a Mac mini with an Intel processor, but can't replicate it with the same code on Apple silicon.
EDIT: never mind, it fails in the same way on Apple silicon; it just takes hours, not minutes.
I think there are two related issues that get mixed up. The cause for the fuzzer crashing can either be:
I was able to extract the following stack trace for case 2 by using strace:
panic: deadlocked!
goroutine 19 [running]:
internal/fuzz.RunFuzzWorker.func1.1()
/usr/lib/go-1.21/src/internal/fuzz/worker.go:493 +0x25
created by time.goFunc
/usr/lib/go-1.21/src/time/sleep.go:176 +0x2d
After this crash the worker process exits with exit code 2 according to strace.
I created a workaround patch that avoids this crash, and at least for image/gif the fuzzer crashes go away.
This is the commit that introduced the behavior of crashing if a test case takes longer than 10s to execute.
We still need a real fix, though, to keep the whole fuzzer from stopping when this panic is thrown. There seems to be another bug somewhere. Potentially it has to do with the fact that the panic is thrown in a timer.
If you want to help test this workaround:
To summarize: This is not a bug but expected behavior until https://github.com/golang/go/issues/48157 is fixed. I would recommend closing this issue and continuing work on a PR for that.
Maybe we should improve the error message. We can add a field to the serialized data sent between the worker and the coordinator that indicates a hang. That may be difficult, though, because we would need to interrupt the current worker and then return an error.
go test -v -fuzz=Decode image/gif
consistently produces output like:
Of course the test case changes each time, but when I rerun it using 'go test' the test case passes fine. This happens in the current Go dev branch too. I tried Go 1.19 to make sure it wasn't related to changes I have made to package testing.