golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

testing: fuzz reports spurious "process hung or terminated unexpectedly" errors #56238

Open rsc opened 2 years ago

rsc commented 2 years ago

go test -v -fuzz=Decode image/gif

consistently produces output like:

...
=== RUN   TestEncodeWrappedImage
--- PASS: TestEncodeWrappedImage (0.06s)
=== FUZZ  FuzzDecode
fuzz: elapsed: 0s, gathering baseline coverage: 0/5 completed
fuzz: elapsed: 3s, gathering baseline coverage: 2/5 completed
fuzz: elapsed: 6s, gathering baseline coverage: 4/5 completed
fuzz: elapsed: 9s, gathering baseline coverage: 5/5 completed, now fuzzing with 16 workers
fuzz: elapsed: 9s, execs: 25 (7/sec), new interesting: 0 (total: 5)
fuzz: elapsed: 12s, execs: 26 (0/sec), new interesting: 0 (total: 5)
fuzz: elapsed: 15s, execs: 28 (1/sec), new interesting: 0 (total: 5)
fuzz: elapsed: 18s, execs: 28 (0/sec), new interesting: 0 (total: 5)
fuzz: elapsed: 20s, execs: 29 (1/sec), new interesting: 0 (total: 5)
--- FAIL: FuzzDecode (19.88s)
    fuzzing process hung or terminated unexpectedly: exit status 2
    Failing input written to testdata/fuzz/FuzzDecode/a601188e726161766a8b077333fe62493bc92397a4c3bf5b205ac46814029453
    To re-run:
    go test -run=FuzzDecode/a601188e726161766a8b077333fe62493bc92397a4c3bf5b205ac46814029453
FAIL

Of course the test case changes each time, but when I rerun it using 'go test' the test case passes fine. This happens in the current Go dev branch too. I tried Go 1.19 to make sure it wasn't related to changes I have made to package testing.
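For context, a minimal sketch of what a fuzz target of this general shape looks like (hypothetical and simplified; the real FuzzDecode in image/gif is more thorough, but the point is the same: the fuzz function decodes arbitrary bytes and must neither panic nor hang):

    package gif_test

    import (
        "bytes"
        "image/gif"
        "testing"
    )

    // Simplified sketch of a Decode fuzz target; not the actual image/gif test.
    func FuzzDecode(f *testing.F) {
        f.Fuzz(func(t *testing.T, b []byte) {
            // Decoding arbitrary bytes may return an error, but must not
            // panic or hang; the fuzzing engine flags anything else.
            _, _ = gif.Decode(bytes.NewReader(b))
        })
    }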

ZackaryWelch commented 1 year ago

I've seen this a lot too: the exact same status, and the failed test passing on rerun. I see two places where this occurs in the source, one marked with a TODO. Other discussions mention this could be a bug as well.

ZackaryWelch commented 1 year ago

The TODO in the source links to #48127 ("cmd/go: -keepfuzzing needs renaming, does not exist").

mprimi commented 1 year ago

@julieqiu @toothrot should this be tagged fuzz? (issue still present exactly as described as of 1.20.x)

mprimi commented 1 year ago

I have a repro for this issue, as described by @rsc.

In my specific case, it usually fails within seconds, during the "gathering baseline coverage" phase.

In my case it looks like a race condition on the coordinator/worker pipe:

@katiehockman @rolandshoemaker @jayconrod @bcmills: I see you authored most of the fuzzing code. Any chance you want to take a look? I don't want to share my repro here, but I'm happy to share privately.

bcmills commented 1 year ago

@mprimi, note that of all the folks you've tagged only @rolandshoemaker and I are still on the Go team. 😅

(@golang/fuzzing is the right entity to tag for this sort of issue.)

bcmills commented 1 year ago

@mprimi, I'm not sure I quite follow. What triggers the suspected race condition? (Is it caused by a worker that finds a crashing input, and finishes crashing before the coordinator has read that input?)

mprimi commented 1 year ago

@bcmills The original issue (by @rsc) describes a case where the fuzzer misbehaves and terminates with fuzzing process hung or terminated unexpectedly: exit status 2. It leaves behind a seed input, but upon re-running that same input, it passes. Something funny is going on.

This is NOT a case of:

The issue described seems like a bug in the fuzzer itself where something goes wrong at the worker level, but it's not clear what.

I posted a response here because:

If anyone from @golang/fuzzing is willing to take a look, I can share a (GitHub) link to a test that reproduces the issue reliably. (Even though it's public code, I'll share that link privately; I don't want to link to my repro from this issue.)

0xalpharush commented 11 months ago

Any resolution or update on this? It's blocking my usage of go fuzzing

EDIT: Using -parallel=1 seemed to prevent the crashes, but it is significantly slower.
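For reference, the workaround invocation looks like this (MyFuzzTest and ./mytests/... are placeholders for your own target, reusing the names from the delve steps below):

    go test -fuzz MyFuzzTest -run MyFuzzTest -parallel=1 ./mytests/...

With -parallel=1 only a single fuzzing worker process runs at a time, which is what limits throughput.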

I ran the test case through delve and set a breakpoint where the crash seems to originate:

  1. Compile the fuzz test into an executable: go test -c -o test -fuzz MyFuzzTest -gcflags=all="-N -l" ./mytests/...
  2. Run dlv exec ./test -- -test.v -test.fuzz MyFuzzTest -test.run MyFuzzTest -test.fuzzcachedir ./fuzz/cache
  3. Enter b /opt/homebrew/Cellar/go@1.20/1.20.11/libexec/src/internal/fuzz/worker.go:186

Frame 1: /opt/homebrew/Cellar/go@1.20/1.20.11/libexec/src/internal/fuzz/worker.go:186 (PC: 102cc84c4)
   181:                                         return err
   182:                                 }
   183:                                 // Unexpected termination. Set error message and fall through.
   184:                                 // We'll restart the worker on the next iteration.
   185:                                 // Don't attempt to minimize this since it crashed the worker.
=> 186:                                 resp.Err = fmt.Sprintf("fuzzing process hung or terminated unexpectedly: %v", w.waitErr)
   187:                                 canMinimize = false
   188:                         }
   189:                         result := fuzzResult{
   190:                                 limit:         input.limit,
   191:                                 count:         resp.Count,
(dlv) print w.waitErr
error(*os/exec.ExitError) *{
        ProcessState: *os.ProcessState {
                pid: 50169,
                status: 512,
                rusage: *(*syscall.Rusage)(0x14000740000),},
        Stderr: []uint8 len: 0, cap: 0, nil,}
(dlv) print err
error(*errors.errorString) *{s: "EOF"}

EDIT: OS info: Darwin arm64

AlekSi commented 9 months ago

I wonder if this issue is somehow architecture-specific? I hit it almost every minute on a Mac mini with an Intel processor, but couldn't replicate it with the same code on Apple silicon.

EDIT: never mind, it fails in the same way on Apple silicon; it just takes hours rather than minutes.

maxammann commented 8 months ago

I think there are two related issues that get mixed up here. The fuzzer crash can have either of two causes:

  1. An OOM kill, which results in the worker process being killed (check your dmesg output for OOM events). That is likely the issue here.
  2. The worker process hits a panic and exits with exit code 2.

I was able to extract the following stack trace for case 2 by using strace:

panic: deadlocked!

goroutine 19 [running]:
internal/fuzz.RunFuzzWorker.func1.1()
    /usr/lib/go-1.21/src/internal/fuzz/worker.go:493 +0x25
created by time.goFunc
    /usr/lib/go-1.21/src/time/sleep.go:176 +0x2d

After this crash, the worker process exits with exit code 2 according to strace. I created a workaround patch that avoids this crash, and at least for image/gif the fuzzer crashes go away.

This is the commit that introduced the behavior of crashing if a test case takes longer than 10s to execute.
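For anyone trying to reproduce the second case in isolation, a hypothetical fuzz target like the following (not from this issue) should trip that limit: any input whose execution exceeds roughly 10s makes the worker panic, and the coordinator then prints the "fuzzing process hung or terminated unexpectedly" message.

    package slow_test

    import (
        "testing"
        "time"
    )

    // Hypothetical target: any execution longer than the worker's internal
    // per-input time limit (about 10s) triggers the "deadlocked!" panic above.
    func FuzzSlow(f *testing.F) {
        f.Fuzz(func(t *testing.T, b []byte) {
            if len(b) > 0 && b[0] == 0xff {
                time.Sleep(11 * time.Second) // deliberately exceed the limit
            }
        })
    }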

We still need a real fix, though, to prevent the whole fuzzer from stopping when this panic is thrown. There seems to be another bug somewhere; potentially it has to do with the fact that the panic is thrown in a timer.
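The stack trace above is consistent with that last point: the panic is raised in a callback started by time.goFunc, i.e. on a goroutine created by a timer. In Go, a panic that is not recovered on the goroutine where it occurs terminates the whole process with exit status 2, which matches what strace shows for the worker. A minimal, standalone illustration (unrelated to the fuzzer code):

    package main

    import "time"

    func main() {
        // The callback runs on a goroutine started by the timer (time.goFunc).
        // Nothing outside that goroutine can recover the panic, so the whole
        // process dies with exit status 2.
        time.AfterFunc(100*time.Millisecond, func() {
            panic("deadlocked!")
        })
        time.Sleep(time.Second)
    }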

maxammann commented 8 months ago

If you want to help test this workaround:

maxammann commented 8 months ago

To summarize: this is not a bug but expected behavior until https://github.com/golang/go/issues/48157 is fixed. I would recommend closing this issue and continuing to work on a PR for that.

Maybe we should improve the error message. We could add a field to the serialized data sent between the worker and the coordinator that indicates a hang. That may be difficult, though, because we would need to interrupt the current worker and then return an error.
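As a rough sketch of that suggestion (type and field names here are made up and do not match the real internal/fuzz types), the worker's response could carry an explicit hang flag so the coordinator can report a hang distinctly instead of the generic message:

    package fuzz

    import "fmt"

    // Hypothetical response type; the real serialized struct in internal/fuzz
    // is different. The Hung field is the illustrative addition.
    type workerResponse struct {
        Count int64  // executions performed for this input
        Err   string // failure message, if any
        Hung  bool   // set by the worker when the input exceeded its time limit
    }

    // describeResult sketches how the coordinator could surface a hang.
    func describeResult(resp workerResponse) string {
        if resp.Hung {
            return fmt.Sprintf("input exceeded the per-input time limit after %d execs (likely hang)", resp.Count)
        }
        return resp.Err
    }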