Don't propagate `TraceeDied` errors when recording coverage

ranweiler commented 3 years ago

When recording coverage or input testing on Linux, we must always be prepared for any tracing operation to fail due to an unreported or not-yet-reported tracee task exit. Right now, we sometimes treat acceptable tracing errors as hard failures, or treat unacceptable task results as successful.

Current examples:

Not handled, but should be: https://github.com/microsoft/onefuzz/blob/aa4ed2893e3dcb656eab141c68af7dd377b48475/src/agent/coverage/src/block/linux.rs#L86
Handled correctly (but should maybe only log a warning): https://github.com/microsoft/onefuzz/blob/aa4ed2893e3dcb656eab141c68af7dd377b48475/src/agent/coverage/src/block/linux.rs#L108-L110

The call to Ptracer::wait() should change to look like the second case, because it may internally invoke ptrace(2) and fail with ESRCH when the tracee exits (it does more than just wait(2)).

However, in all the tasks that trace targets, we must take care to identify cases where tracing failed logically but the tracing functions did not literally return an error. This can be checked heuristically, e.g. by ensuring that recorded coverage is nonzero, that at least one tracee process was created, that some syscalls were invoked, &c.
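To make the heuristic concrete, here is a rough sketch of such a post-run check. Everything in it is hypothetical: `TraceSummary`, `TrivialResult`, and `validate_nontrivial` are illustration-only names, not existing OneFuzz or pete types, and the exact set of checks would depend on the task.

```rust
/// Hypothetical summary of what one tracing run observed.
/// (Illustration only; not an existing OneFuzz type.)
#[derive(Debug, Default)]
struct TraceSummary {
    tracees_created: usize,
    syscalls_seen: usize,
    blocks_covered: usize,
}

/// Hypothetical reasons for treating a run as a logical failure,
/// even though no tracing call returned an error.
#[derive(Debug, PartialEq)]
enum TrivialResult {
    NoTracee,
    NoSyscalls,
    NoCoverage,
}

/// Returns `Ok(())` only if the run produced nontrivial results.
/// Checks run from the most basic signal to the most specific one.
fn validate_nontrivial(summary: &TraceSummary) -> Result<(), TrivialResult> {
    if summary.tracees_created == 0 {
        return Err(TrivialResult::NoTracee);
    }
    if summary.syscalls_seen == 0 {
        return Err(TrivialResult::NoSyscalls);
    }
    if summary.blocks_covered == 0 {
        return Err(TrivialResult::NoCoverage);
    }
    Ok(())
}
```

A task would call something like this after tracing finishes, whether or not a `TraceeDied` warning was logged along the way, and fail the input only if the check fails.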

Note that pete now returns a TraceeDied variant in the exact cases where we only want to warn on error, then continue and validate that task results were nontrivial.

AB#35975

mgreisen commented 1 year ago

@ranweiler, is this still an issue?

ranweiler commented 1 year ago

@mgreisen, for context, this is a Linux-only edge case that is currently mitigated by retries, but it is still valid.

For OneFuzz to hit this on a single input, a (Linux) task would have to have its target tracee killed repeatedly by an external process while the tracee is in a ptrace-stop state. In the context of OneFuzz, this should never happen.

The improvement we can make is, when wait()-ing on ptrace stops here, check to see if the error variant is TraceeDied, and return Ok(()) if so. Otherwise, propagate the error.
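Roughly, the shape of that change could look like the sketch below. It does not use the real pete API: `TraceError` is a hypothetical stand-in that only assumes an error type with a `TraceeDied` variant, and the `wait` closure is a hypothetical stand-in for the `wait()` call on the tracer.

```rust
use std::fmt;

/// Hypothetical stand-in for the tracer's error type. Only the existence
/// of a `TraceeDied` variant is taken from the discussion above.
#[derive(Debug, PartialEq)]
enum TraceError {
    TraceeDied,
    Other(String),
}

impl fmt::Display for TraceError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            TraceError::TraceeDied => write!(f, "tracee died"),
            TraceError::Other(msg) => write!(f, "tracing error: {}", msg),
        }
    }
}

impl std::error::Error for TraceError {}

/// Drives a wait loop and returns the number of stops handled.
/// `wait` is a hypothetical stand-in for the tracer's `wait()`: it yields
/// `Ok(Some(pid))` for a ptrace-stop and `Ok(None)` when no tracees remain.
fn record_stops<F>(mut wait: F) -> Result<usize, TraceError>
where
    F: FnMut() -> Result<Option<u32>, TraceError>,
{
    let mut stops = 0;
    loop {
        match wait() {
            Ok(Some(_pid)) => stops += 1,
            Ok(None) => return Ok(stops),
            // Acceptable: the tracee exited underneath us. Warn and stop
            // tracing; the caller validates the results afterwards.
            Err(TraceError::TraceeDied) => {
                eprintln!("warning: tracee died while tracing; stopping early");
                return Ok(stops);
            }
            // Anything else is a real failure and is propagated.
            Err(other) => return Err(other),
        }
    }
}
```

Returning the stop count instead of `()` is only there to make the sketch easy to exercise; the point is that `TraceeDied` ends the loop successfully while every other error still propagates.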

I'll re-assign this to myself to implement that change.

The other half of this is specific to the OneFuzz task worker, and has been split out in #2926.

Don't propagate `TraceeDied` errors when recording coverage #1044