googleprojectzero / fuzzilli

A JavaScript Engine Fuzzer

Investigate high fuzzer overhead #174

Open · saelo opened this issue 3 years ago

saelo commented 3 years ago

Since commit 1408aab353b3a7f54b5a4e1b4471e054d615adcf, Fuzzilli computes and displays the "fuzzer overhead", i.e. the fraction of time that is not spent executing JavaScript code in the target engine. Normal values seem to be roughly between 5% and 15%. However, in long fuzzing sessions and seemingly especially in multithreaded mode (e.g. --jobs=32), this number can become quite significant (approaching 50%). This should be investigated.

Zon8Research commented 3 years ago

I also get high overhead after a day or so when using --jobs. Is there anything I can do to help debug this?

Fuzzer Overhead:              76.39%

saelo commented 3 years ago

You'd probably need to use some kind of profiler (e.g. perf on Linux) to figure out where the CPU time is spent, and if there's a bug there that we can fix.

Alternatively, you can use network synchronization (and maybe a low --jobs value on each node), which doesn't seem to suffer from this problem as much (--jobs is still marked as "experimental" due to this issue).

WilliamParks commented 3 years ago

I did some initial investigation a couple of weeks ago, using perf to trace a long-running session.

The largest win was switching from fork to vfork in libreprl. I'm not 100% sure why this was happening, but the kernel seemed to take more and more time to fork as memory usage increased when the job count was high (64 in my case). Switching to vfork seemed to reduce the high overhead for long-running sessions with high job counts.
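
As a rough illustration of the change being described here (a minimal, hypothetical sketch, not libreprl's actual code), the spawn path would look something like the following. With vfork(), the child borrows the parent's address space until it calls execve() or _exit(), so the kernel skips the page-table duplication that fork() has to do:

```c
#include <unistd.h>

// Hypothetical sketch of a vfork()-based spawn path (not libreprl's actual
// code). After vfork(), the child shares the parent's address space, so it
// must not modify any memory (other than the pid_t return value) and may only
// call execve() or _exit().
static pid_t spawn_target(char *const argv[], char *const envp[]) {
    pid_t pid = vfork();
    if (pid == 0) {
        execve(argv[0], argv, envp);
        _exit(127);            // only reached if execve() failed
    }
    return pid;                // parent: child pid, or -1 on error
}
```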

Fuzzilli was also spending a significant amount of compute in the JavascriptLifter, on inlining and on determining which variables should be let vs. const. I'm not sure whether removing these would reduce the overall effectiveness of Fuzzilli, however.

saelo commented 3 years ago

Oh wow, great find! Yeah, using vfork does indeed seem to yield a considerable improvement in performance, and I guess it should be fine to use since the child process doesn't modify any global memory before calling execve (afaik, the only difference between the two on Linux is that page tables aren't modified/duplicated). My initial guess as to why this gives such a huge boost is that the kernel has to take some lock related to the page tables when performing a fork, which then probably causes many of the other fuzzing threads to block on it. Once a fair number of JIT-related samples are in the corpus, the number of timeouts (e.g. due to infinite loops), and subsequently the number of child process restarts, becomes large enough for this to be an issue. But I'm just guessing here.

I'll put together a PR to switch to vfork on Linux. I think we can keep using fork everywhere else for now, though, since it's probably not too important there, and e.g. the macOS man page for vfork states:

ERRORS
     The vfork() system call will fail for any of the reasons described in the fork man page.  In addition, it will fail if:

     [EINVAL]           A system call other than _exit() or execve() (or libc functions that make no system calls other than those) is called following calling a vfork() call.
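
A hypothetical sketch of what such a Linux-only switch might look like (illustrative only, not the actual PR):

```c
#include <unistd.h>

// Hypothetical sketch, not the actual patch: use vfork() on Linux, where it
// avoids the fork-time page-table work, and keep plain fork() elsewhere
// (e.g. macOS), where vfork() imposes stricter restrictions on the child.
static pid_t reprl_fork_child(void) {
#if defined(__linux__)
    return vfork();   // child may only call execve()/_exit() afterwards
#else
    return fork();    // conservative default on other platforms
#endif
}
```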