google / AFL

american fuzzy lop - a security-oriented fuzzer
https://lcamtuf.coredump.cx/afl/
Apache License 2.0

Accelerate coverage processing with hot-path vectorization #126

Open hghwng opened 3 years ago

hghwng commented 3 years ago

Summary

afl-fuzz spends a large share of its time processing coverage. Because most of the counters processed are zero and most executions do not need to be saved, most of the logic AFL performs is redundant. In my measurements, the coverage pipeline can occupy up to 90% of afl-fuzz's user-space time.

This pull request accelerates the most time-consuming part of AFL with a 3-tiered design. On coverage files collected from libxml2, the modified algorithm achieves a speedup of 54%. With AVX2 and AVX512, the speedup increases to 388% and 603% respectively.

See experiments for microbenchmark.

Background

Most executions in fuzzing do not yield any new result. However, afl-fuzz still processes the coverage in two passes: the first pass classifies the raw counters into bucketed bitmaps (see run_target), and the second pass scans for new program behavior (has_new_bits).

The first pass's output, i.e., the classified trace_bits, is seldom used beyond has_new_bits, so the memory writes to the bitmap are wasted in most cases.

The second pass involves control flow transfers (ret) and side effects (updates to virgin). Therefore, compilers cannot efficiently optimize the hot path with loop vectorization.
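
For reference, a minimal sketch of the original two-pass flow (byte-wise and simplified for clarity; the real code works on 32/64-bit words, distinguishes new tuples from new hit counts, and uses AFL's actual MAP_SIZE and bucket table — the names below are stand-ins):

/* Simplified sketch of the original two-pass coverage processing. */
#include <stdint.h>
#include <stddef.h>

#define MAP_SIZE_SKETCH 65536

static uint8_t trace_bits[MAP_SIZE_SKETCH];   /* raw hit counters from the target */
static uint8_t virgin_bits[MAP_SIZE_SKETCH];  /* unseen behavior, starts as all 0xff */
static uint8_t count_class_lookup8[256];      /* filled with AFL's bucket mapping at startup */

/* Pass 1: rewrite every counter into its bucket, even when all counters are zero. */
static void classify_counts_sketch(void) {
  for (size_t i = 0; i < MAP_SIZE_SKETCH; i++)
    trace_bits[i] = count_class_lookup8[trace_bits[i]];
}

/* Pass 2: re-read the classified map and compare it against the virgin map. */
static int has_new_bits_sketch(void) {
  int ret = 0;
  for (size_t i = 0; i < MAP_SIZE_SKETCH; i++)
    if (trace_bits[i] & virgin_bits[i]) {
      virgin_bits[i] &= ~trace_bits[i];
      ret = 1;
    }
  return ret;
}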

Design

This pull request accelerates the most time-consuming part of AFL with a 3-tiered design. A lower tier is simpler and enables more optimization opportunities; a higher tier is slower but can handle more complicated cases.

For example, on CPUs with AVX512:

  1. For the most frequent case (all zeros), the vectorized scan processes 64 bytes (8 * u64) per iteration.
  2. For a nonzero zmm, the comparison mask is used to locate the nonzero u64s inside the zmm. The nonzero parts are read, classified, and compared with the virgin map.
  3. Only when the above hot path fails is the original AFL algorithm invoked to process the current zmm. When the next zmm is ready (we cannot switch immediately because of memory alignment requirements), the slow path switches back to the fast path. A simplified sketch follows the list.
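
A minimal sketch of this tiering, assuming AVX512F, 64-byte-aligned maps whose size is a multiple of 64 bytes, and the PR's classify_word declared only as a prototype. Tier 3 is simplified here (it reports a potentially new behavior instead of falling back to the original byte-wise algorithm as the actual patch does):

/* Compile with -mavx512f (or -march=skylake-avx512). */
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

uint64_t classify_word(uint64_t word);  /* the PR's u64 bucket routine */

static int skim_avx512(const uint64_t *virgin, const uint64_t *current, size_t len) {
  const uint64_t *end = current + len / sizeof(uint64_t);

  for (; current < end; current += 8, virgin += 8) {
    __m512i value = _mm512_load_si512((const void *)current);

    /* Tier 1: vptestnmq sets a mask bit for each all-zero u64 lane;
       0xff means the whole 64-byte block is zero, so skip it. */
    __mmask8 zero_mask = _mm512_testn_epi64_mask(value, value);
    if (zero_mask == 0xff) continue;

    /* Tier 2: visit only the nonzero u64 lanes indicated by the mask. */
    unsigned nonzero = (unsigned)(uint8_t)~zero_mask;
    while (nonzero) {
      unsigned lane = (unsigned)__builtin_ctz(nonzero);
      nonzero &= nonzero - 1;

      /* Tier 3 (simplified): classify the lane in a register and check it
         against the virgin map; a nonzero overlap means new behavior. */
      if (classify_word(current[lane]) & virgin[lane]) return 1;
    }
  }
  return 0;
}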

Implementation

The first commit refactors base coverage-processing routines into two new files. The 32-bit and 64-bit versions are separated for clarity.

The second commit disables classification in the hot path. For save_if_interesting, the classification of trace bits is skipped; for the other (cold) functions that obtain coverage, the classification is preserved.

The last commit completes the hot path with acceleration. A new function, has_new_bits_unclassified, is introduced. It behaves like classify_counts followed by has_new_bits, with redundant computation removed and vectorization added.
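
A hypothetical call-site sketch of the resulting hot path (the function name execution_is_interesting, the prototypes, and the exact placement are assumptions for illustration, not the literal PR diff):

#include <stdint.h>

extern uint8_t *trace_bits;                       /* raw counters from the target */
extern uint8_t *virgin_bits;                      /* unseen-behavior map */
int  has_new_bits_unclassified(uint8_t *virgin);  /* added by the last commit */
void classify_counts(uint64_t *mem);              /* original classification pass */

static int execution_is_interesting(void) {
  /* Hot path: skim, classify in registers, and check virgin in one pass. */
  if (!has_new_bits_unclassified(virgin_bits)) return 0;

  /* Cold path: a classified map is still useful for hashing/queueing,
     so produce it now, only for the rare interesting executions. */
  classify_counts((uint64_t *)trace_bits);
  return 1;
}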

Result

Configuration: Clang 11.0.0, Xeon Gold 6148 (SKX)

Simulation: 30 new-coverage samples, each replayed 25,614 times (in my case the discovery rate is 1 input per 25,614 executions)

   938 ms: AVX512  (-O3 -march=skylake-server)
 1,351 ms: AVX2    (-O3 -march=skylake)
 3,623 ms: NoSIMD  (-O3)
 6,597 ms: Vanilla (-O3 -DSLOW)

µarch Analysis

# llvm-mca perf.s --iterations=10000 -mcpu=skylake-avx512

[0] Code Region

Iterations:        100000
Instructions:      700000
Total Cycles:      187049
Total uOps:        800000

Dispatch Width:    6
uOps Per Cycle:    4.28
IPC:               3.74
Block RThroughput: 1.3

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.25                        addq  $64, %rax
 1      1     0.25                        cmpq  %rax, %r11
 1      1     0.50                        je    .LBB3_11
 2      8     0.50    *                   vmovdqa64 (%r8,%rax), %zmm0
 1      4     1.00                        vptestnmq %zmm0, %zmm0, %k0
 1      3     1.00                        kortestb  %k0, %k0
 1      1     0.50                        jb    .LBB3_18

Resources:
[0]   - SKXDivider
[1]   - SKXFPDivider
[2]   - SKXPort0
[3]   - SKXPort1
[4]   - SKXPort2
[5]   - SKXPort3
[6]   - SKXPort4
[7]   - SKXPort5
[8]   - SKXPort6
[9]   - SKXPort7

Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    
 -      -     1.76   1.74   0.50   0.50    -     1.74   1.76    -     

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
 -      -     0.02   0.35    -      -      -     0.33   0.30    -     addq  $64, %rax
 -      -      -     0.78    -      -      -     0.13   0.09    -     cmpq  %rax, %r11
 -      -     0.24    -      -      -      -      -     0.76    -     je    .LBB3_11
 -      -     0.11   0.61   0.50   0.50    -     0.28    -      -     vmovdqa64 (%r8,%rax), %zmm0
 -      -      -      -      -      -      -     1.00    -      -     vptestnmq %zmm0, %zmm0, %k0
 -      -     1.00    -      -      -      -      -      -      -     kortestb  %k0, %k0
 -      -     0.39    -      -      -      -      -     0.61    -     jb    .LBB3_18
vanhauser-thc commented 3 years ago

There seems to be a bug in the code. I tried https://github.com/hghwng/AFL/tree/skim against a target that has 100% stability with both google/afl and afl++. With this patch it degrades to 85%, so there is an issue somewhere.

vanhauser-thc commented 3 years ago

Also, I did not notice an impact on execs_done when run for the same amount of time. However, that could be because of extra code paths taken due to falsely detected variable paths.

hghwng commented 3 years ago

@vanhauser-thc Thanks for the note! My own experimental code is messy, and I had a difficult time organizing it for this PR. Now the instability bug is fixed.

As for "execs_done", can you please show your experiment setup? In my case, I ran the vectorized version and the baseline version for 10s with UI disabled. The overall speedup ranges from 5% to 20% with AVX2 enabled. More speedup can be observed if the coverage region is sparse or AVX512 is enabled.

The rationale behind the short execution time is to reduce randomness introduced by other parts of AFL. Although "execs_done" is an intuitive and conclusive metric, it is determined by both (a) the execution speed of the target program and (b) AFL itself. In my observation, the vectorized version is always faster in the beginning, so it discovers more seeds than the baseline version. However, because the initial seeds are simple, the newly discovered seeds are usually slower to execute. Therefore, more time is spent in the target program, and the vectorized version can appear much slower on some specific seeds.

Consequently, I prefer micro-benchmarking because it isolates this noise.

vanhauser-thc commented 3 years ago

I can confirm this fixes the stability issue. My testing was broader, to see if it has a real impact on fuzzing. I believe you that it is always more performant, though the impact on overall fuzzing performance seems to be very small. But nonetheless, more speed is always better. How do you think this will perform on ARMv5 with NEON and on AARCH64? Or on AMD Ryzen processors: better, the same, or worse?

jonathanmetzman commented 3 years ago

Thanks! I'll try to review this soon, but I'm not sure I can land this before the new year due to the holidays, etc.

hghwng commented 3 years ago

@jonathanmetzman No hurry for landing it -- have a good time on your holidays :smile: Looking forward to your review!

vanhauser-thc commented 3 years ago

@hghwng I added your patch to afl++, although I had to make several adjustments as we have dynamically sized trace maps. It will be merged in the next days.

I did a fuzzbench run https://www.fuzzbench.com/reports/experimental/2020-12-18/index.html which tested this (among 2 other things), and it looks like it also slightly improved overall.

Not sure how this performs on ARM systems though.

hghwng commented 3 years ago

@vanhauser-thc Thanks for the comprehensive evaluation on FuzzBench! As for the performance on ARM systems, I cannot give you a definite number due to the lack of hardware, but from a theoretical perspective I think the performance gain persists. My patch is more than vectorization: it uses a one-pass design for most cases, removing the redundant computation of the original two-pass algorithm. As the "NoSIMD" vs "Vanilla" data shows, the performance gain can still be observed even without SIMD support.

hghwng commented 3 years ago

@mboehme Big thanks for your insightful comments! I've refined the documentation (see ebedb33) following your suggestion. As for the implementation of classify_word, I prefer the memcpy version.

As Compiler Explorer shows, the generated assembly is identical for both clang and gcc under -O3. In fact, no memcpy calls are emitted by either clang or gcc even without enabling any optimization. More interestingly, with Compiler Explorer's default configuration (gcc -O2), llvm-mca estimates that the memcpy version requires only 1.61 cycles per iteration, while the pointer version requires 2.09 cycles. Quick Bench also confirms a ~1% performance gain for the memcpy version.

The memcpy version is more optimization-friendly. Modern compilers replace libc's memcpy with their own internal versions to expose optimization opportunities (clang, gcc). In this case, the fixed-size memcpy invocations are easily replaced with direct memory references and optimized by mem2reg passes.

However, the pointer-based version has side effects and requires pointer analysis for the compiler to optimize it correctly. For example, the compiler must know that word[0] and word[1] are not aliased pointers, so that writing to word[0] does not affect the value of word[1]. In this case, while the argument (word) is passed in a register (rdi), the result is spilled to the stack (rsp-4 and rsp-2) so that its address can be taken for the writes.
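
To make the two variants concrete, here is an illustrative sketch (the bodies are mine, not the PR's exact code; count_class_lookup16 refers to AFL's 16-bit bucket table built at startup):

#include <stdint.h>
#include <string.h>

extern uint16_t count_class_lookup16[65536];  /* AFL's 16-bit bucket table */

/* memcpy version: fixed-size copies are lowered to plain loads/stores,
   so the compiler needs no aliasing analysis. */
static uint64_t classify_word_memcpy(uint64_t word) {
  uint16_t parts[4];
  memcpy(parts, &word, sizeof(parts));
  for (int i = 0; i < 4; i++) parts[i] = count_class_lookup16[parts[i]];
  memcpy(&word, parts, sizeof(word));
  return word;
}

/* pointer version: writing through the cast forces the compiler to reason
   about aliasing, which is what can spill the value to the stack. */
static uint64_t classify_word_ptr(uint64_t word) {
  uint16_t *parts = (uint16_t *)&word;
  for (int i = 0; i < 4; i++) parts[i] = count_class_lookup16[parts[i]];
  return word;
}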

Personally, I like the memcpy version because it looks pretty :-p

mboehme commented 3 years ago

Interesting! I changed your four casts to a single cast and can confirm for -O3 in Compiler Explorer the assembly is equivalent for GCC 10.1 and Clang 10. The performance gain in quickbench is spurious. When I switch the order of the calls to BENCHMARK, the naive version comes out about 1% faster.

Without optimizations (-O0), the naive version is 20% faster on GCC 10.1 and 10% faster on Clang 10. This difference remains when I switch the order of calls to BENCHMARK.

vanhauser-thc commented 3 years ago

Yes, memcpy is inlined unless compiled otherwise. However, some people compile with old gcc versions, or even non-gcc compilers such as Intel's icc, so it is not a given that it is always inlined.

IMHO the memcpy is fine because we mainly want to cater to people using the most current compilers, while making sure it still works for everyone, even on outdated systems. However, on outdated systems people should not expect to have top performance anyway.

vanhauser-thc commented 3 years ago

Interesting! I changed your four casts to a single cast and can confirm for -O3 in Compiler Explorer the assembly is equivalent for GCC 10.1 and Clang 10. The performance gain in quickbench is spurious. When I switch the order of the calls to BENCHMARK, the naive version comes out about 1% faster.

Without optimizations (-O0), the naive version is 20% faster on GCC 10.1 and 10% faster on Clang 10. This difference remains when I switch the order of calls to BENCHMARK.

gcc is so over-optimized that simple changes can have a huge impact on the generated assembly. llvm is way less surprising. It usually gives a tiny bit less performance, however I have never seen it remove security checks - which I have experienced with gcc ...

mboehme commented 3 years ago

I can also confirm that the performance of the naive and memcpy versions at -O3 -- from the oldest versions of Clang (3.8) and GCC (5.5) available on quickbench -- is equivalent.

hghwng commented 3 years ago

@vanhauser-thc The fact that security checks can be optimized out is especially interesting! Can you show some examples?

@mboehme Do you think I can mark these memcpy-related comments as resolved? BTW, I speculate that the performance impact of reordering can be explained by a changed code layout (e.g., the 64-byte window of the Decoded Stream Buffer). When doing micro-benchmarks, it's always difficult to explain the exact result unless you get down to micro-architectural analysis.

mboehme commented 3 years ago

Do you think I can mark these memcpy-related comments as resolved?

Sure. Fine with me. I like your PR.

I'm wondering: you are still classifying all counts, right? If there is no difference, you classify all counts during skimming. If there is a difference, you classify some counts during skimming, fall back "to the old logic", and call classify_counts to classify all counts. Couldn't this be further optimized by a single call to classify_counts before skimming?

hghwng commented 3 years ago

Sure. Fine with me. I like your PR.

Thank you! I also like your work on improving the directed fuzzing algorithms, which accelerates fuzzing from another interesting and important perspective.

I'm wondering: you are still classifying all counts, right? If there is no difference, you classify all counts during skimming. If there is a difference, you classify some counts during skimming, fall back "to the old logic", and call classify_counts to classify all counts. Couldn't this be further optimized by a single call to classify_counts before skimming?

Classifying the counters first still needs two passes: even with skimming enabled, the second pass still requires re-reading the trace bits. Because memory access is much slower than registers (which this PR uses), I think classifying the counts first would be slower. As for the extra work performed by this PR, the acceleration in the hot path generally outweighs the cost of the fallback, as the FuzzBench result demonstrates.

vanhauser-thc commented 3 years ago

This PR results in segfaults when -march=native is used on Intel Xeon CPUs. Super weird. I don't fault this code but the optimization. It happens with both clang and gcc, though.

hghwng commented 3 years ago

@vanhauser-thc I speculate that the 64-byte alignment requirement is not satisfied. I'm not familiar with how AFL++ allocates the trace bits, but please ensure the data layout with compiler hints and APIs such as __attribute__((aligned)) and aligned_alloc. For AFL, the shmat-based allocation is aligned at page granularity.
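
For illustration, two ways to guarantee the 64-byte alignment mentioned above (names and sizes are placeholders; the real AFL/AFL++ allocation code differs):

#include <stdint.h>
#include <stdlib.h>

#define MAP_SIZE_SKETCH 65536

/* Static map: force alignment with a compiler attribute. */
static uint8_t trace_bits_static[MAP_SIZE_SKETCH] __attribute__((aligned(64)));

/* Dynamic map: C11 aligned_alloc; the size must be a multiple of the alignment. */
static uint8_t *alloc_trace_bits(size_t size) {
  return (uint8_t *)aligned_alloc(64, size);
}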

vanhauser-thc commented 3 years ago

@hghwng I thought it was 32 bytes, but if you say 64, then this explains why it fails.