Closed oleavr closed 3 years ago
what was the performance gain of this PR?
~3x on my Heartbleed benchmark. We are 50% of the speed of afl-clang-fast (6k exec/sec vs 12k exec/sec)
I imported it to our util/afl_frida implementation, which is based on yours, and it got a 2% speed increase on the test-lib testcase.
That testcase runs at 60% of the speed of the afl-clang-fast variant.
A real target should see a larger speed gain, since with such a small number of edges the patch has little impact.
btw the afl-clang-fast variant only traverses 7 edges, the frida one 47. Not sure why there are so many more edges with frida. When I try the testcase in qemu_mode it has 32 edges - but that one has an additional main with read(), if (len ...), etc.
@oleavr could that be a bug in frida?
@vanhauser-thc
> btw afl-clang-fast variant only traverses 7 edges, the frida one 47. Not sure why there are many more edges in frida. When I try the testcase on qemu_mode it has 32 edges - but has an additional main with read() if (len ...) etc.

Interesting. I believe it's because of this:

```c
gum_stalker_follow_me(stalker, transformer, NULL);

while (__afl_persistent_loop(1000000) != 0) {
  len = read(STDIN_FILENO, data, sizeof(data) - 1);
  if (len > 0) {
    data[len] = 0;
    box(data);
  }
}

gum_stalker_unfollow_me(stalker);
```

Where there's quite a bit of code between `follow_me()` and `box()`. I.e. the basic blocks of this code, `__afl_persistent_loop()`, and `read()`, are instrumented in addition to `box()`.
Makes total sense :) Is there a way to exclude the `__afl_persistent_loop()` part without having a performance impact?
> Is there a way to exclude the `__afl_persistent_loop()` part without having a performance impact?

It's possible to use `gum_stalker_deactivate(stalker)` right after `gum_stalker_follow_me(...)`, `gum_stalker_activate(stalker, box)` just before `box(data)`, and `gum_stalker_deactivate(stalker)` right after it. However, these other things are probably worth doing before that:

- `gum_stalker_set_trust_threshold(stalker, 0)` – in cases where one knows there's no self-modifying code. This means branches between basic blocks can be optimized right away, so there's no overhead in most cases. Indirect branches may be expensive in case of inline cache misses, but that's something we should investigate and improve when discovered on real-world targets. (This code has barely been tuned since the inline caching was implemented, so I suspect there are still some low-hanging fruits there.)
- Putting `box()` and the code to be instrumented in a different module from this code, so that the transformer callback in `basic_block.c` can easily avoid adding instrumentation to it. Hotwax doesn't do this currently, as `hotwax_lib` is a static library.
- `gum_stalker_exclude(stalker, range)` to carefully tell Stalker that certain ranges are "boring", so any CALL instructions going there should be used as-is (just relocated) instead of following execution there. This might be the libc, for example, but if you know that you only care about code inside some library, you can exclude the two ranges before and after that library (i.e. `[0, library_start)` and `[library_end, G_MAXSIZE]`).

Sorry for the lack of documentation by the way – Stalker needs a lot more docs :)
This is a fun experiment to avoid the full context-switch that `put_callout()` implies, where we instead CALL a nearby custom version of `afl_maybe_log()`, modified to not clobber any flags or registers. I don't know yet if this actually results in a worthwhile speed-up.

The reason I didn't fully inline the function is that I suspect it might be more likely to end up in the CPU cache, and there's less added bloat per basic block. It might be possible to reduce that further, but I didn't want to hurt my brain too much in this first attempt 😅