Closed oleavr closed 3 years ago
what was the performance gain of this PR?
~3x on my Heartbleed benchmark. We are 50% of the speed of afl-clang-fast (6k exec/sec vs 12k exec/sec)
I imported it to our util/afl_frida implementation, which is based on yours, and it got a 2% speed increase on the test-lib testcase.
That testcase runs at 60% of the speed of the afl-clang-fast variant.
A real target should see a larger speed gain, since with such a small number of edges the patch has little impact.
btw the afl-clang-fast variant only traverses 7 edges, the frida one 47. Not sure why there are so many more edges with frida. When I try the testcase in qemu_mode it has 32 edges - but that one has an additional main with read(), if (len ...), etc.
@oleavr could that be a bug in frida?
@vanhauser-thc
> btw afl-clang-fast variant only traverses 7 edges, the frida one 47. Not sure why there are many more edges in frida. When I try the testcase on qemu_mode it has 32 edges - but has an additional main with read() if (len ...) etc.

Interesting. I believe it's because of this:

```c
gum_stalker_follow_me(stalker, transformer, NULL);

while (__afl_persistent_loop(1000000) != 0) {
  len = read(STDIN_FILENO, data, sizeof(data) - 1);
  if (len > 0) {
    data[len] = 0;
    box(data);
  }
}

gum_stalker_unfollow_me(stalker);
```

Where there's quite a bit of code between `follow_me()` and `box()`. I.e. the basic blocks of this code, `__afl_persistent_loop()`, and `read()`, are instrumented in addition to `box()`.
Makes total sense :) Is there a way to exclude the `__afl_persistent_loop()` part without having a performance impact?
> Is there a way to exclude the `__afl_persistent_loop()` part without having a performance impact?

It's possible to use `gum_stalker_deactivate(stalker)` right after `gum_stalker_follow_me(...)`, `gum_stalker_activate(stalker, box)` just before `box(data)`, and `gum_stalker_deactivate(stalker)` right after it. However, these other things are probably worth doing before that:

- `gum_stalker_set_trust_threshold(stalker, 0)` – in cases where one knows there's no self-modifying code. This means branches between basic blocks can be optimized right away, so there's no overhead in most cases. Indirect branches may be expensive in case of inline cache misses, but that's something we should investigate and improve when discovered on real-world targets. (This code has barely been tuned since the inline caching was implemented, so I suspect there are still some low-hanging fruits there.)
- Putting `box()` and the code to be instrumented in a different module from this code, so that the transformer callback in `basic_block.c` can easily avoid adding instrumentation to it. Hotwax doesn't do this currently, as `hotwax_lib` is a static library.
- `gum_stalker_exclude(stalker, range)` to carefully tell Stalker that certain ranges are "boring", so any CALL instructions going there should be used as-is (just relocated) instead of following execution there. This might be the libc, for example, but if you know that you only care about code inside some library, you can exclude the two ranges before and after that library (i.e. `[0, library_start)` and `[library_end, G_MAXSIZE]`).

Sorry for the lack of documentation by the way – Stalker needs a lot more docs :)
This is a fun experiment to avoid the full context-switch that `put_callout()` implies, where we instead CALL a nearby custom version of `afl_maybe_log()`, modified to not clobber any flags or registers. I don't know yet if this actually results in a worthwhile speed-up.

The reason I didn't fully inline the function is that I suspect it might be more likely to end up in the CPU cache, and there's less added bloat per basic block. It might be possible to reduce that further, but I didn't want to hurt my brain too much in this first attempt 😅