Closed pd-fkie closed 3 years ago
Interesting, this is excellent. We considered using a similar approach, via a modified CPython runtime rather than after-the-fact bytecode editing. We ultimately decided on an extension in an attempt to make Atheris as easy-to-use as possible. (It's already complicated enough.)
A 5x performance improvement is quite significant. Do you know if it's reasonable to transform code objects at runtime rather than by modifying .py / .pyc files? I assume so. If so, that would probably be the best way to do this. That way, Atheris can still be an extension providing any shared code, but coverage calls can be added where needed without a tracer.
This also means we can support more interesting tracing (like string comparison) in Python versions before 3.8, which would be excellent. I was considering implementing better regular expression coverage via monkey-patching anyway.
Could you explain more about what you meant by "It does not support...differential fuzzing since the counters of each module start at 0"?
Transforming code objects of modules is perfectly possible.
I added a new method instrument(module)
which gets a module object
as a parameter, retrieves the code object of the module, instruments the code
and creates a new module with the same name but with the instrumented code.
This can look e.g. like this:
import sys
import pefile
import lf
def test_one_input(data):
...
if __name__ == "__main__":
lf.instrument(pefile) # <-- NEW
lf.fuzz(test_one_input, sys.argv)
(See the latest version of my POC)
Could you explain more about what you meant by "It does not support...differential fuzzing since the counters of each module start at 0"?
I meant that there cannot be more than one instrumented module in the module list at a time. But this problem has been solved by the in-memory instrumentation.
Update: After adding instrumentation for data-flow tracing the performance benefit dropped from 5x to 2x.
Just wanted to update you, this is on our roadmap :)
Question: would you like to merge this into Atheris, or would you prefer we do it? If you want to do it, could you provide a well-documented PR?
At the moment I wouldn't like to merge it into atheris because
but I am happy to create a PR when all of those deficiencies have been fixed.
Yes, absolutely!
Hello atheris team,
I would like to propose an improvement of atheris that can at least double the execution speed of your fuzzer.
In atheris you use
sys.settrace
for coveragecollection but this is not the fastest approach.
It is possible to instrument .py files the way AFL
instruments .c files in order to get rid of all the runtime overhead.
At the beginning of each basic
block a call to
atheris.log(idx)
can be insertedthat like
__afl_maybe_log(idx)
treatsidx
as anoffset into the counter-region and increments the
corresponding byte. This way the number of instrumented
locations is known at compile-time and a call to
atheris.reg(num)
can be inserted at the very beginningof the module that tells atheris how many counters
this module needs.
At the beginning of a fuzz target where all modules
are imported atheris collects the
atheris.reg
callsand keeps track of the overall number of counters needed.
In
atheris.Fuzz()
a memory region of a suitable sizecan be allocated and used as the region for counters.
This is perfectly compatible with your
approach of fuzzing C/C++ extensions. The extensions
just have to be built with
-fsanitize-coverage=inline-8bit-counters
.However this is not perfect. It does not support
But I can image these limitations are not hard to overcome.
Here is a POC I've built that implements the concept described above: python-fuzz-poc I've used the POC to fuzz some libraries and found quite some bugs with it and the results are very promising. On average I get a 5x performance boost in contrast to
sys.settrace
.I am very interested in your opinion about this approach. What do you think of this idea?