Speeding up instrumentation

jwilk commented 2 years ago

As indicated in the README, the instrumentation is slow at the moment.

Here are some rough ideas for how to speed it up:

Replace sys.settrace() with lower-level PyEval_SetTrace.
Rewrite bytecode to inject instrumentation. (Perhaps use the bytecode module?)
Rewrite AST to inject instrumentation. (See how it's done in pytest.)

(I don't plan to work on any of these, unless there's funding for the work.)
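To make the AST idea more concrete, here is a minimal sketch using only the standard `ast` module. It is not how python-afl or pytest do it: the `_hit` probe, the `hits` list (a stand-in for afl's shared-memory map) and the `InjectProbes` class are all hypothetical names, and only the top-level statements of each function body are instrumented.

```python
import ast

hits = []  # stand-in for afl's shared-memory coverage map


def _hit(location):
    """Hypothetical probe: record that the statement at `location` ran."""
    hits.append(location)


class InjectProbes(ast.NodeTransformer):
    """Insert a `_hit(lineno)` call before each top-level statement of a function."""

    def visit_FunctionDef(self, node):
        self.generic_visit(node)  # instrument nested functions first
        new_body = []
        for stmt in node.body:
            probe = ast.Expr(
                value=ast.Call(
                    func=ast.Name(id="_hit", ctx=ast.Load()),
                    args=[ast.Constant(value=stmt.lineno)],
                    keywords=[],
                )
            )
            new_body.append(ast.copy_location(probe, stmt))
            new_body.append(stmt)
        node.body = new_body
        return node


source = """
def classify(n):
    if n < 0:
        return "negative"
    return "non-negative"
"""

tree = InjectProbes().visit(ast.parse(source))
ast.fix_missing_locations(tree)
namespace = {"_hit": _hit}
exec(compile(tree, "<instrumented>", "exec"), namespace)
result = namespace["classify"](5)
```

The probes are ordinary function calls compiled into the code, so no trace function runs at execution time. A real implementation would also have to handle docstrings, nested blocks and import hooks, which this sketch ignores.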

tovmeod commented 1 year ago

From what I understand, the trace function determines which files are covered, and the current implementation just tracks coverage for all Python code, including installed libraries and the standard library. I believe it could skip the libraries by default (or at least make this configurable), or maybe it could be possible to configure which modules should be covered. How should this be defined? From an environment variable that defines the path prefix of the module, e.g. MODULE_TO_COVER='/home/user/projectsrc/myapp'? Then the trace function would have: if not filename.startswith(os.environ["MODULE_TO_COVER"]): return None

With that, I think the run should speed up, and it should also give afl a smaller attack surface.
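A minimal sketch of such a prefix filter, written against plain `sys.settrace` rather than python-afl's actual trace function. `make_tracer`, `run_traced` and the `record` callback are hypothetical names, and the `/proj/` paths are made up for the demonstration.

```python
import sys


def make_tracer(prefix, record):
    """Build a trace function that records lines only for files under `prefix`."""

    def local_trace(frame, event, arg):
        if event == "line":
            record(frame.f_code.co_filename, frame.f_lineno)
        return local_trace

    def global_trace(frame, event, arg):
        # Invoked once for each new frame. Returning None switches off line
        # tracing for that frame, so a skipped file costs one check per call.
        if frame.f_code.co_filename.startswith(prefix):
            return local_trace
        return None

    return global_trace


def run_traced(filename, prefix):
    """Run a two-line snippet as if it lived in `filename`; return recorded lines."""
    seen = []
    code = compile("x = 1\ny = 2\n", filename, "exec")
    sys.settrace(make_tracer(prefix, lambda name, line: seen.append((name, line))))
    try:
        exec(code, {})
    finally:
        sys.settrace(None)
    return seen


inside = run_traced("/proj/app.py", "/proj/")
outside = run_traced("/lib/other.py", "/proj/")
```

In a real setup the prefix could come from something like `os.environ["MODULE_TO_COVER"]`, as suggested above. Doing the check in the per-frame function rather than on every line event keeps the filtering cost proportional to the number of calls.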

jwilk commented 1 year ago

> From what I understand, the trace function determines which files are covered, and the current implementation just tracks coverage for all Python code, including installed libraries and the standard library.

This is correct.

> I believe it could skip the libraries by default (or at least make this configurable), or maybe it could be possible to configure which modules should be covered.

There's a TODO for this:

    # TODO: make it configurable which modules are instrumented, and which are not

But I'm afraid that the cost of the extra check could easily exceed the savings from skipping instrumentation.
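One way to get a rough number for that extra check is a micro-benchmark with `timeit`. This is only a sketch: the prefix and filename are hypothetical, and it measures the string comparison in isolation, not the real trace function.

```python
import timeit

# Hypothetical values; substitute the real project prefix and a typical
# library filename from an actual run.
prefix = "/home/user/projectsrc/myapp"
filename = "/usr/lib/python3/json/decoder.py"


def with_check():
    # The extra work a filtering trace function would do on each event.
    return filename.startswith(prefix)


def without_check():
    # Baseline: same call overhead, no string comparison.
    return False


runs = 200_000
cost_with = timeit.timeit(with_check, number=runs)
cost_without = timeit.timeit(without_check, number=runs)
extra_ns = (cost_with - cost_without) / runs * 1e9
print(f"extra cost per check: about {extra_ns:.0f} ns")
```

The result would then have to be compared against the per-line cost of the instrumentation that gets skipped, which depends on the target and is not measured here.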

tovmeod commented 1 year ago

I'm running some tests with a sample project. This is basically what I changed in the trace function:

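    if _module_path is not None:
        if filename.startswith("."):
            # Relative path such as "./pkg/mod.py": drop the leading dot.
            filename = filename[1:]
        elif filename.startswith(_module_path):
            # Absolute path inside the project: drop the project prefix.
            filename = filename[len(_module_path):]
        elif filename.endswith("fuzzer.py"):
            # Keep recording the wrapper so afl still sees instrumentation.
            pass
        else:
            # Anything else (stdlib, installed libraries) is not recorded.
            return trace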

where _module_path is a global variable whose value is passed to _init; it is expected to be something like "/home/user/projroot".

Note that I remove the prefix from the filename. I noticed that the filename sometimes uses the full path and sometimes ./, which means the traces would differ. I didn't really debug this to understand why or when.

"fuzzer.py" is the fuzzer.py file passed to py-afl-fuzz. I don't really care about coverage for the wrapper, but if it is not traced, afl thinks the binary has no instrumentation.
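An alternative to stripping prefixes by hand is to resolve every filename against the working directory and then express it relative to the project root. The sketch below assumes POSIX paths; `normalize` and its parameters are hypothetical and not part of python-afl.

```python
import os


def normalize(filename, project_root, cwd):
    """Return `filename` relative to `project_root`, or None if it lies outside."""
    # os.path.join keeps `filename` unchanged when it is already absolute.
    absolute = os.path.normpath(os.path.join(cwd, filename))
    root = os.path.normpath(project_root)
    if absolute == root or absolute.startswith(root + os.sep):
        return os.path.relpath(absolute, root)
    return None


root = "/home/user/projroot"
relative_form = normalize("./pkg/mod.py", root, cwd=root)
absolute_form = normalize("/home/user/projroot/pkg/mod.py", root, cwd=root)
library_file = normalize("/usr/lib/python3/os.py", root, cwd=root)
```

Both spellings of the same file end up as the same key, so they would hash to the same map location, and files outside the root are reported as None so the caller can skip them.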

I'm getting exec speeds of up to ~5k, but the stability is very low (less than 5%), and it says "no new instrumentation output" for a lot of the initial seed corpus. Maybe I'm doing something wrong here.

I also changed the trace to something more naive: afl_area[location] += 1, hoping it would find some interesting inputs faster, which I could maybe use as input for a run with the regular trace. It does improve coverage faster, but I still only get 30 favored items and 34 new edges after 3.5M execs.

The project has ~90k LOC, so I'm thinking I should increase the map size. I see python-afl uses a 32-bit uint (a lot less than 90k). I couldn't find what the default map size for afl is or how to set its size. I also see I should use a 64-bit hash function.
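For comparison, here is a sketch of the naive per-location counter next to the edge-counting scheme described in AFL's technical documentation (`shared_mem[cur ^ prev]++; prev = cur >> 1`). That python-afl's regular trace matches this in every detail is an assumption, and the location values below are made up. `MAP_SIZE` is set to AFL's default of 2^16.

```python
MAP_SIZE = 1 << 16  # AFL's default map size (MAP_SIZE_POW2 = 16)


def block_coverage(locations):
    """Naive variant: one counter per location, order of execution is lost."""
    area = bytearray(MAP_SIZE)
    for cur in locations:
        idx = cur % MAP_SIZE
        area[idx] = (area[idx] + 1) & 0xFF  # 8-bit counters wrap around
    return area


def edge_coverage(locations):
    """AFL-style variant: one counter per (previous, current) transition."""
    area = bytearray(MAP_SIZE)
    prev = 0
    for cur in locations:
        cur %= MAP_SIZE
        idx = cur ^ prev
        area[idx] = (area[idx] + 1) & 0xFF
        prev = cur >> 1  # the shift keeps A->B distinct from B->A
    return area


a_then_b = [10, 20]  # made-up location IDs
b_then_a = [20, 10]

same_blocks = block_coverage(a_then_b) == block_coverage(b_then_a)
same_edges = edge_coverage(a_then_b) == edge_coverage(b_then_a)
```

The two orderings are indistinguishable to the naive counter but produce different maps under edge counting, so the naive variant gives afl less signal about control flow even if it is cheaper per event.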

From what I understand afl expects to map blocks of code, not each line, so could we use a deterministic way to map each filename:lineno instead of hashing and truncating the hash?
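One deterministic mapping would be to give each file a contiguous block of IDs, one per line, computed up front from a sorted file list. This is a sketch of the idea only: `build_offsets`, `location_id` and the line counts are hypothetical, and nothing like this exists in python-afl as far as this thread shows.

```python
def build_offsets(line_counts):
    """Assign each file the first ID of its block of per-line IDs.

    `line_counts` maps filename -> number of lines. Files are processed in
    sorted order so the result does not depend on import or execution order.
    """
    offsets = {}
    next_id = 0
    for name in sorted(line_counts):
        offsets[name] = next_id
        next_id += line_counts[name]
    return offsets, next_id


def location_id(offsets, filename, lineno):
    """Map a 1-based line number in `filename` to its unique ID."""
    return offsets[filename] + (lineno - 1)


line_counts = {"pkg/b.py": 40, "pkg/a.py": 100}  # hypothetical project
offsets, total = build_offsets(line_counts)
first_line_of_b = location_id(offsets, "pkg/b.py", 1)
last_line_of_a = location_id(offsets, "pkg/a.py", 100)
```

IDs stay collision-free only while `total` fits in the map. With roughly 90k lines and AFL's default map of 2^16 entries, the IDs would have to wrap or the map would have to grow, and combining IDs with XOR for edge coverage can still produce collisions.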

Maybe I'm thinking about all this wrong. I'm currently fuzzing the whole project; should I be fuzzing each function separately?

jwilk / python-afl

Speeding up instrumentation #25