google / atheris

Apache License 2.0
1.35k stars 112 forks source link

Coverage from regexp #5

Closed alex closed 1 year ago

alex commented 3 years ago

I'm playing with using atheris on some code that uses regexp (stdlib re) for control flow. This is a common pattern for things like HTTP routing in Python.

Unfortunately, back the _sre module is in C (and isn't compiled with -fsanitize=fuzzer-no-link) there's no coverage for it. This means that atheris really struggles to make progress through it, effectively just throwing darts at it.

If atheris was somehow "regexp aware" that could alleviate this situation. I don't have a well thought out proposal, but off the cuff something that might work is:

In the tracer, if you are calling a method and self is a regexp instance, grab the .pattern attribute off of it, pass that to sre_parse.parse() and then take any repeated sequences of LITERAL opcodes and feed them into the dictionary. psuedocode:

def handle_re(r):
    ops = sre_parse.parse(r.pattern)
    idx = 0
    while idx < len(ops):
        op, val = ops[idx]
        if op == LITERAL:
            end_idx = find_next_non_literal_op(ops[idx+1:])
            insert_to_corpus("".join(v for _, v in ops[idx:end_idx]))
        elif op == SUBPATTERN:
            do_this_algorithm(val[3])
alex commented 3 years ago

If this is useful, here's a real version of that algorithm:

import sre_parse
import re

import pytest

def extract_literals(r):
    ops = sre_parse.parse(r.pattern)
    results = []
    extract_literals_from_ops(ops, results)
    return results

def extract_literals_from_ops(ops, results):
    i = 0
    while i < len(ops):
        op, val = ops[i]
        if op == sre_parse.LITERAL:
            start_i = i
            while i < len(ops) and ops[i][0] == sre_parse.LITERAL:
                i += 1
            results.append("".join(chr(c) for _, c in ops[start_i:i]))
            continue
        elif op == sre_parse.BRANCH:
            _, branches = val
            for branch in branches:
                extract_literals_from_ops(branch, results)
        elif op == sre_parse.SUBPATTERN:
            _, _, _, sub_ops = val
            extract_literals_from_ops(sub_ops, results)
        elif op == sre_parse.MAX_REPEAT:
            _, _, sub_ops = val
            extract_literals_from_ops(sub_ops, results)
        elif op == sre_parse.ASSERT or op == sre_parse.ASSERT_NOT:
            _, sub_ops = val
            extract_literals_from_ops(sub_ops, results)
        i += 1
    return results

@pytest.mark.parametrize(
    ("r", "expected"),
    [
        (r"^abc$", ["abc"]),
        (r"abc|def", ["abc", "def"]),
        (r"(abc|\d+)", ["abc"]),
        (r"(?:abc){3,}", ["abc"]),
        (r"(?:abc){,3}", ["abc"]),
        (r"(?=abc)", ["abc"]),
        (r"(?!abc)", ["abc"]),
        (r"(?<=abc)", ["abc"]),
        (r"(?<!abc)", ["abc"]),
    ]
)
def test_extract_literals(r, expected):
    actual = extract_literals(re.compile(r))
    assert actual == expected
Zac-HD commented 3 years ago

hypothesis.strategies.from_regex() can help if you want to write a fuzz harness which generates strings matching some regexp, or you could just use it to generate a seed corpus for the fuzzer.

TheShiftedBit commented 3 years ago

Hi Alex,

That does seem useful!

How do you propose it be integrated into libFuzzer?

alex commented 3 years ago

In the Tracer (https://github.com/google/atheris/blob/master/tracer.cc#L269) have it grab self off the frame, if isinstance(self, regexp) then run something akin to the logic I suggested.

TheShiftedBit commented 3 years ago

Oh, I'm asking about the second part: once we have the data, how do we inject it into libFuzzer? To do it as you suggest, I think I'd have to modify libFuzzer to expose the ability to programmatically inject new dictionary events. That's possible, but means that only users with up-to-date Clang will get the feature. Do you have any other ideas?

alex commented 3 years ago

Oh, sorry I misread you. Can we inject it via memcmp(value, value)? That basically causes it to go into teh dictionary, right?

TheShiftedBit commented 3 years ago

Looks like that might provide some benefit. Such comparisons end up in the "TableOfRecentCompares" (max size 32), which based on randomness might end up in a dictionary.

Will probably have to call that every time the regexp is invoked - but can probably cache the extracted data.

TheShiftedBit commented 3 years ago

As a stop-gap, it's pretty easy to build just the _sre module with coverage. Use the provided patches for linking libFuzzer into CPython, but also do the following.

First, configure it:

CC=clang CXX=clang++ ./configure

Then, find the right line in the generated Makefile:

Modules/_sre.o: $(srcdir)/Modules/_sre.c; $(CC) $(PY_BUILTIN_MODULE_CFLAGS) -c $(srcdir)/Modules/_sre.c -o Modules/_sre.o

Add -fsanitize=fuzzer-no-link right before -c

Then, run:

make LIBFUZZER_VERSION=path/to/libclang_rt.fuzzer_no_main-x86_64.a -j 100

Finally, use atheris_no_libfuzzer instead of atheris.

alex commented 3 years ago

Seems like it'd also be sensible to send a feature request to libFuzzer to allow dynamic additions to the dictionary.

TheShiftedBit commented 3 years ago

Already in progress! I've got a patch prepared, which I'll send to LLVM as soon as I've confirmed it works well for our use case. From my experience, simulating a memcmp(foo, foo) was not effective. It really requires dynamic dictionary injection. Atheris will be programmed to detect whether it was using a sufficiently new version of libFuzzer. If not, it will print a warning in the event that regular expressions are encountered.,

alex commented 3 years ago

Oh wonderful, FYI https://github.com/alex/httpfuzz is the context that led me down this rabbit hole.

On Mon, Dec 21, 2020 at 1:22 PM Ian Eldred Pudney notifications@github.com wrote:

Already in progress! I've got a patch prepared, which I'll send to LLVM as soon as I've confirmed it works well for our use case. From my experience, simulating a memcmp(foo, foo) was not effective. It really requires dynamic dictionary injection. Atheris will be programmed to detect whether it was using a sufficiently new version of libFuzzer. If not, it will print a warning in the event that regular expressions are encountered.,

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/google/atheris/issues/5#issuecomment-749123359, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAAGBAT4V5WIAWWWJI3XCDSV6G7NANCNFSM4UMQDZZA .

-- All that is necessary for evil to succeed is for good people to do nothing.

TheShiftedBit commented 3 years ago

This has proved to be somewhat effective. Here's the LLVM change: https://reviews.llvm.org/D93879

While somewhat effective, many regular expressions still fail to be reasonably navigated with this strategy. I think this can be improved by computing a string that will match the whole regular expression, and adding that to the dictionary.

alex commented 3 years ago

Yes, if there's a practical algorithm for generating a regexp match, that's probably even better.

On Mon, Dec 28, 2020 at 8:25 PM Ian Eldred Pudney notifications@github.com wrote:

This has proved to be somewhat effective. Here's the LLVM change: https://reviews.llvm.org/D93879

While somewhat effective, many regular expressions still fail to be reasonably navigated with this strategy. I think this can be improved by computing a string that will match the whole regular expression, and adding that to the dictionary.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/google/atheris/issues/5#issuecomment-751914333, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAAGBFTATLN7QE633G4JHDSXEVXTANCNFSM4UMQDZZA .

-- All that is necessary for evil to succeed is for good people to do nothing.

TheShiftedBit commented 3 years ago

If Python used true regular expressions, where everything is representable as a DFA, then it would be as simple as a Dijkstra's walk over the DFA. But, Python supports features like negative lookbehind, which are O(n^2) and cannot be represented in a DFA. It's even possible to write Python regular expressions that can never match a string. For example, a(?<!a)b matches "the string ab where b is not proceeded by a". If we exclude lookahead, lookbehind, and a few rarely used features, it should be easy. That might be good for an initial version.

TheShiftedBit commented 3 years ago

Alright, generating a string that matches a regex and adding it to the dictionary is highly effective.

alex commented 3 years ago

Fantastic! What benchmark are you using to establish that?

TheShiftedBit commented 3 years ago

So far, I've been testing whether Atheris can fuzz past a wide variety of regular expressions. With this latest version that generates matches, it can get past any expression I've come up with that doesn't use a zero-width match (e.g. lookahead, lookbehind, or word boundaries).

I've also realized, I can use Atheris to fuzz itself on this. I can write a fuzzer that ensures Atheris can take a regular expression and generate a matching string for any set of regular expression features I support.

However, it appears LLVM is reluctant to add LLVMFuzzerAddToDictionary() support to libFuzzer. https://reviews.llvm.org/D93879 I'm reluctant to work on this more until that is resolved.

alex commented 3 years ago

Sounds good -- FWIW you may also find the httpfuzz demo I linked above useful for this. Without the current manual dictionary, it makes no progress at all.

On Thu, Dec 31, 2020 at 12:34 PM Ian Eldred Pudney notifications@github.com wrote:

So far, I've been testing whether Atheris can fuzz past a wide variety of regular expressions. With this latest version that generates matches, it can get past any expression I've come up with that doesn't use a zero-width match (e.g. lookahead, lookbehind, or word boundaries).

I've also realized, I can use Atheris to fuzz itself on this. I can write a fuzzer that ensures Atheris can take a regular expression and generate a matching string for any set of regular expression features I support.

However, it appears LLVM is reluctant to add LLVMFuzzerAddToDictionary() support to libFuzzer. https://reviews.llvm.org/D93879 I'm reluctant to work on this more until that is resolved.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/google/atheris/issues/5#issuecomment-753012367, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAAGBFSIKKNYIFPSHLIJBLSXSY2PANCNFSM4UMQDZZA .

-- All that is necessary for evil to succeed is for good people to do nothing.

TheShiftedBit commented 2 years ago

I just pushed a new version that includes experimental support for this. It's opt-in for now, because there are some performance downsides, but it now works!

jvoisin commented 1 year ago

Let's close this issue, and make it enabled by default at some point :)