Improve existing oss-fuzz coverage

mvatsyk-lsg commented 9 months ago

Pull request checklist

Please check if your PR fulfills the following requirements:

[x] Tests for the changes have been added (for bug fixes / features)
[x] Docs have been reviewed and added / updated if needed (for bug fixes / features)
[ ] Build was run locally and without warnings or errors
[ ] All previous and new tests pass

Pull request type

Please check the type of change your PR introduces: - [ ] Bugfix - [x] Feature - [ ] Code style update (formatting, renaming, typo fix) - [ ] Refactoring (no functional changes, no api changes) - [ ] Build related changes - [ ] Documentation content changes - [ ] Other (please describe): ## What is the current behavior?

Related Issue URL: N/A

What is the new behavior?

Total number of fuzzers increased from 1 to 12
6 new fuzzers for decompressing supported archive types
5 new fuzzers for compressing data into supported archive types
Added LZMA Raw test files based on the existing .lzma ones

Does this introduce a breaking change?

[x] Yes
[ ] No

This pull request will temporarily break existing oss-fuzz setup until the oss-fuzz repo accepts a corresponding pull request with the updated fuzzer configuration on their side.

Other information

The improvements to the fuzzing setup were made as a part of Google ISE project.

JiaT75 commented 8 months ago

Hello!

Thanks for the PR. This is a great start to improving the fuzz testing. I will start with a few overall comments here and then add some more specific comments directly on the commits themselves.

First, we need to be sure that we are using the fuzz resources in the best way we can. Its easy to think of the OSS-Fuzz resources as unlimited, but each project can only be fuzzed so much. We should only include a fuzz target if it provides clear value and is testing an important part of liblzma that isn't being covered by a different fuzz target. Otherwise, less useful fuzz targets will take away compute time from the more useful ones. So, can you justify the reasoning behind each of the new fuzz targets? For instance, I am not sure that the raw encoder and decoder fuzz targets are useful since their important code paths are already covered by every other fuzz target. The raw coders don't have important header data, its just raw LZMA data. I am likely missing an important fuzz case, but in my mind I can think of three useful things to fuzz in our library:

Metadata encoding/decoding (magic bytes, file headers, block headers, lzma2 chunk headers, etc.)
Filter data encoding/decoding (LZMA1, BCJ, delta)
Check functions (CRC32, CRC64, SHA256)

Next, the code itself has a lot of repeated boilerplate. Each of the fuzz targets has very little unique code. For instance, this could be reorganized into a shared header file that provides a function for encoding and a function for decoding. These functions can take the coder init function (lzma_alone_decoder(), lzma_auto_decoder(), etc.) as a function pointer arg and any needed flags or options.

We could also consider fuzzing the various BCJ filters (x86, PowerPC, ARM64, etc). These filters are designed to be applied to executable data, but will be run on non-executable data very often. So its possible that there are hidden data corruption bugs on an unexpected input sequence since they are mostly tested on executable data, making it a good candidate for fuzz testing. These filters cannot be used as raw coders at this time, so they will have to be combined in a filter chain with LZMA1/2. If we want to look for data corruption bugs, we should encode a chunk, then decode it and compare if the decoded version exactly matches the original data.

For your commit messages, we like to keep a consistent format. When we release, our Changelog is generated automatically from the contents of the commit messages. Also it helps us maintain our codebase better when the commit messages are descriptive and clear. For your commits, please have them start with the category of what they are changing. For these, I would prepend "Tests:" to the first line of each commit. The first line of each commit should be a brief description of the purpose of the commit. The following lines should explain what was changed and why. Make sure to wrap the lines of the commit message to at most 73 characters since different commit log viewers may or may not wrap long lines and it helps keep a consistent look in our Changelog.

mvatsyk-lsg commented 8 months ago

@JiaT75 thanks a lot for the swift review! I will start implementing the suggested changes asap. I am going to focus on fixing the highlighted issues before implementing any new fuzzers, if that makes sense :)

To address some of your comments above:

The new fuzz targets were chosen to cover most of the liblzma common use cases from the code of XZ utils (i.e. we expanded the tests to cover the decoders and encoders used by the CLI utilities)
I completely agree that the fuzz targets' code can be reduced via templates and shared code. Would you like to simply move the common functionality into a separate header file, or to generate the fuzzers' code dynamically by replacing the lines in a template file as well?

JiaT75 commented 8 months ago

@JiaT75 thanks a lot for the swift review! I will start implementing the suggested changes asap. I am going to focus on fixing the highlighted issues before implementing any new fuzzers, if that makes sense :)

Makes perfect sense. I noted the BCJ filter fuzzing as option to consider. We don't necessarily need to implement it or implement it right away. Just an idea of something we could also be fuzzing if we agree the value is there.

I completely agree that the fuzz targets' code can be reduced via templates and shared code. Would you like to simply move the common functionality into a separate header file, or to generate the fuzzers' code dynamically by replacing the lines in a template file as well?

I think the simplest approach would be to use a common separate header file. Creating a dynamic template would take some extra build logic whereas an extra header file would only require updating the Makefile.

mvatsyk-lsg commented 8 months ago

@JiaT75 I've addressed your comments and tested the new Makefile and fuzz targets' code on a local setup. They seem to compile and work just fine. A quick question: should I go back and edit the description of all commits before the review, or will you be making a squash during the merge?

JiaT75 commented 8 months ago

@JiaT75 I've addressed your comments and tested the new Makefile and fuzz targets' code on a local setup. They seem to compile and work just fine. A quick question: should I go back and edit the description of all commits before the review, or will you be making a squash during the merge?

We like to keep our commits small and focused, so we will likely want more than one commit for this many changes. For now, don't worry about squashing your commits until the review is basically done. At the end we can figure out how many commits are appropriate for this and squash accordingly. So feel free to keep adding fix up commits as we go.

I'll start reviewing your new changes.

mvatsyk-lsg commented 8 months ago

A follow up on the redundant fuzzers: I ran the setup without them, and the coverage difference is indeed negligible. I am removing them from the pull request

mvatsyk-lsg commented 8 months ago

Apart from the possible re-addition of options files with max_len options back to the pull request, everything is ready for your review @JiaT75 !

mvatsyk-lsg commented 8 months ago

Also, I am now questioning whether the addition of .lzma_raw files is needed, since the corresponding fuzzers were removed from the pull request

JiaT75 commented 8 months ago

Also, I am now questioning whether the addition of .lzma_raw files is needed, since the corresponding fuzzers were removed from the pull request

Its safe to remove the .lzma_raw files and the tests/files/README changes.

Thanks for all the changes so far! I feel we are getting close to this being ready.

mvatsyk-lsg commented 8 months ago

Also, I am now questioning whether the addition of .lzma_raw files is needed, since the corresponding fuzzers were removed from the pull request

Its safe to remove the .lzma_raw files and the tests/files/README changes.

Thanks for all the changes so far! I feel we are getting close to this being ready.

Thank you for the review! I've reverted the changes and added the max_len=4096 to all fuzzer options.

Larhzu commented 8 months ago

Thanks to both of you for your work so far!

There are a few things I would like to understand better. I have only skimmed OSS-Fuzz's docs so I might be asking silly questions, sorry.

Seems that renaming a fuzz target requires renaming the accumulated corpora too.
Does adding more fuzzers mean that the project-specific fuzzing resources (processor time) will be divided between the fuzzers? With a quick look I didn't find any advice about resource usage in OSS-Fuzz docs and it's not discussed much in this thread either.
The value of code coverage in fuzzing is unclear. If extending coverage by a few simple lines of code could slow down fuzzing of more important parts of the code, does it make sense to extend fuzzing coverage in that case? I'm thinking of cases where an old-school code review shouldn't take a lot of time (code snippets that are about 200 lines each and do nothing unusually complicated). Or perhaps these should be fuzzed at first but disabled after some time if they find nothing?

Examples of remaining significant overlap in the new fuzzing targets:

fuzz_encode_alone.c would test end of payload marker (EOPM) encoding in LZMA but otherwise it doesn't test much that won't be tested by fuzz_encode_stream.c. They both use the LZMA encoder in the end. So it seems that fuzz_encode_alone.c isn't useful and could maybe even be harmful due to resource usage unless the fuzzers are smart enough to spot when code paths become identical.
fuzz_decode_alone.c splits into three different decoders depending on the input. Yet the three decoders are fuzzed separately too (stream, alone, lzip). So the only extra fuzzed thing is the small auto_decoder.c.

I don't know enough about the fuzzing methods to know what actually makes sense. I would like to be assured that adding all these fuzzers adds real value.

Thanks!

mvatsyk-lsg commented 8 months ago

Hi @Larhzu !

Seems that renaming a fuzz target requires renaming the accumulated corpora too.

In the existing setup, the corpora are generated dynamically in a build.sh file. So, any modifications have to be done in a separate pull request. After we merge this pull request, I will go ahead and update the latter one to properly reflect all the changes.

Does adding more fuzzers mean that the project-specific fuzzing resources (processor time) will be divided between the fuzzers? With a quick look I didn't find any advice about resource usage in OSS-Fuzz docs and it's not discussed much in this thread either.

I did not find any hard cap of the execution time for the OSS Fuzz itself. However, their CI integration, CIFuzz will divide the shared fuzzing time of 10 minutes between all fuzz targets in the project.

Getting back to the OSS Fuzz, each fuzz target will be run on a dedicated machine with 1 CPU and a cap of ~ 2GB RAM.

Since the fuzzers are written in C/C++, I doubt that introducing new fuzzers, at least for now, will decrease the overall quality of the fuzzing output. On my test setup inside a VM with similar hardware parameters, the fuzzing and the generation of an introspector report took around 5 minutes.

The value of code coverage in fuzzing is unclear. If extending coverage by a few simple lines of code could slow down fuzzing of more important parts of the code, does it make sense to extend fuzzing coverage in that case?

This absolutely makes sense. However, current fuzzing setup is very limited and covers only half of the lib (since --disable-encoders flag is used during the compilation). Its runtime coverage is 116/162 functions. The setup proposed in this pull request extends the fuzzing coverage to all common encoders and decoders to increase the runtime coverage to 270/360 functions.

fuzz_encode_alone.c would test end of payload marker (EOPM) encoding in LZMA but otherwise it doesn't test much that won't be tested by fuzz_encode_stream.c. They both use the LZMA encoder in the end. So it seems that fuzz_encode_alone.c isn't useful and could maybe even be harmful due to resource usage unless the fuzzers are smart enough to spot when code paths become identical.

fuzz_decode_alone.c splits into three different decoders depending on the input. Yet the three decoders are fuzzed separately too (stream, alone, lzip). So the only extra fuzzed thing is the small auto_decoder.c.

We can remove those, however this decreases the runtime fuzzing coverage from 270/360 to 249/360 functions. Should we proceed with deleting the fuzz targets?

JiaT75 commented 8 months ago

In the existing setup, the corpora are generated dynamically in a build.sh file. So, any modifications have to be done in a separate pull request. After we merge this pull request, I will go ahead and update the latter one to properly reflect all the changes.

I believe what was meant was that we have built up a very large corpus over the years on the fuzz.c fuzz target. Since that is renamed to fuzz_decode_stream.c in this PR, we would lose that large corpus if we do not take the proper steps to prevent that. We can either not rename this fuzz target or download a copy and restore it. I prefer the latter, and I have already downloaded a recent version of the corpus so it can be restored later.

Does adding more fuzzers mean that the project-specific fuzzing resources (processor time) will be divided between the fuzzers? With a quick look I didn't find any advice about resource usage in OSS-Fuzz docs and it's not discussed much in this thread either.

I did not find any hard cap of the execution time for the OSS Fuzz itself. However, their CI integration, CIFuzz will divide the shared fuzzing time of 10 minutes between all fuzz targets in the project.

I don't see us incorporating CIFuzz since features get integrated into OSS-Fuzz soon after they are committed anyways. The real question is how OSS-Fuzz divides up time between fuzz targets. I have not seen any description of this on the OSS-Fuzz online documentation so we would likely have to look into their internals to truly answer that question.

The value of code coverage in fuzzing is unclear. If extending coverage by a few simple lines of code could slow down fuzzing of more important parts of the code, does it make sense to extend fuzzing coverage in that case?

This absolutely makes sense. However, current fuzzing setup is very limited and covers only half of the lib (since --disable-encoders flag is used during the compilation). Its runtime coverage is 116/162 functions. The setup proposed in this pull request extends the fuzzing coverage to all common encoders and decoders to increase the runtime coverage to 270/360 functions.

The point here is that we don't want to over-emphasize the importance of code coverage. Fuzzing is computationally expensive so increasing the code coverage should only be done if we are increasing meaningful code coverage. I would much rather fuzz 1 complicated function that 10 simple ones.

So the goal shouldn't be to hit a certain percentage of code coverage. The goal should be to fuzz 100% of critical complicated code. And we don't expect you to know what all the critical complicated code in our project is, thats where we need to work together.

We do appreciate your efforts so far. I know it doesn't feel great to remove things, but in this case less is more.

The LZMA encoder certainly counts as critical complex code and that fuzz target adds a lot of value :)

fuzz_encode_alone.c would test end of payload marker (EOPM) encoding in LZMA but otherwise it doesn't test much that won't be tested by fuzz_encode_stream.c. They both use the LZMA encoder in the end. So it seems that fuzz_encode_alone.c isn't useful and could maybe even be harmful due to resource usage unless the fuzzers are smart enough to spot when code paths become identical.

fuzz_decode_alone.c splits into three different decoders depending on the input. Yet the three decoders are fuzzed separately too (stream, alone, lzip). So the only extra fuzzed thing is the small auto_decoder.c.

Each fuzz target needs to provide justifiable value outside of just extra code coverage. With the above points in mind, it feels safe to only consider the following fuzz targets:

fuzz_decode_alone. This test focuses on LZMA1 header/data fuzzing and EOPM handling.
fuzz_decode_lzip. This test focuses on LZIP header fuzzing. The LZIP header parsing is fairly simple and we have tests for it in the test framework, but maybe its still worth fuzzing.
fuzz_decode_stream_crc. This test focuses on portions not covered by fuzz_decode_stream. So mainly when the code leading to the check functions and the check functions themselves. The check functions are complicated and may deserve fuzzing, but the code leading to them may not really need fuzzing. So a more focused fuzz target that just calls the CRC and SHA256 functions directly could be more efficient since it avoids the LZMA decoding (which is covered by the other fuzzers).
fuzz_decode_stream. This test focuses on fuzzing .xz headers, block headers, index, etc. Additionally it fuzzes LZMA2 and LZMA1 decoding.
fuzz_encode_stream. This test focuses on encoding a .xz file including all of the things in fuzz_decode_stream, except on the encoder side.

There are a few areas where fuzzing could be expanded if we agree these are critical complex code paths:

BCJ Filters (already partially covered by fuzz_decode_stream)
Delta Filter (already partially covered by fuzz_decode_stream)
Different encoder settings (different match finders, dictionary sizes, LZMA properties, etc.). This could be accomplished by having an additional fuzz target for the encoder using preset 1. Instead of the default preset in fuzz_encode_stream.c we could use preset 5 since it should be a little faster (smaller nice_len) but have almost all the same settings.

The fuzzer machines do not have multiple cores, so unfortunately it doesn't make sense to fuzz the multithreaded stream encoder/decoder code. Otherwise that would be another candidate for critical complex code.

mvatsyk-lsg commented 8 months ago

@JiaT75 @Larhzu thank you both for your time and effort to merge this pull request! A couple of updates on my side:

I have removed all of the extra fuzz targets deemed unnecessary from this pull request
I have also been trying to figure out the exact time and resource limits in place on the OSS Fuzz's side of the setup by digging through their source code. There are many timeout options in place, and I am not really sure about their purpose. However, I at least found that the fuzzers are run with the -max_total_time=10 (10 seconds) argument during the generation of an introspector report (the same type of the report that is publicly available for your project).

Just to be sure, I have also emailed one of the OSS Fuzz maintainers to get the answer from them. I will follow up on this discussion once I receive a reply

JiaT75 commented 8 months ago

Just to be sure, I have also emailed one of the OSS Fuzz maintainers to get the answer from them. I will follow up on this discussion once I receive a reply

Thanks for looking into this!

After some thought, it seems like a better use of resources to omit fuzz_decode_stream_crc.c and fuzz_decode_lzip.c. If we think the CRC code should be fuzzed we can add a fuzz target to directly test the various check functions. I'm not sure this will be needed since the input data to the check functions doesn't have much impact on the code path taken. On the Lzip side, this feature isn't used much and the header is very simple. We have tests that cover this in the test framework already so it doesn't feel worth the resources to fuzz it when we already have two other fuzz targets that hit the interesting code paths (alone and stream decoder).

Instead, we should split fuzz_encoder_stream into two separate fuzz targets. The first could be called fuzz_encode_stream and the second fuzz_encode_stream_light. fuzz_encode_stream should use preset level 5 and fuzz_encode_stream_light should use preset level 1.

After this, I think we are ready to squash the commits. As long as the commits are well organized it doesn't matter exactly how you choose to squash them. Here is one suggestion:

Move fuzz.c to fuzz_decode_stream.c
Separate logic from fuzz_decode_stream.c into fuzz_common.h
Makefile changes
Add fuzz_decode_alone fuzz target
Add fuzz_encode_stream fuzz target
Add fuzz_encode_stream_light fuzz target

Commits 5 and 6 could be combined, up to you.

mvatsyk-lsg commented 8 months ago

Okay, @JiaT75, I've rebased the pull request. Does the commit history look good to you?

DavidKorczynski commented 8 months ago

Thanks for reaching out @mvatsyk-lsg -- I didn't go through the whole discussion here so am trying to give an OSS-Fuzz perspective from a limited understanding of this PR.

Regarding OSS-Fuzz resources, then I think by default it makes sense to not be too concerned about this. OSS-Fuzz relies on Clusterfuzz which has a set of scheduling/prioritisation strategies. A single fuzzer for CRC may be a bit much. However, it's also possible to merge a bunch of simple fuzzers into a single larger function:

int LLVMFuzzerTestOneInput(uint8_t *data, size_t size) {

  if (size < 1) {
    return 0;
  }
  uint8_t decider = data[0];
  data++;
  size--;
  switch decider {
      case 1: { fuzz_first_entrypoint(data, size); break; }
      case 2: { fuzz_second_entrypoint(data, size); break; }
      ...
      case N
    }
}

This is often a common strategy for hitting smaller functions. In fact, you can even do this by throwing the same smaller fuzzers into the larger meaningful fuzzers -- the fuzzer will through it's mutational genetic algorithm start exploring the code where there is more code to explore, so more efforts will be "put in the right places".

The scheduling in Clusterfuzz will be responsible for dividing time allocated to each of the targets.

That said, it's often less meaningful to fuzz code which has essentially no data processing, since the code execution will happen independent of the data provided by the fuzzer. Targeting this type of code is probably not the best and I wouldn't recommend fuzzing that sort of code.

JiaT75 commented 8 months ago

Thank you @DavidKorczynski for the explanation and the advice for combining fuzzers. I had not thought of using a byte from the fuzz input to control the fuzzer's entry point.

With that in mind, @mvatsyk-lsg we should combine fuzz_encode_stream.c and fuzz_encode_stream_light.c into just one fuzzer. We can use the same name fuzz_encode_stream.c for this fuzz target. We can use the first byte of input to help us determine the preset level. So the function could look like:

extern int
LLVMFuzzerTestOneInput(const uint8_t *inbuf, size_t inbuf_size)
{
    if (size == 0)
        return 0;

    lzma_stream strm = LZMA_STREAM_INIT;

    uint32_t preset_level;

    uint8_t decider = inbuf[0];

    switch (decider) {
    case 0:
    case 1:
    case 5:
        preset_level = (uint32_t)decider;
        break;
    case 6:
        preset_level = 0 | LZMA_PRESET_EXTREME;
        break;
    case 7:
        preset_level = 3 | LZMA_PRESET_EXTREME;
        break;
    default:
        return 0;
    }

    lzma_options_lzma opt_lzma;
    if (lzma_lzma_preset(&opt_lzma, preset_level)){
        fprintf(stderr, "lzma_lzma_preset() failed\n");
        abort();
    }

    // Initialize filter chain for lzma_stream_decoder() call
    // Use single LZMA2 filter for encoding
    lzma_filter filters[2];
    filters[0].id = LZMA_FILTER_LZMA2;
    filters[0].options = &opt_lzma;
    filters[1].id = LZMA_VLI_UNKNOWN;

    // Initialize the stream encoder using the above
    // filter chain and CRC64.
    if (lzma_stream_encoder(&strm,
            filters, LZMA_CHECK_CRC64) != LZMA_OK) {
        fprintf(stderr, "lzma_stream_encoder() failed\n");
        abort();
    }

    fuzz_code(&strm, inbuf  + 1, inbuf_size - 1);

    // Free the allocated memory.
    lzma_end(&strm);
    return 0;
}

This can test a few different dictionary sizes, match finders, nice length, modes, and depth levels with the same fuzzer.

mvatsyk-lsg commented 8 months ago

@JiaT75 done!

JiaT75 commented 8 months ago

@mvatsyk-lsg Thanks! Things are looking pretty good now. I created a separate branch with all of your commits plus a minor cleanup commit. Can you test this branch to be sure I didn't break anything during my changes?

As I'm sure you know, the draft PR in OSS-Fuzz needs updating from all the changes we made here. I want to do a quick local test before merging but it will be easier if the OSS-Fuzz changes are updated on that PR.

mvatsyk-lsg commented 8 months ago

@JiaT75 the changes look good to me! I have also updated the pull request to the oss-fuzz repo. To test the new fuzzing setup locally, you can run the following commands on your machine:

# clone my fork of the oss-fuzz repo with pull request changes
git clone https://github.com/mvatsyk-lsg/oss-fuzz
cd oss-fuzz/

# update the Dockerfile to clone the oss_fuzz branch specifically
sed -i 's/git clone /git clone -b oss_fuzz /' projects/xz/Dockerfile

# build project image
python3 infra/helper.py build_image 'xz'

# generate introspector report 
python3 infra/helper.py introspector 'xz'

JiaT75 commented 8 months ago

@mvatsyk-lsg Thanks for the commands to use for a local test. You caught my mistake but testing locally also highlighted it. I just merged your commits into master.

Great work with this! Thanks for being so flexible and responsive with all the changes.

tukaani-project / xz