google / oss-fuzz

OSS-Fuzz - continuous fuzzing for open source software.
https://google.github.io/oss-fuzz
Apache License 2.0

Python Coverage Question #11215

Open capuanob opened 10 months ago

capuanob commented 10 months ago

Hello,

I'd appreciate some more clarity on how ClusterFuzz calculates Python coverage. If I run coverage locally, I get coverage above 50% for a project I am working on. However, on ClusterFuzz, that same project reports 37% coverage due to the inclusion of extraneous site-packages used by the library being fuzzed.
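For reference, here is roughly how I generate the report locally, following the atheris docs (the harness name is a placeholder and the flags are from memory, so treat this as a sketch):

python3 -m coverage run fuzz_harness.py -atheris_runs=100000 corpus/*
python3 -m coverage report -m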

So, my questions are the following:

  1. Is there a way to exclude certain libraries from the Python coverage report? (See the sketch after this list for the kind of filtering I mean.)
  2. How is coverage "graded" against the >50% ideal-integration standard? Is it based solely on library source files, library source files + dependencies, or the introspector report?
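To make question 1 concrete: locally, coverage.py can already do this kind of filtering through a .coveragerc (a hypothetical config, purely illustrative):

[report]
omit =
    */site-packages/*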

Thank you for your review and help!

DavidKorczynski commented 10 months ago

Could you clarify how you generate the coverage report locally? The commands specifically? And which project is this?

capuanob commented 10 months ago

@DavidKorczynski This one is specifically icalendar, and I'm following the guidance on the atheris documentation here

DavidKorczynski commented 10 months ago

> @DavidKorczynski This one is specifically icalendar, and I'm following the guidance on the atheris documentation here

Okay, they use the same approach. FYI, you can generate coverage easily by way of OSS-Fuzz; simply run the command:

python3 infra/helper.py introspector icalendar

This will run a default corpus-generation step followed by code coverage report generation, but you can adjust it, e.g. to run the fuzzers for longer to collect a larger corpus, or to rely on an existing corpus.
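If you'd rather reuse an existing corpus, the standard coverage helpers work as well (the fuzz-target name and corpus path below are placeholders):

python3 infra/helper.py build_fuzzers --sanitizer coverage icalendar
python3 infra/helper.py coverage icalendar --fuzz-target=<fuzzer-name> --corpus-dir=<path-to-corpus>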

The reason you see extra packages in your code coverage reports on OSS-Fuzz is that OSS-Fuzz bundles your fuzzers + target code by way of pyinstaller: https://github.com/google/oss-fuzz/blob/41ee0518d0f15b2288678d86dcc5b1c02f33ca5d/infra/base-images/base-builder/compile_python_fuzzer#L105 All dependencies the Python project needs in order to work are packed into a single binary, and the fuzzing is executed from that binary. Coverage collection happens on this bundled set of Python modules, which means that all code, including certain 3rd-party deps, is considered "part of the program". The coverage report generation excludes certain Python core modules, but it does not exclude some 3rd-party deps, e.g. dateutil in the context of icalendar.
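Conceptually, the bundling step boils down to something like this (a simplified sketch of what compile_python_fuzzer does, not the exact invocation):

pyinstaller --distpath $OUT --onefile --name fuzz_target fuzz_target.py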

I don't think we should filter them out, though, largely because I consider 3rd-party deps to be just as much a part of your application as the code you've written yourself. I don't think we should exclude them for Q2 above.

capuanob commented 10 months ago

@DavidKorczynski Thank you for the clarification and the tip on coverage generation; I'll be sure to use that moving forward.

Based on my experience harnessing Python projects, a lot of Python libraries pull in heavy dependencies for a very small subset of their features. For libraries that import numpy or pandas, for example, I don't think it's possible to achieve 50% code coverage when the percentage is heavily swayed by tens of thousands of lines of code that aren't even reachable from the library being tested. Do you have any tips for handling these cases and how to approach coverage for them?
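To put illustrative numbers on the dilution (made up, not measured from a real project): if a library has 5,000 statements of its own but its bundled deps add 45,000 more, then covering 80% of the library itself (4,000 statements) while touching none of the deps reports 4,000 / 50,000 = 8% overall, even though the library-only figure is 80%.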

Thank you for your help and advice!

DavidKorczynski commented 10 months ago

> I don't think it's possible to achieve 50% code-coverage if the percentage is heavily swayed by tens of thousands of lines of code

The ability to cover code doesn't change just because we filter stuff out -- I get that filtering would raise the aggregate number reported at the top of the code coverage report, but the actual code being covered remains the same. This seems more related to Q2 above? If so, I'd prefer to keep coverage as is and instead provide an argument for why code coverage is X percent (not necessarily the percentage the coverage report shows), perhaps with arguments leaning towards code coverage of the attack surface.

FWIW, this is not a problem unique to Python; all of the other languages have the same issue. See e.g. the Envoy code coverage report, where at least 1.2 million lines of code are external dependencies -- some of it is bloat and some isn't, but it's very likely impossible to get that code coverage report to 100%.

> code that aren't even reachable from the library being tested

The problem is that this can be tricky to assess. In some cases there may be obvious bloat, but in others it may not be obvious, and in many cases we are quite interested in understanding what part of the 3rd-party deps has coverage.

For example, say a given project relies on a 3rd party image parsing/serialization library, then we definitely want to know the coverage of this parsing logic, which if we filter out likely implies we are missing insight into an important part of the attack surface. We could show this as follows:

def parse_user_input(raw_user_input):
  # All the interesting parsing logic lives inside the 3rd-party library.
  serialized_inp = some_third_party_serialization_lib.parse_raw_data(raw_user_input)
  ...

In this case, getting 100% coverage of the line serialized_inp = some_third_party_serialization_lib.parse_raw_data(raw_user_input) is trivial, but the real substance is in some_third_party_serialization_lib.parse_raw_data, and insight into coverage of this lib is necessary for understanding coverage of the attack surface.

Naturally, in many cases some_third_party_serialization_lib should itself be fuzzed, but I'm still leaning towards this being part of a project's coverage report, not least because e.g. parse_raw_data may be used in an erroneous manner (i.e. we can fuzz memcpy forever and find no issues, yet many projects may fail to provide arguments that meet memcpy's input domain, which causes issues).
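A toy Python analogue of the memcpy point (hypothetical code: struct itself behaves correctly, the caller does not):

import struct

def parse_packet(data: bytes) -> bytes:
    # struct.unpack_from is correct library code; the bug is in how we use it.
    (length,) = struct.unpack_from("<I", data, 0)
    payload = data[4:4 + length]
    # Erroneous caller: we never check len(payload) == length, so downstream
    # code that trusts the declared length can misbehave on truncated inputs.
    return payload

Fuzzing struct on its own would find nothing here; the issue only shows up in the caller's use of it, which is why coverage of the dep in the project's context is interesting.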

Fuzz Introspector performs reachability analysis to assist in this process, but the dynamic nature of Python makes the problem particularly difficult. I think this also speaks to not filtering out stuff, as a general solution is likely to exclude code that is reachable/on the attack surface.
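To illustrate why static reachability is hard in Python (a contrived sketch; handlers is a hypothetical package):

import importlib

def dispatch(handler_name: str, payload: bytes):
    # The target of this call is only known at runtime, so a static
    # reachability analysis cannot resolve which module ends up executed.
    mod = importlib.import_module(f"handlers.{handler_name}")
    return getattr(mod, "handle")(payload)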

For the purposes of tracking code coverage and using it as an estimate for completion analysis I think it would be nice to have improved insights into code coverage. I think assessing code coverage on a folder-level helps with this, but it will still be subject to the limitation mentioned above.

Again, to Q2 above I think it would be most complete to have some more qualitative argument as to why code coverage is X percent.

DavidKorczynski commented 9 months ago

There is a proposed solution to the same problem in Java here: https://github.com/google/oss-fuzz/pull/10860

In essence we have the same option in C/C++, because there we can control which code gets code coverage instrumentation.
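For instance, Clang's source-based coverage can restrict instrumentation to selected files or functions via -fprofile-list (a rough sketch from memory; check the Clang docs for the exact list format):

echo "src:/src/myproject/*" > coverage_allowlist.txt
clang -fprofile-instr-generate -fcoverage-mapping -fprofile-list=coverage_allowlist.txt ...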

Let me see about the option for adding prefixing capabilities to Python code coverage.
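For reference, coverage.py itself already supports prefix-style restriction via its source option, which such a capability might build on (a sketch of upstream coverage.py config, not of anything OSS-Fuzz does today):

[run]
source =
    icalendar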