pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License
5.96k stars 930 forks source link

OSSFuzz Integration #949

Closed capuanob closed 4 months ago

capuanob commented 8 months ago

Pull request

This Pull-Request includes the necessary changes to integrate fuzzing into pdfminer.six for OSS-Fuzz continuous fuzzing, as discussed in Issue 918.

In short, this PR adds atheris (the fuzzing framework) as a development dependency, and a new fuzzing directory containing a corpus, some initial harnesses, the necessary CI file to integrate the project into OSSFuzz, and a build script to be used by ClusterFuzz to prepare for nightly fuzzing.

In addition to the above, two simple bug-fixes are resolved to address crashes that were occurring too early into fuzzing, preventing progress.

How Has This Been Tested?

The fuzzing harnesses are tests in and of themselves, so they were tested via coverage analysis and allowing them to run.

NOTE: The CIFuzz.yml job will fail until Google merges the necessary pdfminer Dockerfile into their OSS-Fuzz repository. This can only be done after this PR is merged.

Checklist

capuanob commented 8 months ago

@pietermarsman Hey Pieter, pinging for visibility. Looking forward to getting this integrated and uncovering bugs!

capuanob commented 6 months ago

@pietermarsman Following up on this, I spent a good amount of time on this and would love to see it integrated!

capuanob commented 5 months ago

@goulu @jstockwin @pudo @tataganesh @pietermarsman I hope all is well, I would really appreciate a review of this PR

pietermarsman commented 4 months ago

Hi @capuanob,

Thanks for your time on this. This repo is maintained on a very slow pace. But it is maintained, so your work won't go to waste.

I haven't ran the code yet, will do so later today. But it looks good. Great that you already were able to find and fix some vulnerabilities.

I've two initial comments:

capuanob commented 4 months ago

Thank you for your reply!

On Mon, Jun 24, 2024 at 2:46 AM Pieter Marsman @.***> wrote:

Hi @capuanob https://github.com/capuanob,

Thanks for your time on this. This repo is maintained on a very slow pace. But it is maintained, so your work won't go to waste.

I haven't ran the code yet, will do so later today. But it looks good. Great that you already were able to find and fix some vulnerabilities.

I've two initial comments:

  • It looks like you have copied the testing pdf's. Are these even used? Can you also use their equivalents from the samples directory?
  • Is it common to have fuzzing as a top-level directory? I guess it has the same status as the tests and tools, so it seems like the right place. But I'm always reluctant to add top-level files and directories.
  • Is it also possible and useful to run the tests locally? In that case we can/should perhaps add the commands to the noxfile.py.

— Reply to this email directly, view it on GitHub https://github.com/pdfminer/pdfminer.six/pull/949#issuecomment-2185735264, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHXFHTNNDBGGWS37IID6K63ZI66D3AVCNFSM6AAAAABEPH7OO6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOBVG4ZTKMRWGQ . You are receiving this because you were mentioned.Message ID: @.***>

pietermarsman commented 4 months ago

I could update this to copy from the samples directory at run-time, rather than hosting them twice

Yes, that is preferable. Maybe use a glob to specify them.

For the projects that I have done, I've either had fuzzing be a top-level directory or a sub-directory of testing. I can adjust to whichever you prefer.

I guess I prefer to have it as a top-level directory. Since the tests directory is really the "pytest directory". So no change required here.

While it is possible, I would discourage having it be a local test for a few reasons. Since fuzzing isn't entirely deterministic, you won't get the same kind of consistency as you would from unit tests. Secondly, there isn't a defined end-point for fuzzing so it'd be difficult to set an arbitrary 'timeout' and be able to definitively say that something has been sufficiently tested

Ok, good to know.

I've ran out of time for today, so will continue on this next Monday. I noticed that my understanding of fuzzing is very minimal, do you have any good resources that I can use to improve my understanding?

Edit: I've found this tutorial which gave me a good understanding. I do yet fully understand how the fuzzer can efficiently mutate the corpus pdf files to generate new valid ones though. But perhaps it just tries a gazillion of times.

capuanob commented 4 months ago

@pietermarsman I've removed the corpus files from this commit.

Now, instead, the Dockerfile that I will submit to Google's OSS-Fuzz project after this PR is integrated (you can see this Dockerfile here if you are interested) will glob the simple*.pdf files into a corpus.

You've got the right idea on how it makes mutations to the input, since it typically requires gazillions of mutations. To give a bit more context, the atheris Python library will "instrument" the library with extra instructions in each code block (ie, any branch) at runtime. The fuzzing framework strives to achieve depth and breadth in its coverage and will analyze its current code block boundaries to determine what kind of intelligent changes could be made to get deeper into the parsing code.

With time, the fuzzing corpus evolves to be more and more robust. ClusterFuzz also provides great insights on current fuzz-blockers that can be overcome with future PRs to improve the fuzzers.

Another thing I'd be interested in exploring in the future is grammar-based fuzzing. I've never implemented one myself, but am aware of the technique. You can use a grammar (say, the grammar for a PDF) to guide smarter mutations.

Happy to provide more context if desired!

capuanob commented 4 months ago

@pietermarsman I just saw your question about resources on fuzz-testing. The LibFuzzer docs are great, I'd also suggest:

https://google.github.io/oss-fuzz/ (More details on this specific program and ClusterFuzz - which you will get access to)

The researchers at Trail of Bits have a good guide on fuzzing as well. https://appsec.guide/docs/fuzzing/python/

The Python section isn't fully built-out, but atheris also uses LibFuzzer (which is commonly used for C/C++ fuzzing).

capuanob commented 4 months ago

Good morning,

Feel free to contribute any commits you would like, I want to make sure this contribution integrates well into the larger project!

I will contribute the rest by the end of this weekend.

On Thu, Jun 27, 2024 at 2:37 AM Pieter Marsman @.***> wrote:

@.**** requested changes on this pull request.

Top! It is looking great!

Code is running smooth on my machine and I've already found a couple of more bugs. So I'm curious about the results from ClusterFuzz.

I've scattered the MR with a bunch of micro-management comments to get it into the same shape as the rest of the code base. Most importantly is adding the fuzzing directory to the noxfile.py testing dirs so that all the code quality checks run on it.

I can do a couple of small commits to help if you're ok with that.

In CHANGELOG.md https://github.com/pdfminer/pdfminer.six/pull/949#discussion_r1656516635 :

@@ -8,6 +8,7 @@ The format is based on Keep a Changelog.

Added

  • Support for zipped jpeg's (#938) +- Added fuzzing harnesses for integration into Google's OSS-Fuzz

⬇️ Suggested change

-- Added fuzzing harnesses for integration into Google's OSS-Fuzz +- Fuzzing harnesses for integration into Google's OSS-Fuzz (949)


In fuzzing/extract_text_fuzzer.py https://github.com/pdfminer/pdfminer.six/pull/949#discussion_r1656520023 :

@@ -0,0 +1,45 @@ +import sys + +import atheris + +from fuzz_helpers import EnhancedFuzzedDataProvider + +with atheris.instrument_imports():

  • from pdf_utils import PDFValidator, prepare_pdfminer_fuzzing
  • from pdfminer.high_level import extract_text
  • +from pdfminer.psparser import PSException

  • +def TestOneInput(data: bytes):

What do you think of using snake_case for function names?

I know the atheris docs use CamelCase for Python code, but the rest of this repo uses snake_case.

In fuzzing/extract_text_fuzzer.py https://github.com/pdfminer/pdfminer.six/pull/949#discussion_r1656523269 :

+ + +def TestOneInput(data: bytes):

  • if not PDFValidator.is_valid_byte_stream(data):
  • Not worth continuing with this test case

  • return -1
  • fdp = EnhancedFuzzedDataProvider(data)
  • try:
  • with fdp.ConsumeMemoryFile() as f:
  • max_pages = fdp.ConsumeIntInRange(0, 1000)
  • extract_text(
  • f,
  • maxpages=max_pages,
  • page_numbers=fdp.ConsumeIntList(fdp.ConsumeIntInRange(0, max_pages), 2),

The page_numbers can also be None. I'm wondering if we can test that as well.

Also in the other fuzzers.

In fuzzing/extract_text_fuzzer.py https://github.com/pdfminer/pdfminer.six/pull/949#discussion_r1656524146 :

+ +def TestOneInput(data: bytes):

  • if not PDFValidator.is_valid_byte_stream(data):
  • Not worth continuing with this test case

  • return -1
  • fdp = EnhancedFuzzedDataProvider(data)
  • try:
  • with fdp.ConsumeMemoryFile() as f:
  • max_pages = fdp.ConsumeIntInRange(0, 1000)
  • extract_text(
  • f,
  • maxpages=max_pages,
  • page_numbers=fdp.ConsumeIntList(fdp.ConsumeIntInRange(0, max_pages), 2),
  • laparams=PDFValidator.generate_layout_parameters(fdp)

laparams can also be None.

Also in the other fuzzers.

In fuzzing/extract_text_fuzzer.py https://github.com/pdfminer/pdfminer.six/pull/949#discussion_r1656526230 :

@@ -0,0 +1,45 @@ +import sys + +import atheris + +from fuzz_helpers import EnhancedFuzzedDataProvider

Would it be possible to start all imports from the root of the project? Usually that makes it easier to get the imports working.

So that would require from fuzzing.fuzz_helpers import ... here.

Add a init.py file to the fuzzing dir should get it to work.

In fuzzing/extract_text_fuzzer.py https://github.com/pdfminer/pdfminer.six/pull/949#discussion_r1656528176 :

@@ -0,0 +1,45 @@ +import sys + +import atheris + +from fuzz_helpers import EnhancedFuzzedDataProvider + +with atheris.instrument_imports():

  • from pdf_utils import PDFValidator, prepare_pdfminer_fuzzing
  • from pdfminer.high_level import extract_text
  • +from pdfminer.psparser import PSException

  • +def TestOneInput(data: bytes):

The rest of pdfminer is type checked with mypy. That would error on the missing return type here. What do you think of adding the fuzzing directory to the noxfile.py?

That would also enable black formatting and linting.

In pdfminer/pdfdocument.py https://github.com/pdfminer/pdfminer.six/pull/949#discussion_r1656533878 :

@@ -977,7 +977,7 @@ def find_xref(self, parser: PDFParser) -> int: else: raise PDFNoValidXRef("Unexpected EOF") log.debug("xref found: pos=%r", prev)

  • assert prev is not None
  • assert prev is not None and prev.isdigit()

I'm assuming this is already the first bug that you find by using fuzzing. While running the code myself I found a couple of others.

What do you think of separating fixing errors into separate MR's. I'm hoping that we get some great statistics from ossfuzz on how many errors we fixed this way.

— Reply to this email directly, view it on GitHub https://github.com/pdfminer/pdfminer.six/pull/949#pullrequestreview-2144344679, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHXFHTLUGS64V4QMKDVZPLDZJOXJPAVCNFSM6AAAAABEPH7OO6VHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDCNBUGM2DINRXHE . You are receiving this because you were mentioned.Message ID: @.***>

pietermarsman commented 4 months ago

I've fixed all my own comments and think this is now ready.

One big change I did was to subclass all exceptions that pdfminer raises from PSException. Such that the exception handling is now a bit easier. Encapsulating this is also good for the package.

If you confirm that the current code still works with ClusterFuzz I'll merge it.

capuanob commented 4 months ago

Awesome, thank you so much for helping out with this effort!

Also definitely a great addition to have a base class for raised exceptions, I tend to look out for that to make the catch block clearer.

Looks good to me, whenever you merge I’ll contribute to Google’s upstream.

Thank You, Bailey Capuano

On Thu, Jun 27, 2024 at 4:18 PM Pieter Marsman @.***> wrote:

I've fixed all my own comments and think this is now ready.

One big change I did was to subclass all exceptions that pdfminer raises from PSException. Such that the exception handling is now a bit easier. Encapsulating this is also good for the package.

— Reply to this email directly, view it on GitHub https://github.com/pdfminer/pdfminer.six/pull/949#issuecomment-2195600496, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHXFHTPGV66AMQ4G3VQQD43ZJRXQ5AVCNFSM6AAAAABEPH7OO6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOJVGYYDANBZGY . You are receiving this because you were mentioned.Message ID: @.***>

capuanob commented 4 months ago

@pietermarsman Can confirm that all builds succeed with ClusterFuzz

pietermarsman commented 4 months ago

Done. Thanks for everything! 👏

pietermarsman commented 4 months ago

Hi @capuanob ,

I've checked out OSS-Fuzz and monorail but could not find any useful output yet. The build seems to succeed. But there are no issues opened yet. And the coverage suggests that nothing gets past the is_valid_byte_stream check. Maybe the corpus is not loaded correctly.

capuanob commented 4 months ago

Hi Pieter, I've got this on my docket to look into. I may have some time this weekend, but I am in the middle of a move so my time is limited. If you happen to find anything before then, please do let me know!

Best, Bailey+

On Wed, Jul 3, 2024 at 11:33 AM Pieter Marsman @.***> wrote:

Hi @capuanob https://github.com/capuanob ,

I've checked out OSS-Fuzz https://oss-fuzz.com/ and monorail https://bugs.chromium.org/p/oss-fuzz/issues/list?q=&can=2 but could not find any useful output yet. The build seems to succeed. But there are no issues opened yet. And the coverage suggests that nothing gets past the is_valid_byte_stream check. Maybe the corpus is not loaded correctly.

— Reply to this email directly, view it on GitHub https://github.com/pdfminer/pdfminer.six/pull/949#issuecomment-2206577530, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHXFHTL74RVNEQBFYNXOHBDZKQKVZAVCNFSM6AAAAABEPH7OO6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMBWGU3TONJTGA . You are receiving this because you were mentioned.Message ID: @.***>

pietermarsman commented 4 months ago

@capuanob

Take your time. Good luck with the move!

Some thoughts:

  1. Not sure what changed, but somethings seems to be working now. There are 3 issues now.
  2. But the coverage is still very low.
  3. I noticed here the corpus is copied to $SRC/corpus but the working dir is $SRC/pdfminer.six. I'm not sure if that is correct. Seems like the corpus is outside the workdir.
  4. I noticed there are integration rewards. Are you applying for those?
  5. I noticed there is the option to [file GitHub issues] as well. Can you set that up for pdfminer.six?
capuanob commented 4 months ago

Good morning,

From a quick review, I think you’re right- it looks like I just copied the corpus without zipping it to the $OUT directory with the proper name. I’ll fix that in a subsequent PR to pdfminer. I am undergoing the rewards process as well. As for the GitHub issues, I will integrate those!

Thank You, Bailey Capuano

On Sat, Jul 6, 2024 at 12:11 PM Pieter Marsman @.***> wrote:

@capuanob https://github.com/capuanob

Take your time. Good luck with the move!

Some thoughts:

  1. Not sure what changed, but somethings seems to be working now. There are 3 issues https://oss-fuzz.com/testcases?project=pdfminersix&open=yes now.
  2. But the coverage https://storage.googleapis.com/oss-fuzz-introspector/pdfminersix/inspector-report/20240706/fuzz_report.html#High-level-conclusions is still very low.
  3. I noticed here https://github.com/google/oss-fuzz/blob/master/projects/pdfminersix/Dockerfile the corpus is copied to $SRC/corpus but the working dir is $SRC/pdfminer.six. I'm not sure if that is correct. Seems like the corpus is outside the workdir.
  4. I noticed there are integration rewards https://google.github.io/oss-fuzz/getting-started/integration-rewards/. Are you applying for those?
  5. I noticed there is the option to [file GitHub issues] as well. Can you set that up for pdfminer.six?

— Reply to this email directly, view it on GitHub https://github.com/pdfminer/pdfminer.six/pull/949#issuecomment-2211808723, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHXFHTPW7NUUSIYMPYJTJVDZLAJKXAVCNFSM6AAAAABEPH7OO6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMJRHAYDQNZSGM . You are receiving this because you were mentioned.Message ID: @.***>

capuanob commented 4 months ago

@pietermarsman Put in a PR with Google to add GitHub issues, feel free to follow here

pietermarsman commented 4 months ago

Thanks for integrating the GitHub issues. And :crossed_fingers: your latest PR fixes the coverage. As for the reward integration rewards, am I eligible as well? A reward for working on OSS, that almost seams to good to be true :smiley:

capuanob commented 4 months ago

I’m unsure, I’ve only done this side of the house. You’d probably have to reach out to OSSFuzz about the maintainers side of things!

Thank You, Bailey Capuano

On Mon, Jul 8, 2024 at 2:00 AM Pieter Marsman @.***> wrote:

Thanks for integrating the GitHub issues. And 🤞 your latest PR fixes the coverage. As for the reward integration rewards, am I eligible as well? A reward for working on OSS, that almost seams to good to be true 😃

— Reply to this email directly, view it on GitHub https://github.com/pdfminer/pdfminer.six/pull/949#issuecomment-2213091894, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHXFHTMOSUUEUO5MPMMPF3LZLITGBAVCNFSM6AAAAABEPH7OO6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMJTGA4TCOBZGQ . You are receiving this because you were mentioned.Message ID: @.***>