jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.57k stars 659 forks source link

Handle invalid metadata values #320

Closed samkit-jain closed 3 years ago

samkit-jain commented 3 years ago

This PR fixes issue #316

Background: The PDF exhibiting the exception contained a metadata key Changes which was a list of PDFObjRef objects. The code responsible for resolving metadata expected a list to be of string-like objects and since in this case, it wasn't, the decode_text() function raised the TypeError exception as mentioned in the issue.

This metadata value type is not as per the PDF specification (see p556 sec14.3). PDF specifications are well, just specifications. Not conforming to it does not necessarily render a PDF unreadable and a lot of PDFs don't abide by every single rule mentioned in it. Hence, in the first commit, I added a try-except block that would log a warning to the console when a metadata value is unparseable and continues to the next one. This would allow the user to perform non-metadata related operations without any issue. I think we can also make this behaviour configurable by adding a new parameter, say, fail_on_metadata in the PDF.open() method that when set to True, will raise an exception and when False, just log to the console.

Fix: To resolve the issue, I have added a new method resolve_and_decode() which would recursively go through the metadata values and decode them.

Additional: Pinned versions of pytest and removed unused imports in tests/test_issues.py.

codecov[bot] commented 3 years ago

Codecov Report

Merging #320 (b5239ad) into develop (d3b84da) will increase coverage by 0.68%. The diff coverage is 83.33%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #320      +/-   ##
===========================================
+ Coverage    97.48%   98.17%   +0.68%     
===========================================
  Files           10       10              
  Lines         1193     1205      +12     
===========================================
+ Hits          1163     1183      +20     
+ Misses          30       22       -8     
Impacted Files Coverage Δ
pdfplumber/pdf.py 95.29% <60.00%> (-4.71%) :arrow_down:
pdfplumber/utils.py 100.00% <100.00%> (ø)
pdfplumber/display.py 88.81% <0.00%> (+7.45%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update d3b84da...e849cbb. Read the comment docs.

jsvine commented 3 years ago

Thanks, @samkit-jain! I think this is a smart approach. Just a few notes / requests:

Thanks again!

samkit-jain commented 3 years ago

@jsvine I have made the metadata failure condition configurable. Since I am unable to create a PDF that would cause an exception in the new recursive resolver method, I was unable to increase the overall test coverage. Added tests for the draw methods to increase the coverage instead, but, this meant that the test coverage for the files changed in this PR is still below the previous one. That's why the codecov/patch is failing but codecov/project is passing. The changelog has been updated as well.

jsvine commented 3 years ago

Thanks, @samkit-jain! I'm slightly wary of incorporating new code that doesn't have a PDF to test it with, but have merged this since (a) the logic of the code makes sense, (b) the untested part of the code handles what would already be edge-cases, and (c) the tests pass for all the PDFs we are using.