jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Upgrade pdfminer.six from 20200517 to 20211012 #515

Closed jsvine closed 3 years ago

jsvine commented 3 years ago

See pdfminer.six's changelog for details: https://github.com/pdfminer/pdfminer.six/blob/develop/CHANGELOG.md

... but a key difference is an improvement in how it assigns line, rect, and curve objects. (Diagonal two-point lines, for instance, are now line objects instead of curve objects.)

As a result, this commit also adjusts some of the tests, where the pre-20211012 versions had been incorrectly assigning lines as LTCurve objects.

Note: This commit also tweaks CHANGELOG.md to call the next release 0.6.0, since this will be a breaking change for some use-cases — though an improvement, and necessary if we don't want pdfplumber to fall far out of sync with pdfminer.six.

codecov[bot] commented 3 years ago

Codecov Report

Merging #515 (5794aaa) into develop (7a65785) will not change coverage. The diff coverage is n/a.

Current head 5794aaa differs from pull request most recent head 4b61c38. Consider uploading reports for the commit 4b61c38 to get more accurate results.

@@           Coverage Diff            @@
##           develop     #515   +/-   ##
========================================
  Coverage    98.30%   98.30%           
========================================
  Files           10       10           
  Lines         1236     1236           
========================================
  Hits          1215     1215           
  Misses          21       21           


jsvine commented 3 years ago

Good question, and I can see how that'd be confusing. Curve objects do still exist: they're paths with more than two points that do not constitute a rectangle (or multiple consecutive rectangles). The fixes in the new version of pdfminer.six simply assign "LTLine" correctly in cases where it had previously assigned "LTCurve", such as diagonal lines. For an example of paths that are still correctly curve objects, see the shapes on the final page of this PDF: http://www.pdfill.com/example/pdf_drawing_new.pdf
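
To make the distinction concrete, here is a minimal sketch of the rule as described in this comment. It is not pdfminer.six's actual implementation: `classify_path` and `_is_axis_aligned_rect` are hypothetical helpers, a path is assumed to be a plain list of (x, y) points, and the "multiple consecutive rectangles" case is left out for brevity.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def _is_axis_aligned_rect(pts: List[Point]) -> bool:
    """Hypothetical check: do four corner points trace an axis-aligned rectangle?"""
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = pts
    # The path may start with a horizontal edge or with a vertical edge.
    horizontal_first = y0 == y1 and x1 == x2 and y2 == y3 and x3 == x0
    vertical_first = x0 == x1 and y1 == y2 and x2 == x3 and y3 == y0
    return horizontal_first or vertical_first


def classify_path(points: List[Point]) -> str:
    """Hypothetical helper: label a path as 'line', 'rect', or 'curve'."""
    pts = list(points)
    # Assumption: a closed path repeats its starting point at the end; drop it.
    if len(pts) > 2 and pts[0] == pts[-1]:
        pts = pts[:-1]
    if len(pts) == 2:
        # Two-point paths are lines, whether diagonal or not.
        return "line"
    if len(pts) == 4 and _is_axis_aligned_rect(pts):
        return "rect"
    # Everything else (more than two points, not a rectangle) stays a curve.
    return "curve"


diagonal = classify_path([(0, 0), (10, 10)])
box = classify_path([(0, 0), (10, 0), (10, 5), (0, 5), (0, 0)])
triangle = classify_path([(0, 0), (10, 0), (5, 8)])
```

Under these assumptions, the diagonal two-point path is labelled a line, which mirrors the behaviour change in this upgrade, while the triangle remains a curve.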