jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Support for marked content section IDs #961

Closed dhdaines closed 1 year ago

dhdaines commented 1 year ago

As requested, this is the MCID part of #937 split out. Structure tree support (using pdfminer.six) will be a separate PR.

codecov[bot] commented 1 year ago

Codecov Report

Merging #961 (8b5b6a3) into develop (d8b9c15) will not change coverage. The diff coverage is 100.00%.

@@            Coverage Diff            @@
##           develop      #961   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           18        18           
  Lines         1588      1613   +25     
=========================================
+ Hits          1588      1613   +25     
Files Changed          Coverage Δ
pdfplumber/page.py     100.00% <100.00%> (ø)

dhdaines commented 1 year ago

Note! This PR only extracts marked-content identifiers for sequences of objects. There is one other kind of marked content that exists in PDF which it doesn't handle.

jsvine commented 1 year ago

Many thanks for this, @dhdaines! It's a clever solution, and it adds what looks like a powerful feature for people working with PDFs that have marked content.

For now, I'm going to mark mcid and tag in the README as experimental attributes, but will remove that note if/when the pdfminer.six internals that make this possible prove stable.
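
For anyone wondering how these attributes might be used: a minimal sketch, assuming each character dict carries the experimental mcid and tag keys (None when a character is outside any marked-content section). The helper name text_by_mcid and the sample data are hypothetical, not part of pdfplumber.

```python
from collections import defaultdict


def text_by_mcid(chars):
    """Join character text per marked-content section.

    `chars` is a list of dicts shaped like pdfplumber's `page.chars`.
    Assumption: each dict may carry "mcid" and "tag" keys, both None
    when the character is not inside a marked-content section.
    """
    groups = defaultdict(list)
    for ch in chars:
        mcid = ch.get("mcid")
        if mcid is None:
            # Skip characters that belong to no marked-content section.
            continue
        groups[(mcid, ch.get("tag"))].append(ch["text"])
    return {key: "".join(parts) for key, parts in groups.items()}


# Hand-made stand-in for `page.chars`; real values depend on the PDF.
sample_chars = [
    {"text": "H", "mcid": 0, "tag": "H1"},
    {"text": "i", "mcid": 0, "tag": "H1"},
    {"text": "x", "mcid": None, "tag": None},
    {"text": "o", "mcid": 1, "tag": "P"},
    {"text": "k", "mcid": 1, "tag": "P"},
]
sections = text_by_mcid(sample_chars)
```

With a real file, the list would come from something like pdfplumber.open(path).pages[0].chars. Grouping is done per page here, since this sketch assumes MCIDs are only meaningful within a single page.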

dhdaines commented 1 year ago

> Many thanks for this, @dhdaines! It's a clever solution, and it adds what looks like a powerful feature for people working with PDFs that have marked content.
>
> For now, I'm going to mark mcid and tag in the README as experimental attributes, but will remove that note if/when the pdfminer.six internals that make this possible prove stable.

Thank you! I will submit another PR soon to add the tag attributes, as these are useful for identifying headers and footers.
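
Until then, a rough sketch of what is possible with the tag alone, assuming the PDF marks its running headers and footers with the Artifact tag (a common convention in tagged PDFs, but not guaranteed, and it cannot tell a header from other artifacts). The function name split_artifacts and the sample data are hypothetical.

```python
def split_artifacts(chars, artifact_tag="Artifact"):
    """Separate body characters from those tagged as artifacts.

    Assumption: `chars` are dicts shaped like pdfplumber's `page.chars`
    with an experimental "tag" key, and headers/footers use the
    "Artifact" tag. Other PDFs may tag them differently or not at all.
    """
    body, artifacts = [], []
    for ch in chars:
        if ch.get("tag") == artifact_tag:
            artifacts.append(ch)
        else:
            body.append(ch)
    return body, artifacts


# Hand-made stand-in for `page.chars`.
sample_chars = [
    {"text": "7", "mcid": None, "tag": "Artifact"},
    {"text": "A", "mcid": 0, "tag": "P"},
    {"text": "b", "mcid": 0, "tag": "P"},
]
body, artifacts = split_artifacts(sample_chars)
body_text = "".join(ch["text"] for ch in body)
```

Distinguishing a header from a footer, or from other artifact types, would need the tag's attributes, which this sketch does not have access to.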