Support for marked content section IDs

jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

MIT License

Support for marked content section IDs #961

Closed (dhdaines closed this 1 year ago)

dhdaines commented 1 year ago

As requested, this is the MCID part of #937 split out. Structure tree support (using pdfminer.six) will be a separate PR.

codecov[bot] commented 1 year ago

Codecov Report

Merging #961 (8b5b6a3) into develop (d8b9c15) will not change coverage. The diff coverage is 100.00%.

@@            Coverage Diff            @@
##           develop      #961   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           18        18           
  Lines         1588      1613   +25     
=========================================
+ Hits          1588      1613   +25

Files Changed	Coverage Δ
pdfplumber/page.py	`100.00% <100.00%> (ø)`

dhdaines commented 1 year ago

Note! This page only extracts marked-content identifiers for sequences of objects. There ~are a few other kinds~ is one kind of marked content that exists in PDF which it doesn't handle:

- ~marked-content sequences with tags (and no identifiers) - This PR will be fixed ASAP to support this~ DONE!
- marked-content points - These are marked points (with a tag and possible attributes) in the content stream which don't correspond to any given object. It isn't clear how this could be supported in pdfplumber.
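
For readers who want to see what the new attributes make possible, here is a minimal sketch that groups character text by marked-content section. It assumes each object dict carries the `mcid` and `tag` keys this PR adds (read with `.get` in case they are absent or `None`); the helper names `group_text_by_mcid` and `text_by_mcid_for_page` are hypothetical and not part of pdfplumber.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Optional, Tuple

Key = Tuple[int, Optional[str]]


def group_text_by_mcid(chars: Iterable[Dict[str, Any]]) -> Dict[Key, str]:
    """Join char text per (mcid, tag) pair, in the order the chars arrive.

    Chars without an MCID (key missing or None) are skipped.
    """
    groups: Dict[Key, list] = defaultdict(list)
    for ch in chars:
        mcid = ch.get("mcid")
        if mcid is None:
            continue
        groups[(mcid, ch.get("tag"))].append(ch["text"])
    return {key: "".join(parts) for key, parts in groups.items()}


def text_by_mcid_for_page(path: str, page_index: int = 0) -> Dict[Key, str]:
    """Convenience wrapper: open a PDF and group one page's chars."""
    import pdfplumber  # third-party; imported lazily so the helper above stays dependency-free

    with pdfplumber.open(path) as pdf:
        return group_text_by_mcid(pdf.pages[page_index].chars)
```

Because marked-content points are not attached to any object, they would not show up in a grouping like this.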

jsvine commented 1 year ago

Many thanks for this, @dhdaines! It's a clever solution, and adds what seems like it will be a powerful feature for people working with PDFs that have marked content.

For now, I'm going to mark mcid and tag in the README as experimental attributes, but will remove that note if/when the pdfminer.six internals that make this possible remain stable.

dhdaines commented 1 year ago

> Many thanks for this, @dhdaines! It's a clever solution, and adds what seems like it will be a powerful feature for people working with PDFs that have marked content.
>
> For now, I'm going to mark mcid and tag in the README as experimental attributes, but will remove that note if/when the pdfminer.six internals that make this possible remain stable.

Thank you! I will submit another PR soon to add the tag attributes, as these are useful for identifying headers and footers.
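
Until tag attributes are available, the `tag` value alone already allows a coarse filter. The sketch below assumes the PDF producer wraps running headers and footers in `Artifact` marked content, which tagged PDFs commonly do but which is not guaranteed. The function name `drop_artifacts` is hypothetical.

```python
from typing import Any, Dict, Iterable, List


def drop_artifacts(
    objs: Iterable[Dict[str, Any]], artifact_tag: str = "Artifact"
) -> List[Dict[str, Any]]:
    """Return the objects whose `tag` is not the artifact tag.

    Objects with no `tag` key (or tag None) are kept, since they carry
    no evidence of being pagination artifacts.
    """
    return [o for o in objs if o.get("tag") != artifact_tag]
```

Applied to a page's `chars`, this would keep body text and discard anything marked as an artifact. Telling a header from a footer would still need the attributes mentioned above.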