jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Support for PDF 1.3 logical structure #963

Closed dhdaines closed 8 months ago

dhdaines commented 11 months ago

As promised, here is the other PR supporting the structure tree using pdfminer.six - so no overhead and no typing weirdness. In the end the implementation is ~rather nice and simple.~ somewhat complex once we take into account the multiplicity of optional features in the structure tree specification.

There is one caveat, which is mentioned in the docstring: whereas other PDF engines will include empty structure elements in the structure tree, this implementation does not, for much the same reason that #961 doesn't do anything for marked structure points. Since pdfplumber is based around extracting objects from the PDF, it isn't very useful to have structure that can't be associated with any objects, at least in my opinion.
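
To illustrate what leaving out empty structure elements means, here is a minimal sketch. It is not the code from this PR: the dict layout (`type`, `mcids`, `children`) and the helper `prune_empty` are assumptions made purely for illustration, and the example uses plain dictionaries rather than any pdfplumber or pdfminer.six objects.

```python
from typing import Any, Dict, Optional

# Hypothetical element shape: {"type": str, "mcids": [int], "children": [Element]}
Element = Dict[str, Any]


def prune_empty(element: Element) -> Optional[Element]:
    """Return a copy of `element` without empty descendants.

    An element counts as empty when it has no marked-content IDs of its
    own and none of its children survive pruning. Empty elements yield None.
    """
    kept_children = []
    for child in element.get("children", []):
        pruned_child = prune_empty(child)
        if pruned_child is not None:
            kept_children.append(pruned_child)

    if not element.get("mcids") and not kept_children:
        return None

    # Copy everything except the old child list, then attach survivors.
    pruned = {key: value for key, value in element.items() if key != "children"}
    if kept_children:
        pruned["children"] = kept_children
    return pruned


tree = {
    "type": "Document",
    "children": [
        {"type": "P", "mcids": [0]},
        # This section has no marked content anywhere beneath it.
        {"type": "Sect", "children": [{"type": "P"}]},
    ],
}

pruned_tree = prune_empty(tree)
```

In this sketch the `Sect` branch disappears entirely, because nothing under it can be tied to an object on a page, while the `P` element with `mcids` is kept.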

Also, in the case where there are unparsed pages in a PDF, it isn't quite clear what to do about structure elements with no explicit page ID, unless we assume that elements with no marked content are always excluded.

But, if you like, we can (optionally?) add these structure elements; it isn't too hard to do.

codecov[bot] commented 11 months ago

Codecov Report

Merging #963 (036044d) into develop (336f83f) will not change coverage. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           develop      #963    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           18        19     +1     
  Lines         1613      1897   +284     
==========================================
+ Hits          1613      1897   +284     

Files Changed             Coverage Δ
pdfplumber/cli.py         100.00% <100.00%> (ø)
pdfplumber/page.py        100.00% <100.00%> (ø)
pdfplumber/pdf.py         100.00% <100.00%> (ø)
pdfplumber/structure.py   100.00% <100.00%> (ø)

dhdaines commented 11 months ago

This should be complete now, I'll let you review it at your leisure! It works for me (tm)

dhdaines commented 10 months ago

Hi! If you get a chance could you review this soon? The test suite is now pretty extensive since I learned how to create "synthetic" PDFs with a text editor, and I've removed all but one of the pragma: nocover comments (the remaining one is a "shouldn't happen" case).
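
For readers unfamiliar with hand-written PDFs, the sketch below shows the general idea of assembling one from plain text. It is not one of the test files from this PR and contains no structure tree; the function name `build_minimal_pdf` and the page contents are assumptions chosen for illustration. The only fiddly part, when writing such a file by hand, is keeping the byte offsets in the cross-reference table correct, which the code computes automatically.

```python
def build_minimal_pdf() -> bytes:
    """Assemble a one-page, blank PDF and compute its xref table offsets."""
    # Object bodies, numbered 1..N in order (hypothetical minimal document).
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 200 200] >>",
    ]

    out = bytearray(b"%PDF-1.3\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry for the reserved object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes

    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)


pdf_bytes = build_minimal_pdf()
```

A synthetic test file for logical structure would add further objects (for example a structure tree root referenced from the catalog) in the same manner; those details are left out here.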

I think it is really a good implementation of PDF logical structure, though obviously there will be weird PDFs out there that trigger undefined behaviour!

jsvine commented 8 months ago

Thanks for this, @dhdaines! My apologies for not getting to it sooner; it took me a little while to wrap my head around it. Now merged. One quick follow-up: Want to note the method in the README.md, summarized however you best see fit (or just linking to your docs/structure.md file)?

dhdaines commented 8 months ago

> Thanks for this, @dhdaines! My apologies for not getting to it sooner; it took me a little while to wrap my head around it. Now merged. One quick follow-up: Want to note the method in the README.md, summarized however you best see fit (or just linking to your docs/structure.md file)?

Ah! You're right, it ought to be in README.md, I thought that I had put it there. I can submit another PR for this.

dhdaines commented 8 months ago

Thanks for merging as well! No problem about the delay, it is a large and complex feature. There is one quirk to the implementation that might require a follow-on: structure elements are allowed to span multiple pages, which is complicated to handle properly because PDF is otherwise extremely page-oriented (marked content sections notably can't do this). This means that, in some situations, objects that do belong to the structure tree might appear not to be part of it. I will file this as an issue once I find a good test case for it.
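
A small sketch may make the multi-page quirk more concrete. It does not reflect how pdfplumber stores anything internally: the element names, the `(page_number, mcid)` pairs, and the helper `content_by_page` are all hypothetical, chosen only to show why a strictly per-page view needs care when one element has content on two pages.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical data: each structure element lists the marked-content
# sections it owns as (page_number, mcid) pairs.
elements: Dict[str, List[Tuple[int, int]]] = {
    "P-1": [(1, 0)],
    "P-2": [(1, 1), (2, 0)],  # this paragraph continues onto page 2
}


def content_by_page(
    elements: Dict[str, List[Tuple[int, int]]]
) -> Dict[int, Dict[str, List[int]]]:
    """Group each element's marked-content IDs by the page they sit on."""
    pages: Dict[int, Dict[str, List[int]]] = defaultdict(dict)
    for name, refs in elements.items():
        for page_number, mcid in refs:
            pages[page_number].setdefault(name, []).append(mcid)
    return {page: dict(found) for page, found in pages.items()}


by_page = content_by_page(elements)
```

Here `P-2` shows up under both page 1 and page 2. A scheme that attached each element to a single page would drop the `(2, 0)` reference, so the objects in that marked-content section on page 2 would look as if they had no structure element at all.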

jsvine commented 8 months ago

Ah, interesting. I think I understand in theory, but not quite sure in practice — so looking forward to that test case. Thanks!