Open NathanTech7713 opened 1 year ago
Hi @NathanTech7713 — thanks for your interest in this library, and for this suggestion. For my own notes and for others who may be less familiar:
"Tagged PDFs" are documents that use the PDF spec's features for enabling accessibility, by adding semantic markup in the form of "tags": https://www.pdfa.org/wp-content/uploads/2019/06/TaggedPDFBestPracticeGuideSyntax.pdf
pdfminer.six, the parser on which pdfplumber depends, does seem to have some functionality for identifying and extracting these tags. However, that functionality comes from the TagExtractor class, which is a different subclass of PDFDevice from the one we use. Still, it's helpful to know that there's some functionality there, even if we'd have to patch it in.
And some general questions: What should the output of this extraction look like? A nested tree of tags? Something else?
@NathanTech7713: Do you have any examples of other PDF extraction libraries that have a feature like this, and which you think would provide a useful model?
Hi! I was about to make this same feature request. I've done a bit of exploration here as I am working on extracting the structure from PDFs and, obviously, it makes sense to use explicit structure if it's there... well, sort of.
Most of the libraries that support tagged PDF are closed-source, but some functionality to extract it exists in Poppler and pdf.js, and you can see the tags by running pdfinfo -struct on a PDF (or pdfinfo -struct-text to see the content of the tags as well). Unfortunately the generation of structure and tags is, to put it mildly, highly variable across different PDF authoring tools, and I haven't come remotely close to understanding the (very convoluted) specification. The W3C has a nice overview of logical structure and tagged PDF here: https://www.w3.org/TR/2014/NOTE-WCAG20-TECHS-20140408/pdf_notes.html
Basically there are a couple of moving parts, which you can find starting in section 10.5 of the PDF 1.7 spec (or maybe section 14, if you have the Adobe/ISO document?):
Marked content sections, which are what pdfminer.six will give you if you use TagExtractor, which I think we can agree is a sub-optimal API (I am not really sure how it could be integrated in pdfplumber). These are the sections of text/objects/whatever in the PDF that correspond to structural units. Sometimes they have meaningful tags attached directly to them (notably, LibreOffice will do this) but usually they are all tagged as "P" and you have to look in the "logical structure" to get more useful information.
The logical structure, which is defined by the StructTreeRoot entry in the document catalog and the RoleMap, ParentTree and sometimes ClassMap entries under it. It's a horrible, cyclical mess of PDFObject references (notably, pdfminer.six will crash with a stack overflow trying to resolve it). At some point (and there are multiple ways this can happen) you will end up at a leaf node which gives you an MCID that you can use to refer back to the marked content sections noted above. But they might be indirected through the ParentTree because Reasons.
See https://github.com/dhdaines/alexi/blob/main/scripts/pdfstructure.py for a quick-and-dirty script (based on pdfminer.six code) which prints MCID sections and tags and attempts (but doesn't really succeed) to resolve the structure tree, and https://github.com/dhdaines/alexi/blob/main/test/data/pdf_structure.pdf for a test document with structure and tags.
What I would find minimally useful (but I can't speak for the original author of this issue) would be:
extract_words, and some way to place words from extract_words within a given content section (yes, this could just be done with the bounding box)
Woops! Got to be honest, thought I replied and then didn't!
@dhdaines sums it up quite well in what I am also hoping for.
I think I mentioned quite a while ago that I eventually want to put together an accessible PDF reader for screen reader (totally blind) users of Windows, so accessibility tagging would be a solid way of identifying structure.
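The bounding-box approach mentioned earlier in the thread, for placing words from extract_words within a content section, could look roughly like the sketch below. The word dictionaries use the x0/x1/top/bottom keys that pdfplumber words carry, but the `sections` input (one bounding box per MCID) and the `assign_words` function are assumptions made for illustration, not an existing API.

```python
from typing import Any, Dict, List


def assign_words(
    words: List[Dict[str, Any]], sections: List[Dict[str, Any]]
) -> Dict[int, List[str]]:
    """Group word texts by the first section whose bbox contains the word's centre.

    Each section is assumed to look like
    {"mcid": int, "tag": str, "bbox": (x0, top, x1, bottom)}.
    Words whose centre falls in no section are left unassigned.
    """
    assigned: Dict[int, List[str]] = {s["mcid"]: [] for s in sections}
    for word in words:
        cx = (word["x0"] + word["x1"]) / 2
        cy = (word["top"] + word["bottom"]) / 2
        for section in sections:
            x0, top, x1, bottom = section["bbox"]
            if x0 <= cx <= x1 and top <= cy <= bottom:
                assigned[section["mcid"]].append(word["text"])
                break
    return assigned
```

Using the centre point rather than full containment is a design choice here: it tolerates words that poke slightly outside a section's box, at the cost of ambiguity when sections overlap.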
Thank you both, these are very helpful notes/context. I can't promise I'll get to this soon, but it does seem worth trying to add.
If it helps I can make a preliminary PR with something like what I mentioned above (extraction of marked content sections + structure tree parsing)
@dhdaines Thanks for the offer! Is there a particular subset of this functionality that would be easiest to start trying to integrate into pdfplumber? (I.e., require the least modification of existing code / least performance impact.)
At first glance, extracting the structure tree is relatively easy and can be done on demand, as it's all in the document catalog. Linking it to the MCIDs might have more of a performance impact, at least with pdfminer.six, since it seems like we have to decode and parse the entire document to get them, even for a single page, but I could be mistaken about this!
Thanks! That sounds like a reasonable place to start. I suppose we could expose that similarly to how we do with Page.annots, i.e., outside the main parsing function?
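To make the "on demand, outside the main parsing function" idea concrete, here is a minimal sketch of a lazily computed attribute. The `LazyPage` class, its `structure_tree` property and the loader callback are hypothetical names for illustration only and do not reflect how pdfplumber's Page is actually implemented.

```python
from functools import cached_property
from typing import Any, Callable, Dict, List


class LazyPage:
    """Hypothetical page wrapper: the structure tree is parsed only when asked for."""

    def __init__(self, load_structure: Callable[[], List[Dict[str, Any]]]) -> None:
        # The loader stands in for whatever reads the tree from the document catalog.
        self._load_structure = load_structure

    @cached_property
    def structure_tree(self) -> List[Dict[str, Any]]:
        # Runs on first access only; the result is cached on the instance,
        # so ordinary text extraction never pays for structure parsing.
        return self._load_structure()
```

With this shape, users who never touch `structure_tree` incur no extra cost, and repeated accesses reuse the first result.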
The pypdfium2 interface to the underlying pdfium API may be useful for this: https://pdfium.googlesource.com/pdfium/+/refs/heads/main/public/fpdf_structtree.h
Actually this is quite easy. I should have a PR for you tonight or tomorrow, I hope.
Ready for review, see PR above. I'll test it more on my PDFs of interest, but it is functional and somewhat documented, see docs/structure.md and tests/test_structure.py for examples.
Many thanks, @dhdaines, and a particular thanks for the documentation. It might take me a little while to review the PR, due to other workload and me being relatively new to the topic/feature, but on first glance, it seems like a helpful contribution.
Now that #961 and #963 are merged, is this issue all clear to close? Or are there other features that would need to be in place for us to say we've handled accessibility tagging?
Thanks! There is at least one small add-on to consider - #961 doesn't give access to the tag attributes, only the tag name. These allow you to distinguish between different types of artifacts (header, footer, etc).
I'm not sure if we want to add them as a dictionary-valued attribute for each object in a marked content section, as this could produce large outputs (it shouldn't be a huge problem for memory consumption since it's the same dictionary...)
"Tagged PDF" is a fairly vaguely defined standard (or perhaps I just don't fully understand it yet) so there may be other things too.
Thanks, @dhdaines. A couple of follow-up questions:
> I'm not sure if we want to add them as a dictionary-valued attribute for each object in a marked content section
Could you share an example of what this would look like?
> as this could produce large outputs
I agree with the general inclination here. Could we have it both ways and allow users to opt in to this additional output?
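One possible shape for the opt-in attribute output discussed here is sketched below. It is only an assumption about what such output might look like, not what #961 produces: the `tag_objects` function, the `include_attrs` flag and the layout of the `sections` input are all hypothetical.

```python
from typing import Any, Dict, List


def tag_objects(
    sections: List[Dict[str, Any]], include_attrs: bool = False
) -> List[Dict[str, Any]]:
    """Flatten marked content sections into per-object rows.

    Each section is assumed to look like
    {"mcid": int, "tag": str, "attrs": dict, "objects": [dict, ...]}.
    """
    rows: List[Dict[str, Any]] = []
    for section in sections:
        for obj in section["objects"]:
            row = dict(obj, tag=section["tag"], mcid=section["mcid"])
            if include_attrs:
                # Every row in a section references the same dict, so memory
                # use stays small even though serialized output grows.
                row["attrs"] = section["attrs"]
            rows.append(row)
    return rows
```

With `include_attrs=False` the rows carry only the tag name and MCID; with `include_attrs=True` each row also exposes attributes such as an artifact's type and subtype, which is what would let a caller tell headers from footers.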
Hi there,
I was wondering whether, when the dev is particularly bored, you would mind considering implementing extraction of accessibility tagging?
Thank you