jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.1k stars, 625 forks

Accessibility tagging #909

Open NathanTech7713 opened 1 year ago

NathanTech7713 commented 1 year ago

Hi there,

I was wondering whether, when the dev is particularly bored, you would consider implementing extraction of accessibility tagging?

Thank you.

jsvine commented 1 year ago

Hi @NathanTech7713 — thanks for your interest in this library, and for this suggestion. For my own notes and for others who may be less familiar:

And some general questions: What should the output of this extraction look like? A nested tree of tags? Something else?

@NathanTech7713: Do you have any examples of other PDF extraction libraries that have a feature like this, and which you think would provide a useful model?

dhdaines commented 1 year ago

Hi! I was about to make this same feature request. I've done a bit of exploration here as I am working on extracting the structure from PDFs and, obviously, it makes sense to use explicit structure if it's there... well, sort of.

Most of the libraries that support tagged PDF are closed-source, but some functionality to extract it exists in Poppler and pdf.js, and you can see the tags by running pdfinfo -struct on a PDF (or pdfinfo -struct-text to see the content of the tags as well). Unfortunately the generation of structure and tags is, to put it mildly, highly variable across different PDF authoring tools, and I haven't come remotely close to understanding the (very convoluted) specification. The W3C has a nice overview of logical structure and tagged PDF here: https://www.w3.org/TR/2014/NOTE-WCAG20-TECHS-20140408/pdf_notes.html
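For anyone who wants to script the Poppler check mentioned above, here is a minimal sketch that shells out to `pdfinfo`. It assumes Poppler's `pdfinfo` is installed and on the PATH; the helper names `pdfinfo_struct_cmd` and `dump_structure` are made up for illustration.

```python
import shutil
import subprocess
from typing import List


def pdfinfo_struct_cmd(pdf_path: str, with_text: bool = False) -> List[str]:
    """Build the pdfinfo command line for dumping the structure tree."""
    # -struct prints the tags only; -struct-text also prints their content.
    flag = "-struct-text" if with_text else "-struct"
    return ["pdfinfo", flag, pdf_path]


def dump_structure(pdf_path: str, with_text: bool = False) -> str:
    """Run pdfinfo and return its textual structure dump."""
    if shutil.which("pdfinfo") is None:
        raise RuntimeError("pdfinfo (Poppler) was not found on PATH")
    result = subprocess.run(
        pdfinfo_struct_cmd(pdf_path, with_text),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `dump_structure("some_tagged.pdf", with_text=True)` should return the same text that running the command by hand prints; for an untagged PDF the structure portion is likely to be empty.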

Basically there are a couple of moving parts, which you can find starting in section 10.5 of the PDF 1.7 spec (or maybe section 14, if you have the Adobe/ISO document?):

See https://github.com/dhdaines/alexi/blob/main/scripts/pdfstructure.py for a quick-and-dirty script (based on pdfminer.six code) which prints MCID sections and tags and attempts (but doesn't really succeed) to resolve the structure tree, and https://github.com/dhdaines/alexi/blob/main/test/data/pdf_structure.pdf for a test document with structure and tags.

dhdaines commented 1 year ago

What I would find minimally useful (but I can't speak for the original author of this issue) would be:

NathanTech7713 commented 1 year ago

Woops! Got to be honest, thought I replied and then didn't!

@dhdaines sums it up quite well in what I am also hoping for.

I think I mentioned quite a while ago that I eventually want to put together an accessible PDF reader for screen reader (totally blind) users of Windows, so accessibility tagging would be a solid way of identifying structure.

jsvine commented 1 year ago

Thank you both for these very helpful notes/context. I can't promise I'll get to this soon, but it does seem worth trying to add.

dhdaines commented 1 year ago

If it helps I can make a preliminary PR with something like what I mentioned above (extraction of marked content sections + structure tree parsing)

jsvine commented 1 year ago

@dhdaines Thanks for the offer! Is there a particular subset of this functionality that would be easiest to start trying to integrate into pdfplumber? (I.e., require the least modification of existing code / least performance impact.)

dhdaines commented 1 year ago

At first glance, extracting the structure tree is relatively easy and can be done on demand, as it's all in the document catalog. Linking it to the MCIDs might have more of a performance impact, at least with pdfminer.six, since it seems like we have to decode and parse the entire document to get them, even for a single page. But I could be mistaken about this!
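As a rough illustration of why the tree itself is cheap to extract: structure elements are ordinary dictionaries with a structure type under /S and kids under /K, where a kid can be a nested element, an integer MCID, or a marked-content reference (/Type /MCR). The sketch below walks such a tree using plain Python dicts as stand-ins for already-resolved PDF objects. The function name `walk_struct` and the output keys are hypothetical, and real code would also need to resolve indirect references and handle object references (/OBJR).

```python
from typing import Any, Dict


def walk_struct(elem: Dict[str, Any]) -> Dict[str, Any]:
    """Convert one resolved structure element into a plain nested dict."""
    node: Dict[str, Any] = {"type": elem.get("S"), "mcids": [], "children": []}
    kids = elem.get("K", [])
    if not isinstance(kids, list):
        kids = [kids]  # /K may hold a single kid instead of an array
    for kid in kids:
        if isinstance(kid, int):
            # A bare integer is a marked-content ID on the element's page.
            node["mcids"].append(kid)
        elif isinstance(kid, dict) and kid.get("Type") == "MCR":
            # A marked-content reference carries the MCID explicitly.
            node["mcids"].append(kid["MCID"])
        elif isinstance(kid, dict) and "S" in kid:
            # Anything with a structure type is a nested element.
            node["children"].append(walk_struct(kid))
    return node


sample = {
    "S": "Document",
    "K": [
        {"S": "H1", "K": 0},
        {"S": "P", "K": [{"Type": "MCR", "MCID": 1}, 2]},
    ],
}
tree = walk_struct(sample)
```

The walk never touches page content streams, which matches the point above: only the step of matching these MCIDs to actual page objects requires parsing the pages.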

jsvine commented 1 year ago

Thanks! That sounds like a reasonable place to start. I suppose we could expose that similarly to how we do with Page.annots — i.e., outside the main parsing function?
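The idea of exposing the tree outside the main parsing function can be sketched as a lazily computed property: nothing is parsed until the attribute is first read, and the result is cached afterwards. `PageSketch`, `structure_tree`, and the loader callback are hypothetical names for illustration only, not pdfplumber's actual API.

```python
from functools import cached_property
from typing import Any, Callable, Dict, List


class PageSketch:
    """Toy stand-in for a page object with an on-demand structure tree."""

    def __init__(self, load_structure: Callable[[], List[Dict[str, Any]]]):
        # The loader is stored, not called, so construction stays cheap.
        self._load_structure = load_structure

    @cached_property
    def structure_tree(self) -> List[Dict[str, Any]]:
        # Runs once, on first access; later reads reuse the cached value.
        return self._load_structure()
```

Users who never read `structure_tree` pay nothing for it, which addresses the performance concern for the common text and table extraction paths.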

dhdaines commented 1 year ago

The pypdfium2 interface to the underlying pdfium API may be useful for this: https://pdfium.googlesource.com/pdfium/+/refs/heads/main/public/fpdf_structtree.h

dhdaines commented 1 year ago

Actually this is quite easy. I should have a PR for you tonight or tomorrow, I hope.

dhdaines commented 1 year ago

Ready for review, see PR above. I'll test it more on my PDFs of interest, but it is functional and somewhat documented, see docs/structure.md and tests/test_structure.py for examples.

jsvine commented 1 year ago

Many thanks, @dhdaines, and a particular thanks for the documentation. It might take me a little while to review the PR, due to other workload and me being relatively new to the topic/feature, but on first glance, it seems like a helpful contribution.

jsvine commented 8 months ago

Now that #961 and #963 are merged, is this issue all clear to close? Or are there other features that would need to be in place for us to say we've handled accessibility tagging?

dhdaines commented 8 months ago

Thanks! There is at least one small add-on to consider - #961 doesn't give access to the tag attributes, only the tag name. These allow you to distinguish between different types of artifacts (header, footer, etc).

I'm not sure if we want to add them as a dictionary-valued attribute for each object in a marked content section, as this could produce large outputs (it shouldn't be a huge problem for memory consumption since it's the same dictionary...)
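The memory-versus-output distinction can be shown with a small sketch. Every object in a marked content section can point at one shared attribute dictionary in memory, but the dictionary is written out once per object as soon as the objects are serialized. The object layout here (`text`, `tag`, `attrs` keys) is an assumption for illustration, not the format the PR uses.

```python
import json
from typing import Any, Dict, List

# One attribute dictionary for a page-header artifact.
artifact_attrs = {"Type": "Pagination", "Subtype": "Header"}

# Every char in the section references the same dict object.
chars: List[Dict[str, Any]] = [
    {"text": ch, "tag": "Artifact", "attrs": artifact_attrs} for ch in "Page 1"
]


def count_distinct_attr_dicts(objs: List[Dict[str, Any]]) -> int:
    """Count how many separate attrs dicts exist in memory."""
    return len({id(obj["attrs"]) for obj in objs})


# Serializing copies the attributes into every object's output.
serialized = json.dumps(chars)
```

In memory there is a single dictionary no matter how many chars share it, while `serialized` repeats the attributes for each char, which is where the large outputs would come from.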

"Tagged PDF" is a fairly vaguely defined standard (or perhaps I just don't fully understand it yet) so there may be other things too.

jsvine commented 8 months ago

Thanks, @dhdaines. A couple of follow-up questions:

"I'm not sure if we want to add them as a dictionary-valued attribute for each object in a marked content section"

Could you share an example of what this would look like?

"as this could produce large outputs"

I agree with the general inclination here. Could we have it both ways and allow users to opt-in to this additional output?
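One possible shape for an opt-in is sketched below, purely as a basis for discussion. The function `annotate`, its `include_attrs` flag, and the `tag`/`attrs` keys are all hypothetical and are not part of pdfplumber. By default each object only gains the tag name; the attribute dictionary is attached only when the caller asks for it.

```python
from typing import Any, Dict, List, Optional


def annotate(
    objs: List[Dict[str, Any]],
    tag: str,
    attrs: Optional[Dict[str, Any]] = None,
    include_attrs: bool = False,
) -> List[Dict[str, Any]]:
    """Return copies of objs labelled with a tag and, optionally, its attributes."""
    out = []
    for obj in objs:
        new = dict(obj, tag=tag)  # shallow copy plus the tag name
        if include_attrs and attrs is not None:
            new["attrs"] = attrs
        out.append(new)
    return out


header_chars = [{"text": "P"}, {"text": "1"}]
header_attrs = {"Type": "Pagination", "Subtype": "Header"}

compact = annotate(header_chars, "Artifact", header_attrs)
verbose = annotate(header_chars, "Artifact", header_attrs, include_attrs=True)
```

Here `compact` holds objects like `{"text": "P", "tag": "Artifact"}`, while `verbose` adds `"attrs": {"Type": "Pagination", "Subtype": "Header"}` to each, which is enough to tell a header from a footer.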