internetarchive / archive-pdf-tools

Fast PDF generation and compression. Deals with millions of pages daily.
https://archive-pdf-tools.readthedocs.io/en/latest/
GNU Affero General Public License v3.0

Q: accessible tagging/hints? #65

Closed by jrochkind 1 year ago

jrochkind commented 1 year ago

I've really been following in @MerlijnWajer's footsteps as I try to understand the landscape for OCR PDF generation!

I found this video of a conference presentation Merlijn did, which was enormously helpful.

https://youtu.be/DqA1YPfDlhg?t=1110

At the timecode I bookmarked there, Merlijn says:

"they also contain certain hints for screen-readers for people who are blind to be able to still read what's in a PDF"

I think those "certain hints" are probably what's referred to as "tags"?

I am very curious what it is you are putting in, and how you get them in there, presumably through a fully automated process -- what tools you are using to do that, and if they are open source/shared. Or, for the purpose of this repository specifically -- is that included in the recode_pdf process already, or is that coming from another tool?

And same question for PDF/A in general (or maybe that's all you meant?) -- in the presentation you talk about colorspaces and such for PDF/A, is that included in recode_pdf, or is that another tool you use?

Sorry if this isn't a great place to ask this question, feel free to tell me if there's a better discussion forum or way to get in touch with you to learn more about what you are doing. Thank you for all the work you have done on open source tools useful in this area!

jrochkind commented 1 year ago

PPS: This is really not about this issue, but just trying to tell @MerlijnWajer about some things you might be interested in...

In your video presentation, you talk about how Tesseract can only create PDFs as part of scanning, without compression. I am not sure if you are aware of something I only recently discovered: you can also have Tesseract create "text-layer only" PDFs, with the -c textonly_pdf=1 flag. It still outputs PDFs, although they are pretty small ones containing only invisible text. Not as ideal for an engineering pipeline, but it is another option that still lets you apply your own choice of image resolution and compression.
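For anyone who wants to try this, here is a minimal sketch of that invocation wrapped in Python. The helper names (textonly_pdf_command, run_textonly_pdf) and the file names are made up for illustration; the only parts taken from Tesseract itself are the -c textonly_pdf=1 setting and the trailing pdf config name, and it assumes a tesseract binary is on the PATH.

```python
import shutil
import subprocess
from typing import List


def textonly_pdf_command(image_path: str, output_base: str, lang: str = "eng") -> List[str]:
    """Build a tesseract command that writes a PDF holding only the invisible text layer.

    Hypothetical helper: the function name and defaults are illustrative.
    """
    return [
        "tesseract",
        image_path,
        output_base,             # tesseract appends ".pdf" to this base name
        "-l", lang,
        "-c", "textonly_pdf=1",  # omit the page image, keep the invisible text
        "pdf",                   # the "pdf" config selects PDF output
    ]


def run_textonly_pdf(image_path: str, output_base: str, lang: str = "eng") -> str:
    """Run tesseract and return the path of the text-only PDF it wrote."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    subprocess.run(textonly_pdf_command(image_path, output_base, lang), check=True)
    return output_base + ".pdf"
```

The resulting text-only PDF would then still need to be combined with an image layer compressed by whatever tool you prefer; that step is not shown here.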

And one more thing -- from that presentation I just learned from you of the hocrjs tool, which is cool. I wonder if you are aware of this other little-known tool I found, which is really cool and a proof-of-concept of in-browser editing of hocr too. https://github.com/not-implemented/hocr-proofreader https://www.not-implemented.de/hocr-proofreader/

MerlijnWajer commented 1 year ago

Thanks for showing interest - maybe some collaboration is possible here. :-)

As an aside, if you do want to reach out directly, you can find my email in the commits, or at the top of this document: https://archive.org/developers/pdf.html (or here: https://archive.org/developers/ocr.html)

If memory serves, the PDF/UA hints that we currently insert are only basic ones. For example, every page of an Archive.org book has a set of images as "background" (normally just one; with MRC, two plus an alpha layer). To prevent screen readers from even mentioning that these images exist on the page, we write certain tags that make them ignore the images. The other thing we do is add some additional structure to the PDF. There is a function called write_basic_ua which I believe does something like this.
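As an illustration of the general mechanism only (this is not taken from the recode_pdf source, and it may differ from what write_basic_ua actually emits): the PDF specification lets a producer wrap page content in an /Artifact marked-content sequence, which tells assistive technology that the content is not part of the real document text. The sketch below just builds such a content-stream fragment as bytes; the function names and the image name Im0 are hypothetical.

```python
def draw_background_image(name: str, width: float, height: float) -> bytes:
    """Content-stream operators that paint image XObject `name` over a width x height page."""
    ops = f"q\n{width:g} 0 0 {height:g} 0 0 cm\n/{name} Do\nQ"
    return ops.encode("ascii")


def wrap_as_artifact(content: bytes) -> bytes:
    """Wrap content operators in an /Artifact marked-content sequence (BMC ... EMC)."""
    return b"/Artifact BMC\n" + content.strip() + b"\nEMC\n"


# Hypothetical example: a US Letter page with one background image named Im0.
stream = wrap_as_artifact(draw_background_image("Im0", 612, 792))
print(stream.decode("ascii"))
```

A fully tagged PDF needs more than this (a structure tree, a MarkInfo entry in the catalog, and so on), which is part of why better PDF/UA support takes real effort.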

So in short:

I think those "certain hints" are probably what's referred to as "tags"?

Yes, that's correct.

I'd love to add better support for PDF/UA, but it's a complicated topic and requires some dedication.

MerlijnWajer commented 1 year ago

PPS: This is really not about this issue, but just trying to tell @MerlijnWajer about some things you might be interested in...

(Feel free to email)

In your video presentation, you talk about how Tesseract can only create PDFs as part of scanning, without compression. I am not sure if you are aware of something I only recently discovered: you can also have Tesseract create "text-layer only" PDFs, with the -c textonly_pdf=1 flag. It still outputs PDFs, although they are pretty small ones containing only invisible text. Not as ideal for an engineering pipeline, but it is another option that still lets you apply your own choice of image resolution and compression.

Yes, you can make PDFs with just text directly with Tesseract, but that would not fit the Archive.org workflow. I believe I touch on this somewhere in that presentation, although quite lightly. Basically, we do OCR as a separate step from PDF generation, so I had to be able to make PDFs from hOCR files. Tesseract can only make PDFs from its internal in-memory structure, so I rewrote the Tesseract PDF-generating code in Python and made it capable of reading hOCR files.
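To make the "PDFs from hOCR files" step more concrete, here is a rough sketch of the first half of such a pipeline: pulling each word and its bounding box out of an hOCR file so that a PDF writer could later place invisible text at those coordinates. It uses only the Python standard library and is not the code used in this repository; the class name, function name, and sample markup are invented for the example, and it assumes words are marked with the ocrx_word class and a "bbox x0 y0 x1 y1" entry in the title attribute, as Tesseract's hOCR output does.

```python
import re
from html.parser import HTMLParser
from typing import List, Optional, Tuple

BBox = Tuple[int, int, int, int]
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs for every element with class ocrx_word."""

    def __init__(self) -> None:
        super().__init__()
        self.words: List[Tuple[str, BBox]] = []
        self._bbox: Optional[BBox] = None  # bbox of the word currently being read
        self._text: List[str] = []
        self._depth = 0                    # nesting depth inside the current word

    def handle_starttag(self, tag, attrs):
        if self._bbox is not None:
            self._depth += 1               # nested markup inside a word, e.g. <strong>
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        match = BBOX_RE.search(attr.get("title") or "")
        if "ocrx_word" in classes and match:
            self._bbox = tuple(int(v) for v in match.groups())
            self._text = []
            self._depth = 0

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        self.words.append(("".join(self._text).strip(), self._bbox))
        self._bbox = None


def parse_hocr_words(hocr: str) -> List[Tuple[str, BBox]]:
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


SAMPLE = """<div class='ocr_page' title='bbox 0 0 1000 1400'>
<span class='ocr_line' title='bbox 36 92 300 116'>
<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
<span class='ocrx_word' title='bbox 110 92 180 116; x_wconf 91'><strong>world</strong></span>
</span></div>"""

for text, bbox in parse_hocr_words(SAMPLE):
    print(text, bbox)
```

The second half, turning those boxes into an invisible text layer with correct font sizing and coordinate flipping, is where most of the real work is and is deliberately left out.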

And one more thing -- from that presentation I just learned from you of the hocrjs tool, which is cool. I wonder if you are aware of this other little-known tool I found, which is really cool and a proof-of-concept of in-browser editing of hocr too. https://github.com/not-implemented/hocr-proofreader https://www.not-implemented.de/hocr-proofreader/

Yeah, I am aware of proofreader and also thought it was a really neat project. I don't think it has seen any attention recently. On a personal note, what I'd like to support in the Archive.org stack is some kind of way for people to correct OCR results, which is what I was thinking of using hOCR proofreader for.

hocrjs is integrated as a service for Archive.org for viewing purposes only: https://archive.org/services/hocr-view/view?identifier=sim_english-illustrated-magazine_1884-12_2_15

jrochkind commented 1 year ago

Thanks! And that limited tagging is added by recode_pdf? Cool.