internetarchive / archive-pdf-tools

Fast PDF generation and compression. Deals with millions of pages daily.
https://archive-pdf-tools.readthedocs.io/en/latest/
GNU Affero General Public License v3.0

PDF/UA improvements #17

Open MerlijnWajer opened 3 years ago

MerlijnWajer commented 3 years ago

VeraPDF now supports PDF/UA verification:

~/verapdf/verapdf --format xml --flavour ua1 /tmp/test.pdf  > /tmp/out.xml

We should fix the problems that it finds in our PDFs. I suspect that this will also help with the problems that Adobe finds.
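To triage the report, something like the following sketch could list the failed rules. It assumes the XML report contains `rule` elements carrying `status`, `clause` and `testNumber` attributes; that layout should be checked against the veraPDF version in use. The helper names `run_verapdf` and `failed_rules` are made up for illustration.

```python
import subprocess
import xml.etree.ElementTree as ET


def run_verapdf(verapdf_path, pdf_path):
    """Run veraPDF with the flags from the command above; return the raw XML report."""
    # No check=True: a non-zero exit status may just mean that validation
    # failures were found (assumption, verify for your veraPDF version).
    result = subprocess.run(
        [verapdf_path, "--format", "xml", "--flavour", "ua1", pdf_path],
        capture_output=True,
    )
    return result.stdout


def failed_rules(report_xml):
    """Return (clause, testNumber) pairs for every rule marked as failed.

    Assumes <rule status="failed" clause="..." testNumber="..."> elements;
    this attribute layout is an assumption about the report format.
    """
    root = ET.fromstring(report_xml)
    failures = []
    for elem in root.iter():
        # Strip any XML namespace so the match works with or without one.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "rule" and elem.get("status") == "failed":
            failures.append((elem.get("clause"), elem.get("testNumber")))
    return failures
```

Usage would be along the lines of `failed_rules(run_verapdf("verapdf", "/tmp/test.pdf"))`, with the first argument pointing at the local veraPDF launcher.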

This means at least:

MerlijnWajer commented 3 years ago

On alt text: https://stackoverflow.com/questions/34036200/add-alternative-text-for-an-image-in-tagged-pdf-pdf-ua-using-itext

Sections 10.6 and 10.7 of the PDF spec are relevant: https://ghostscript.com/~robin/pdf_reference17.pdf

MerlijnWajer commented 3 years ago

The natural language for text blocks could perhaps be determined from the hOCR lang attributes.
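A minimal sketch of that idea, assuming the hOCR file puts a `lang` attribute directly on its `ocr_par` elements (as Tesseract output typically does). The class and function names are hypothetical and not part of archive-pdf-tools.

```python
from html.parser import HTMLParser


class ParagraphLangs(HTMLParser):
    """Collect the lang attribute of each hOCR paragraph (class "ocr_par")."""

    def __init__(self):
        super().__init__()
        self.langs = {}  # paragraph id -> language code as written in the hOCR

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        # Only paragraphs that declare a language themselves are recorded;
        # inheritance from enclosing elements is not handled in this sketch.
        if "ocr_par" in classes and attrs.get("lang"):
            self.langs[attrs.get("id")] = attrs["lang"]


def paragraph_languages(hocr_text):
    """Return a dict mapping paragraph ids to their declared language."""
    parser = ParagraphLangs()
    parser.feed(hocr_text)
    parser.close()
    return parser.langs
```

The codes found this way (for example `eng`) would likely still need mapping to the language identifiers that the PDF `Lang` entry expects (for example `en`) before being written to the document or its structure elements.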

MerlijnWajer commented 3 years ago

https://blog.adobe.com/en/publish/2017/07/25/accessible-pdfs-in-acrobat-dc-tagging-content-as-an-artifact.html
https://www.uottawa.ca/respect/sites/www.uottawa.ca.respect/files/fss-fixing-accessibility-errors-in-pdfs.pdf