jgm / pandoc

Universal markup converter
https://pandoc.org

Accessibility mode for LaTeX #5409

Open jgm opened 5 years ago

jgm commented 5 years ago

PDFs produced using latex are not accessible. We could introduce a command-line option that causes the latex writer to include annotations for math (perhaps using the unicode fallback or even raw tex), image alt text, and more: http://ctan.math.washington.edu/tex-archive/macros/latex/contrib/oberdiek/accsupp.pdf

This package also includes an option that makes spaces visible to copy and paste (often when you copy from a latex-compiled PDF, spaces disappear).
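
As a rough illustration of the kind of annotation accsupp makes possible, here is a minimal sketch that attaches replacement text to an inline formula. The wording of the ActualText string is only an example chosen for this sketch, and this is not something pandoc's LaTeX writer currently emits.

```latex
\documentclass{article}
\usepackage{accsupp}
\begin{document}
% ActualText is what a screen reader or copy-and-paste should see
% in place of the typeset glyphs. The string here is illustrative.
The relation
\BeginAccSupp{ActualText={E equals m c squared}}%
$E = mc^2$%
\EndAccSupp{}
is well known.
\end{document}
```

A writer option could generate such wrappers automatically, for example from the Unicode fallback or the raw TeX of each formula, as suggested above.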

Structural elements (paragraphs, lists, etc.) need to be tagged, and reading order indicated.

See also: https://www.tug.org/twg/accessibility/ http://web.science.mq.edu.au/~ross/TaggedPDF/ https://tex.stackexchange.com/questions/124291/revisiting-producing-structured-pdfs-from-latex (with information about using ConTeXt)

u-fischer commented 4 years ago

@adityam While using H as structure element with deep nesting is valid pdf, the practice guide (https://www.pdfa.org/resource/tagged-pdf-best-practice-guide-syntax/) recommends not to use it:

> Due to a lack of suitable tools, this structure element is impractical, and its use is not recommended

jmcastinheira commented 3 years ago

I don't understand why (I'm still learning all of this), but pandoc's PDF output seems to have a big problem for screen readers: the spaces between words are missing. You can check this by saving the PDF as text in Acrobat Reader. This happens with LuaLaTeX, XeLaTeX, and ConTeXt. With pdfLaTeX the problem mostly does not occur (some spaces are lost, but not all of them).

I don't know if I should open a new issue for this.

klpn commented 3 years ago

What is the output of Poppler's pdftotext <filename>.pdf? Are the files PDF/A-conforming (you can check the metadata with `pdfinfo` and verify with veraPDF)?

jmcastinheira commented 3 years ago

My file is made with the pandoc template, so it is not PDF/A. But even if I use \usepackage[a-2b]{pdfx}, I still have this problem.

If I use pdftotext there is no problem. Maybe it's an Adobe issue, but it's important because some screen readers seem to rely on Adobe's conversion.

I used your tagged pdf and this pdf does not have this problem.

u-fischer commented 3 years ago

tagged.pdf is a properly tagged pdf, and it has been made with some lua code which inserts real space chars between words.

jgm commented 3 years ago

I've had some success using @u-fischer's tagpdf package. Try putting this near the top of your template.

\ifluatex
  \usepackage[luamode]{tagpdf}
  \tagpdfsetup{
      activate-all=true,
      interwordspace=true
   }
\fi

And compile with --pdf-engine=lualatex. At least this will help with the spaces-between-words problem.

u-fischer commented 3 years ago

@jgm But as a warning: tagpdf is experimental, and I really mean that. The next version will require some new experimental code, which isn't yet compatible with every package, so tagpdf won't work with all documents for some time.

jmcastinheira commented 3 years ago

Thanks, but that didn't work for me, sorry. (I used \ifluatex instead of \iflualatex ).

Don't worry, I see it's not an easy problem, I'll use pdftex for now; and I will wait for the improvements of tagpdf.

Thanks to both of you

MarkWahlsten commented 3 years ago

I'm not sure if this is a useful example or not, but the web app PAVE uses itext...?

PAVE validates, autotags, and allows manual adding/removal of tags, alt text, reading order and metadata (like setting language and title).

Will support for PDF/UA-1 output be on the cards?

PDF/UA Reference Suite 1.1 – PDF Association
Technical Implementation Guide PDF/UA – AIIM

jgm commented 3 years ago

@MarkWahlsten - this thread is specifically about LaTeX output; itext looks like a non-LaTeX PDF generation library, so it's not really relevant here. (Note pandoc already has lots of non-LaTeX routes to PDF; see the different possible arguments for --pdf-engine. If I recall, some of these already support tagging more fully than the LaTeX route. But the LaTeX route gives the best output, especially if your document contains math, so we'd really like a solution here.)

On pdfa see also #3215 and #5608.

jmcastinheira commented 3 years ago

@u-fischer, @jgm I was finally able to solve this spaces-between-words problem by removing the microtype package from pandoc's template and using

\usepackage[luamode]{tagpdf}
\tagpdfsetup{
  activate-all=true,
  interwordspace=true
}

The microtype package causes some problems when it interacts with tagpdf. But it works if you load microtype before tagpdf.
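
A minimal sketch of that load order, assuming the tagpdf version current at the time of this comment (later releases may need a different setup) and using the iftex package to provide \ifluatex:

```latex
\documentclass{article}
\usepackage{iftex}      % provides \ifluatex
\usepackage{microtype}  % loaded first, the order reported to work here
\ifluatex
  \usepackage[luamode]{tagpdf}
  \tagpdfsetup{
    activate-all=true,
    interwordspace=true % insert real space characters between words
  }
\fi
\begin{document}
Some text with spaces between the words.
\end{document}
```

Compile with lualatex, then save the PDF as text in Acrobat Reader (or run pdftotext) to check whether the spaces survive.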

Thanks again.

tarleb commented 1 year ago

We've added a section on PDF accessibility to the manual; it summarizes the currently available methods to produce tagged PDFs.

u-fischer commented 1 year ago

> We've added a section on PDF accessibility to the manual; it summarizes the currently available methods to produce tagged PDFs.

For a number of documents (those using the standard classes and a restricted set of packages), tagging is readily available, and with the June release that number will grow. Users who want to try it should have a current LaTeX, ideally use lualatex-dev, and start their document with \DocumentMetadata{testphase={phase-III}} or even (if they are daring) \DocumentMetadata{testphase={phase-III,math}}. Feedback can be given at https://github.com/latex3/latex2e/discussions/1010 or at the tagpdf GitHub repository, https://github.com/u-fischer/tagpdf.
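
For anyone who wants to try this outside pandoc first, a minimal test document might look like the sketch below. The language value and the body text are placeholders chosen for illustration; only the \DocumentMetadata line with testphase comes from the advice above.

```latex
% \DocumentMetadata must come before \documentclass.
\DocumentMetadata{
  lang      = en-US,        % placeholder language tag
  testphase = {phase-III}   % add ",math" inside the braces if you are daring
}
\documentclass{article}
\begin{document}
\section{Introduction}
A short paragraph, followed by a list.
\begin{itemize}
  \item First item
  \item Second item
\end{itemize}
\end{document}
```

Compile with lualatex-dev (or a current lualatex). If the compilation fails, a reduced example like this one is a good starting point for a report.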

tarleb commented 1 week ago

Just a quick report: after reading the prototype usage instructions I tried to generate a tagged PDF via pandoc and LaTeX and got good results on many documents.

Here's my setup, using Docker images for better reproducibility. Dockerfile:

FROM pandoc/latex:latest
RUN tlmgr install l3experimental pdfmanagement-testphase tagpdf

Built a Docker image with

docker build --tag pandoc/latex:tagging .

and then used it with

docker run --rm -t -v $PWD:/data -u $(id -u):$(id -g) pandoc/latex:tagging \
    --pdf-engine=lualatex \
    --to=tagged-pdf.lua \
    ...

The tagged-pdf.lua just adds the code snippet given in the article to the default template.

Template = [[\DocumentMetadata{
$if(lang)$
  lang        = $lang$,
$endif$
  pdfversion  = 2.0,
  pdfstandard = ua-2,
  pdfstandard = a-4f, %or a-4
  testphase   =
   {phase-III,
    title,
    table,
    math,
    firstaid}
}
]] .. pandoc.template.default 'latex'

function Writer(doc, opts)
  return pandoc.write(doc, 'latex', opts)
end

This seems to work nicely for many documents, but some docs fail with a message like ! Package tagpdf Error: there is no open structure on the stack.

Given the moderate success, I think it would be justified to add a snippet like that to the default template if the pdfa variable is set, similar to what we do for ConTeXt.

jgm commented 1 week ago

> I think it would be justified to add a snippet like that to the default template if the pdfa variable is set, similar to what we do for ConTeXt.

Will this cause any problems if you don't have the latest latex3 tech and accessibility packages installed?

tarleb commented 1 week ago

Setting the pdfa variable would then cause a LaTeX compilation failure. So it would have to be well-documented.

u-fischer commented 1 week ago

> if the pdfa variable is set

I wouldn't connect tagging to a general pdfa variable. While some PDF/A standards also require a tagged PDF, archivable and accessible are nevertheless two different things. Also, the tagging code still errors in various cases, and it would be a pain for users who only want PDF/A-2b or similar to have to handle that.
I would suggest a new, dedicated variable, e.g. tagging. Enabling tagging should really be an explicit choice and not be sneaked into existing user code.
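
A sketch of what such an opt-in could look like at the top of a LaTeX template. The variable name tagging is hypothetical (taken from the suggestion above, not an existing pandoc variable), and the metadata keys are a reduced version of the snippet posted earlier in this thread:

```latex
$if(tagging)$
% Emitted only when the (hypothetical) tagging variable is set.
\DocumentMetadata{
$if(lang)$
  lang      = $lang$,
$endif$
  testphase = {phase-III}
}
$endif$
\documentclass{article}
```

With a guard like this, documents built without the variable would compile exactly as before, and a user would opt in explicitly, for example with -V tagging=true on the command line.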

> --pdf-engine=lualatex

lualatex-dev will probably give better results. The system should be as current as possible as this is work in progress.

tarleb commented 1 week ago

Just to clarify: the pdfa variable currently has no effect on LaTeX; it only works with ConTeXt. But I agree that a different variable name might be better.

bpj commented 1 week ago

Yes, please use another variable.

adunning commented 6 days ago

Could I propose that the standard PDF output should be accessible by default, once the LaTeX project have finished their testing phase?

u-fischer commented 6 days ago

@tarleb

> This seems to work nicely for many documents, but some docs fail with a message like ! Package tagpdf Error: there is no open structure on the stack.

If you have LaTeX examples showing errors, you can report them at the tagging-project GitHub.