jgm / pandoc

Universal markup converter
https://pandoc.org

Accessibility mode for LaTeX #5409

Open jgm opened 5 years ago

jgm commented 5 years ago

PDFs produced using latex are not accessible. We could introduce a command-line option that causes the latex writer to include annotations for math (perhaps using the unicode fallback or even raw tex), image alt text, and more: http://ctan.math.washington.edu/tex-archive/macros/latex/contrib/oberdiek/accsupp.pdf

This package also includes an option that makes spaces visible to copy and paste (often when you copy from a latex-compiled PDF, spaces disappear).

Structural elements (paragraphs, lists, etc.) need to be tagged, and reading order indicated.

See also: https://www.tug.org/twg/accessibility/ http://web.science.mq.edu.au/~ross/TaggedPDF/ https://tex.stackexchange.com/questions/124291/revisiting-producing-structured-pdfs-from-latex (with information about using ConTeXT)

frastlin commented 5 years ago

Note that tagged PDFs are starting to be required at conferences such as SIG Access and ICAD. Governments around the world, including the United States, Ontario, Australia, the European Union, and many others, require at minimum that all government PDFs be properly tagged. This means any university receiving government money in the U.S. needs to have all its content be WCAG compliant. It also means that if Pandoc has no way to produce properly tagged PDFs, it will not be legally usable by any institution that falls under the above mandates. I would rate this as an extremely high priority, as the U.S. started requiring accessible PDFs from all government and entities receiving government money in January 2018, and the EU started requiring any government sector website to have only accessible PDFs starting on September 23, 2018. So millions of PDFs are affected by these requirements.

jgm commented 5 years ago

Agreed, it's an important issue. It also comes up for materials distributed in connection with courses. I'm motivated to make it easier to produce accessible PDFs using pandoc, but I need some guidance on the LaTeX side.


frastlin commented 5 years ago

Note that I've found none of the PDFs produced by Pandoc to be accessible, even from the HTML to PDF engines. Apparently PDFLib produces tagged PDFs, but that is it. The only way I have found to create accessible PDFs from Pandoc is to use Microsoft Word or Open Office to generate the accessible PDF.

frastlin commented 5 years ago

Yes, if your university receives government money, or has an internal mandate to be accessible, you're required to have accessible content. There are 2 options with Pandoc:

  1. Produce HTML or Epub, which are accessible (with proper formatting) right out of the box with Pandoc.
  2. Use Word or Open Office (make sure "Tagged PDFs" is checked).

mb21 commented 5 years ago

btw. PDF/A was already brought up once, and implemented using the context writer, see https://github.com/jgm/pandoc/issues/3215

adunning commented 5 years ago

There's no question that this is important, but it needs more support to complete the LaTeX implementation; @u-fischer has been doing some great work with https://ctan.org/pkg/tagpdf.

jgm commented 5 years ago

Relevant pandoc-discuss thread

frastlin commented 5 years ago

Just received this information from another Pandoc user on accessibility-meta.sty:

Revisiting producing structured PDF from LaTeX (2015) -- provides some useful tips on creating hopefully accessible PDFs with accessibility-meta.sty. There is a link to GitHub but it no longer works. I will provide a working one below the following link, which is from Stack Exchange. If you are using Firefox you can cut out all of the clutter by pressing either F9 or Control+Alt+R, depending on whether you are on Windows or Linux. If you are on a Mac, I seem to remember the command being Command + Shift + R. I suspect you already know this though. :) https://tex.stackexchange.com/questions/124291/revisiting-producing-structured-pdfs-from-latex

Andy Clifton's Github repo for accessibility-meta.sty is: https://github.com/AndyClifton/AccessibleMetaClass

He calls this meta-class now so things may have changed somewhat. I should warn you that the most recent commit appears to be from 2 years ago. I should also say that I have not tried this myself recently.

jgm commented 5 years ago

I've tried using accessibility-meta.sty but without any success.

u-fischer commented 5 years ago

@frastlin accessibility-meta doesn't work with luatex, and with pdflatex you need to set all page breaks manually, which is quite a problem. Also, it isn't really extensible, e.g. to specific journal classes.

frastlin commented 5 years ago

Good to know. So it is looking like tagpdf is the best option for now. I would be more than happy to beta test the UX of tagged PDFs from Pandoc using tagpdf from a screen reader's perspective. I know almost nothing about LaTeX, so any testing I'll do will be from Markdown or HTML.

u-fischer commented 5 years ago

@frastlin I would be grateful if you could check the documentation (http://mirrors.ctan.org/macros/latex/contrib/tagpdf/tagpdf.pdf) and give some feedback. (I know that it has issues, but I find it difficult to judge how serious they are.)

frastlin commented 5 years ago

I opened it and here are my comments:

  1. when I opened it, the first message I got was: "Cannot extract the embedded font 'OZCXQN+LMSans10-Bold'. Some characters may not display or print correctly."
  2. Acrobat does not ask me how to read the document, so first check passed.
  3. Love the headings!
  4. In the table of contents, the 1 doesn't have a link when all the other numbers do. I'm not sure why the numbers carry the links when the name of the heading is the natural label. Normally, in manuals, Word tables of contents, and Pandoc tables of contents, the whole name of the heading is the link. When it is just the number, it's not always clear whether the number comes before or after the label, so I would much prefer the whole name to be the link. I would also like the table of contents to be in a list. Here is what I see now:
1.
Introduction
link 2
1.1.
Tagging and accessibility...............................
link 3
1.2.
Engines and modes..................................
link 3
1.3.
References.......................................

(I added "Link" before the linked items). Here is what I would like:

List with 4 items
link 1. Introduction
link 1.1. Tagging and accessibility...............................
link 1.2. Engines and modes..................................
Link 1.3. References.......................................

Also note that links don't do anything when clicked.

  5. I'm not seeing links for references. I see the [1], but it's not a link.
  6. The list at 1.4. Validation, has the • on another line than the text, so it looks like:

    • One must check that the pdf is syntactically correct. It is rather easy to create broken pdf: e.g. if a chunk is opened on one page but closed on the next page.

Rather than:

• One must check that the pdf is syntactically correct. It is rather easy to create broken pdf: e.g. if a chunk is opened on one page but closed on the next page.

  7. 2.2. Setup and activation has a list that has no bullets, dashes, or numbers to differentiate the list items, but I can tell it's a list with 15 items.
  8. I like the alt text: "PAC3 report" which is the first graphic.

This is very good, and I would use it today if I could! I would like to test tables if you could give me a document with tables.

u-fischer commented 5 years ago

@frastlin thank you very much for the comments. I copied them to https://github.com/u-fischer/tagpdf/issues/15 and commented there as this is not really a pandoc question.

bulrush15 commented 5 years ago

I have experience making PDFs readable by the computer voice so feel free to contact me. Here is what I posted on the pandoc mailing list:

In 2012 I worked with US public school student standardized tests in PDF format that had to be read by the computer voice for people with visual disabilities. Large-print PDFs were not enough for them. I can't remember which law required this for US states, but it was a requirement for every US public school. What we discovered was that all the text in a PDF is in a random order when you look at the actual internal structure of the PDF, so the computer read the text in a random order. I don't think this has changed in the PDF internal structure. What that meant for US states is that we had to reorder every word in the PDF by hand, which was enormously expensive and time-consuming.

I'm not sure what program made the PDF, all we received was the PDF to work with. It could have been from Quark as Quark is infamous for putting elements in random order when you export to a text file or Excel file. Maybe a PDF made from MS Word would be in a better order.

If you use Quark, you are severely limited in what you can do with that data later. If you want to export it to a text file and do something with the data, a lot of time and expense will be spent cleaning the data up first and putting it in a proper order and consistent form. (My daily paid job is processing text files from various applications.)

klpn commented 5 years ago

If you use --pdf-engine=context, a tagged PDF is produced by default. Moreover, the pdfa option creates a PDF/A-1b by default, but if the option format=PDF/A-1b:2005, given to setupbackend in the ConTeXt template, is changed to e.g. PDF/A-2a, a PDF/A-2a (whose requirements include tagging) is produced instead. I have succeeded in validating files produced this way against the PDF/A-2a profile in veraPDF (the EU Preforma Project standard validator).

bulrush15 commented 5 years ago

@klpn If we produce a PDF/A-2a will the words be read by the computer in the proper order?

frastlin commented 5 years ago

A quick way to see how the computer reads the order is to select all and paste the output into a text file. The more difficult tags like heading, link, and table, need a viewer to check. But for headings, the text should be on its own line, similar if you paste the following content into a text editor:

Test Heading 1

This text will be on the line below the heading if you paste it into a text editor. If you have a PDF that is not tagged, and you don't have a program that can view the tags, then the heading will be on the same line as this text.
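
The copy-and-paste check described above can be partly automated. The following is a minimal Python sketch, assuming the text has already been extracted from the PDF (for example by select-all and paste); the function name `headings_on_own_line` is hypothetical and not part of any tool mentioned in this thread. It only checks line breaks, so it is a rough heuristic and says nothing about whether the tags themselves are correct.

```python
def headings_on_own_line(extracted_text, headings):
    """Map each expected heading to True if it appears alone on a line."""
    lines = [line.strip() for line in extracted_text.splitlines()]
    return {heading: heading in lines for heading in headings}


# Text as it might be pasted from a tagged and from an untagged PDF.
tagged = "Test Heading 1\nThis text will be on the line below the heading."
untagged = "Test Heading 1 This text will be on the same line as the heading."

print(headings_on_own_line(tagged, ["Test Heading 1"]))
print(headings_on_own_line(untagged, ["Test Heading 1"]))
```

A heading that reports False here is a hint that the heading and the following paragraph were not separated in the extracted text, which is worth following up with a tag viewer.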

klpn commented 5 years ago

The text is in correct order for the files I have tested.

klpn commented 5 years ago

The tags, with their textual content, can be inspected e.g. with the Poppler pdfinfo program, like pdfinfo -struct-text [pdffile]. The default ConTeXt template in Pandoc 2.7.2 seems to destroy word boundaries in this output. I changed it according to the ConTeXt wiki (gist with diff), which solves this problem for the files I have tested.

mb21 commented 5 years ago

@klpn Any downsides when using your gist? If not, would you like to make a pull request? For context: the pdfa template variable was added in https://github.com/jgm/pandoc/commit/46f4238a2a40b5542612bc745e63ce503ce12a32

klpn commented 5 years ago

I have not discovered any problems, but I should perhaps test with some more documents. However, the Pandoc manual explicitly states that the pdfa variable "adds to the preamble the setup necessary to generate PDF/A-1b:2005", so this should then be changed as well, if we want to always use 2a (i.e. version 2, level A conformance). When using PDF/A for documents born digital, it is best to use level A (which includes Unicode mapping and tagging) if possible, rather than B, but some older preservation guidelines still require version 1. Perhaps the pdfa variable should be changed so that the user can choose which version of PDF/A to use (among the different PDF standards supported by ConTeXt)?

mb21 commented 5 years ago

@klpn I've created a new issue about the ConTeXt output: https://github.com/jgm/pandoc/issues/5608 Let's continue the discussion there in order to not spam this issue (which is about LaTeX output).

klpn commented 4 years ago

The main disadvantage with the ConTeXt solution, I think, is that there is a lot of functionality implemented in LaTeX (e.g. beamer) where a ConTeXt reimplementation would be cumbersome. The tagpdf package, which has been mentioned, could be used to tag LaTeX documents. It does not add tags automatically, however. Perhaps tags can be injected in the Pandoc AST, to create a structure like that shown in the tagpdf manual, sec. 3.5. I guess this would be hard to do using filters, and would rather require changes in the LaTeX writer?

jgm commented 4 years ago

I think it would be possible (and not too hard) to add these tags using a lua filter. I don't see anything that would require changes to the writer.

For example, to get

\tagstructbegin{tag=H}
\tagmcbegin{tag=H}
\section{Section}
\tagmcend
\tagstructend

we'd have a filter like (untested)

function tagBlock(label, el)
  -- Wrap a block element in tagpdf structure and marked-content commands.
  return { pandoc.RawBlock("latex", "\\tagstructbegin{tag=" .. label ..
                  "}\n\\tagmcbegin{tag=" .. label .. "}"),
           el,
           pandoc.RawBlock("latex", "\\tagmcend\n\\tagstructend") }
end

function Header(el)
  return tagBlock("H", el)
end

And of course you can use tagBlock for other block-level elements too. To get the Sect tags you'd use mkSections first to get section Divs.
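
The same wrapping can also be expressed against pandoc's JSON AST, which is what a JSON (non-Lua) filter sees. Below is a minimal Python sketch, assuming the JSON encoding in which each block is an object with `t` and `c` keys. The helper names `raw_latex`, `tag_block` and `tag_headers` are hypothetical, only top-level blocks are visited, and, like the Lua version, it has not been tested against pandoc or tagpdf.

```python
def raw_latex(code):
    """Build a RawBlock node for LaTeX in pandoc's JSON AST encoding."""
    return {"t": "RawBlock", "c": ["latex", code]}


def tag_block(label, block):
    """Return the block surrounded by tagpdf begin/end commands."""
    begin = "\\tagstructbegin{tag=%s}\n\\tagmcbegin{tag=%s}" % (label, label)
    end = "\\tagmcend\n\\tagstructend"
    return [raw_latex(begin), block, raw_latex(end)]


def tag_headers(blocks):
    """Wrap every top-level Header block; leave other blocks unchanged."""
    result = []
    for block in blocks:
        if block.get("t") == "Header":
            result.extend(tag_block("H", block))
        else:
            result.append(block)
    return result


# A hand-written two-block document body for illustration.
header = {"t": "Header",
          "c": [1, ["section", [], []], [{"t": "Str", "c": "Section"}]]}
para = {"t": "Para", "c": [{"t": "Str", "c": "text"}]}

for node in tag_headers([header, para]):
    print(node["t"])
```

To act as a real filter, a script along these lines would additionally have to read the document JSON from standard input, apply `tag_headers` to its list of blocks, and write the result to standard output; that plumbing is left out here.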

klpn commented 4 years ago

Thanks, I will experiment a bit more with Lua filters and see if I can get accessible PDFs. Once we have properly tagged PDFs, it should also be possible to get PDF/A Level A from LaTeX via the pdfx package.

klpn commented 4 years ago

A problem with a solution like that proposed by @jgm is for Beamer slides. This

# Pixedit

* Converts Office files

yields

\begin{frame}
\tagstructbegin{tag=H}
\tagmcbegin{tag=H}
\end{frame}

\begin{frame}{Pixedit}
\protect\hypertarget{pixedit}{}
\tagmcend
\tagstructend

\begin{itemize}
\tightlist
\item
  Converts Office files

The initial tagging commands before the header are placed in an empty frame, due to the way the writer divides frames from the Header structure when using beamer as output format.
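
This kind of breakage can be caught before compiling by checking the generated LaTeX source. The following is a minimal Python sketch, assuming frames are delimited by plain `\begin{frame}` and `\end{frame}` and are not nested; the function name `unbalanced_frames` is hypothetical. The sample reproduces the output above, with the closing `\end{itemize}` and `\end{frame}` added as an assumption so that the second frame is complete. It is a source-level check only and does not validate the resulting PDF.

```python
import re

TAG_COMMAND = re.compile(r"\\(tagstructbegin|tagstructend|tagmcbegin|tagmcend)\b")


def unbalanced_frames(latex):
    """Return indices of frames whose tagging commands do not pair up."""
    frames = re.findall(r"\\begin\{frame\}(.*?)\\end\{frame\}", latex, re.S)
    bad = []
    for index, body in enumerate(frames):
        counts = {"tagstruct": 0, "tagmc": 0}
        ok = True
        for name in TAG_COMMAND.findall(body):
            kind = "tagstruct" if name.startswith("tagstruct") else "tagmc"
            counts[kind] += 1 if name.endswith("begin") else -1
            if counts[kind] < 0:  # an end command before its begin
                ok = False
        if not ok or any(counts.values()):
            bad.append(index)
    return bad


# Beamer output from the comment above; the last two lines are assumed.
sample = r"""
\begin{frame}
\tagstructbegin{tag=H}
\tagmcbegin{tag=H}
\end{frame}

\begin{frame}{Pixedit}
\protect\hypertarget{pixedit}{}
\tagmcend
\tagstructend

\begin{itemize}
\tightlist
\item
  Converts Office files
\end{itemize}
\end{frame}
"""

print(unbalanced_frames(sample))  # prints [0, 1]
```

Both frames are reported: the first opens a structure and a marked-content chunk without closing them, and the second closes them without having opened them.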

u-fischer commented 4 years ago

A few warnings ...

The tagpdf package, which has been mentioned, could be used to tag LaTeX documents. It does not add tags automatically, however.

Yes. This is explicitly not the purpose of the package. It is not a standard user package. The package has been written to give us (the latex team) and others a tool to investigate and experiment with tagging and to find out which changes in latex are needed.

I don't mind if you try to use it (actually I'm grateful for feedback) but the package is experimental and it is bound to change. For example in the development branch the internal module name has already been changed, for the handling of pdf internals another experimental package is now needed, the handling of artifacts will probably change.

You can get broken pdf if you don't use it correctly (and sometimes if you don't compile often enough to resolve all references). So you need tools to check the validity of the pdf.

For example, to get

\tagstructbegin{tag=H}
\tagmcbegin{tag=H}
\section{Section}
\tagmcend
\tagstructend

Such simple code will normally not work with pdflatex, as there can be page breaks in the wrong place, resulting in broken pdf. With lualatex it is less problematic.

jgm commented 4 years ago

@u-fischer thanks for the note. Does it work to use etoolbox's \apptocmd and \pretocmd to attach these things?

\pretocmd{\section}{\tagstructbegin{tag=H}\tagmcbegin{tag=H}}{}{}
\apptocmd{\section}{\tagmcend\tagstructend}{}{}

This could go in the preamble and then the body would not need to change. It seems to me that a similar approach could be used to tag lots of other things, or am I missing something?

u-fischer commented 4 years ago

@jgm that wouldn't change much (apart from saving the user some typing). The commands are still issued in vertical mode before and after the sectioning. For example, if I compile this with pdflatex:

\documentclass{article}
\usepackage{tagpdf}
\tagpdfsetup{activate-all}
\begin{document}
some text 

\section{Section}
text after

\tagstructbegin{tag=H}
\tagmcbegin{tag=H}
\section{Section}
\tagmcend
\tagstructend
text after

\vspace{33\baselineskip}
\tagstructbegin{tag=H}
\tagmcbegin{tag=H}
\section{Section}
\tagmcend
\tagstructend
text after\\text \\text

\vspace{42\baselineskip}
\tagstructbegin{tag=H}
\tagmcbegin{tag=H}
\section{Section}
\tagmcend
\tagstructend
text after\\text \\text

\end{document}

then I get various problems. E.g. wrong spacing after the sections with tagging commands:

image

A page break after the third section and before the following text:

image

and because of the last section preflight reports wrong operators and a faulty pdf:

image

With lualatex the results are better: the pdf is valid and there is no page break after the section.

The side effects mean that one has to inject the commands into the internal \@startsection instead. And that is what makes the business so complicated: lots of internal code has to be reviewed and reworked to find suitable places for the tagging, ideally without breaking existing documents.

klpn commented 4 years ago

Perhaps it is best to aim for lualatex compatibility while working with this in Pandoc? It is possible to use commands like \BeforeBeginEnvironment and \AfterEndEnvironment in etoolbox to tag up environments like frames. But it seems tricky with e.g paragraphs, so maybe some combination with a lua filter for encapsulating these things could work?

klpn commented 4 years ago

As a beginning, I have this file tagged-pres.md

---
title: About PDF converters
author: Karl Pettersson
lang: en
...

# Pixedit

Converts Office files and PDF files to PDF/A.

# Ghostscript

Can convert various files to PDF/A.

I added these lines before \begin{document} in the Pandoc default beamer template, saved as tagged-pres.beamer, and also added \tagstructbegin{tag=Document} after \begin{document} and \tagstructend before \end{document} .

\usepackage{etoolbox}
\BeforeBeginEnvironment{frame}{\tagstructbegin{tag=Sect}%
   \tagstructbegin{tag=H}\tagmcbegin{tag=H}}
\AtBeginEnvironment{frame}{\tagmcend\tagstructend}
\AfterEndEnvironment{frame}{\tagstructend}

\usepackage{tagpdf}
\tagpdfsetup{
 activate-all,
 uncompress,
 tabsorder=structure,
 interwordspace=true
 }

I also have a Lua file, tagged-pres.lua.

function tagBlock(label, el)
  return { pandoc.RawBlock("latex", "\\tagmcbegin{tag=" .. label .. "}"),
    el,
    pandoc.RawBlock("latex", "\\tagmcend\n") }
end

function Para(el)
  return tagBlock("P", el)
end

Running pandoc -t beamer -o tagged-pres.pdf tagged-pres.md --lua-filter=tagged-pres.lua --pdf-engine=lualatex --template=tagged-pres.beamer yields a tagged-pres.pdf that looks ok visually, and pdfinfo -struct-text tagged-pres.pdf reveals this structure.

Sect
  H (block)
    ""
    "Pixedit"
    ""
    ""
  "Converts Office files and PDF files to PDF/A."
  Sect
    H (block)
      ""
      "Ghostscript"
      ""
    "Can convert various files to PDF/A."

klpn commented 4 years ago

With these changes in the default Beamer template (generated with pandoc -D beamer > default.beamer in Pandoc 2.8), and the other files and commands as in the comment above, I can also create a PDF/A-2a which validates using the latest veraPDF (1.14.105).

matthewlehew commented 4 years ago

I want to add that this issue is conceptually related to #3177, as the current handling of figures in the Pandoc AST makes it difficult (if not impossible) to create documents that adhere to accessibility standards. Specifically, there is no way to write alt text that is separate from a figure caption.

I use Pandoc to create an open-access textbook, yet I have to convert it to a format where I can manually add alt text to all the captioned figures and then export it to a tagged PDF. So even once this issue is resolved, there is still a significant issue that will keep many from being able to use Pandoc to export accessible PDF documents.

u-fischer commented 4 years ago

@klpn well yes, but you create basically no structure, only the document level and H-header. You don't mark up lists, or figures or tabulars or math or links.

klpn commented 4 years ago

Yes, these examples were just a skeleton, intended to show that at least some tagging could be implemented using etoolbox and lua filters. Of course, it has to be expanded with many more elements to be of any practical use.

klpn commented 4 years ago

Created a new repo to experiment further with this tagging and other PDF accessibility issues.

klpn commented 4 years ago

@matthewlehew It seems that with Lua filters, you can add an /Alt key to e.g. a Figure tag using a custom attribute for an image, which is independent of the figure caption (added a sketch of functionality for this).

klpn commented 4 years ago

I have a tagged PDF produced using make pdfnb from accpdf/tagged-article, which is valid PDF/A-2a, according to veraPDF 1.14.8. It is possible to inspect the tag structure using e.g. pdfinfo or Adobe Acrobat. The tagging commands inserted with the lua filter tend to interfere with other LaTeX commands used by Pandoc, in e.g. tables, and I use some ugly workarounds in the makefile to clean up the resulting LaTeX code. In the future, when there is a LaTeX tagging package intended for production, it will perhaps not have such an interface, which requires explicit tagging commands in the document.

The most robust solution for producing tagged PDFs via Pandoc available today may be via ConTeXt. I have seen several claimed PDF/A-1a produced with MS Word (not from Pandoc-generated DOCX, as far as I know) that fail validation in veraPDF, due to corrupt tag structure. The files should be validated and the tag structure should be inspected, regardless of the PDF engine used.

u-fischer commented 4 years ago

@klpn Is the latex code available somewhere?

klpn commented 4 years ago

Yes, I also uploaded the TeX file which I ran through lualatex 1.10 in order to generate the PDF. To regenerate the PDF from it, you also need the PNG which is used for the chart in the article.

frastlin commented 4 years ago

Prince is what I've been using and it is tagged properly from what I've been able to tell.

klpn commented 4 years ago

Yes, but one downside with Prince for many users (like with ConTeXt, but maybe even worse, because Prince does not use TeX at all, from what I can tell), is that very many templates and packages are designed for LaTeX, so we still need accessible PDFs via LaTeX.

jim0203 commented 4 years ago

I've just tried using --pdf-engine=context for a simple Markdown document that includes headers at levels 1, 2, 3, and 4. While each of these was tagged as a header in the resulting PDF, the specific level of each header was disregarded. Instead, each header was tagged at level 1.

Looking forward to a solution to this, I produce a bunch of PDFs for people who use screenreaders, so being able to produce them with pandoc would make my life a lot easier.

klpn commented 4 years ago

You can either use an "HTML-like" structure with H1, H2, H3 or a nested "XML-like" structure with nested Sect and H tags (see e.g. the tagpdf manual, sec. 3.3.2). ConTeXt seems to use the latter.

---
title: Test header level
---

# Header 1

## Header 2

### Header 3

#### Header 4

Paragraph text

If this is saved as headertest.md, compiling with pandoc -o headertest.pdf headertest.md --pdf-engine=context --metadata pdfa:2a and running pdfinfo -struct-text headertest.pdf yields

Div
  "Test header level"
  Sect "section"
    Div
      H (block)
        "Header 1"
    Div
      Sect "subsection"
        Div
          H (block)
            "Header 2"
        Div
          Sect "subsubsection"
            Div
              H (block)
                "Header 3"
            Div
              Sect "subsubsubsection"
                Div
                  H (block)
                    "Header 4"
                Div
                  "Paragraph text"

jim0203 commented 4 years ago

I can repeat that behaviour. However, when I open the PDF in Windows and try to read it with ZoomText Fusion 2020 (with JAWS, essentially), every header reads as level 1.

When I create a document in Word and then save as PDF using the Microsoft accessibility spec, all of the headers in the PDF work as I would expect them to. Running the Word PDF through pdfinfo returns the following, which looks pretty different to the comparable output from pandoc:

Document
  H1 (block)
    "Header 1 "
  H2 (block)
    "Header 2 "
  H3 (block)
    "Header 3 "
  H4 (block)
    "Header 4 "
  P (block)
    "Paragraph text "

Indeed, you can see how this output actually reflects H1, H2, H3, H4, whereas the pandoc output above just has H, H, H, H. So, while Context does produce the output you describe above, it doesn't actually tag PDFs in a way that makes them readable by a screen reader. I'm not sure how to get the HTML-like structure you describe, but it sounds like what is needed in order to create screen-readable PDFs.
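
The two outputs encode the same information in different ways: in the nested structure, the level of an H element is implied by how many Sect elements enclose it. The following is a minimal Python sketch of that relationship, assuming the indented `pdfinfo -struct-text` format exactly as shown in this thread; the function name `heading_levels` is hypothetical. It is meant only to illustrate the mapping, not to describe how any screen reader interprets the tags.

```python
def heading_levels(struct_text):
    """Return (level, title) pairs from nested Sect/H structure output.

    The level of an H element is taken to be the number of enclosing
    Sect elements, which are tracked by indentation.
    """
    sect_indents = []  # indentation of the currently open Sect elements
    pending = None     # level of an H element still waiting for its text
    result = []
    for line in struct_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        indent = len(line) - len(line.lstrip())
        # Leaving a subtree closes every Sect at the same or deeper indent.
        while sect_indents and sect_indents[-1] >= indent:
            sect_indents.pop()
        if stripped.startswith("Sect"):
            sect_indents.append(indent)
        elif stripped.startswith("H "):
            pending = len(sect_indents)
        elif stripped.startswith('"') and pending is not None:
            result.append((pending, stripped.strip('"')))
            pending = None
    return result


# The ConTeXt output quoted earlier in the thread.
SAMPLE = """
Div
  "Test header level"
  Sect "section"
    Div
      H (block)
        "Header 1"
    Div
      Sect "subsection"
        Div
          H (block)
            "Header 2"
        Div
          Sect "subsubsection"
            Div
              H (block)
                "Header 3"
            Div
              Sect "subsubsubsection"
                Div
                  H (block)
                    "Header 4"
                Div
                  "Paragraph text"
"""

for level, title in heading_levels(SAMPLE):
    print("H%d %s" % (level, title))
```

For the sample, the four headers come out at levels 1 to 4, matching the H1 to H4 tags in the Word output.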

klpn commented 4 years ago

It seems that this would need modification in the Context source code. My accpdf package, which is based on tagpdf, yields leveled headers; however, like tagpdf itself, it is not production ready.

I have the impression that no existing software is really good at producing tagged PDFs; perhaps the need to represent document structure should have been handled in a different way, e.g. by standardizing an embedded HTML or EPUB representation of the content, but there is probably not much we can do about that.

jim0203 commented 4 years ago

Thanks for clarifying. I guess I'll continue using MS Word and saving as PDF from there, with the accessibility option selected. That seems to do the job well. LibreOffice offers a similar option, but I've not tried it so can't say whether it works or not.

klpn commented 4 years ago

Yes, Pandoc can generate DOCX from any supported input format, which can be exported as PDF in Word. As I noted above, even Word may create corrupt tag structure, so it is a good idea to always validate the PDFs.

adityam commented 3 years ago

Indeed, you can see how this output actually reflects H1, H2, H3, H4, whereas the pandoc output above just has H, H, H, H. So, while Context does produce the output you describe above, it doesn't actually tag PDFs in a way that makes them readable by a screen reader. I'm not sure how to get the HTML-like structure you describe, but it sounds like what is needed in order to create screen-readable PDFs.

So, isn't it a bug in the screen reader that it is not reading standards-compliant PDF?