Open jgm opened 5 years ago
@adityam While using H as structure element with deep nesting is valid pdf, the practice guide (https://www.pdfa.org/resource/tagged-pdf-best-practice-guide-syntax/) recommends not to use it:
Due to a lack of suitable tools, this structure element is impractical, and its use is not recommended
I don't understand why (I'm still learning all of this), but the pandoc PDF output seems to have a big problem for screen readers: the spaces between words are missing. You can check this by saving the PDF as text in Acrobat Reader. This happens with LuaLaTeX, XeLaTeX and ConTeXt. With pdfLaTeX the problem mostly does not occur (it happens with some spaces, but not all).
I don't know if I should open a new issue for this.
What is the output of poppler's pdftotext <filename>.pdf? Are the files PDF/A-conforming (you can check the metadata with `pdfinfo` and verify with veraPDF)?
My file is made with the pandoc template, so it is not PDF/A. But even if I use \usepackage[a-2b]{pdfx} I still have this problem.
If I use pdftotext there is no problem. Maybe it's an Adobe issue, but it's important because some screen readers seem to rely on Adobe's conversion.
I used your tagged pdf and this pdf does not have this problem.
tagged.pdf is a properly tagged pdf, and it has been made with some lua code which inserts real space chars between words.
I've had some success using @u-fischer's tagpdf package. Try putting this near the top of your template.
\ifluatex
\usepackage[luamode]{tagpdf}
\tagpdfsetup{
activate-all=true,
interwordspace=true
}
\fi
And compile with --pdf-engine=lualatex.
At least this will help with the spaces-between-words problem.
@jgm But as a warning: tagpdf is experimental, and I really mean that. The next version will require some new experimental code, which isn't yet compatible with every package, so tagpdf won't work with all documents for some time.
Thanks, but that didn't work for me, sorry. (I used \ifluatex instead of \iflualatex.)
Don't worry, I see it's not an easy problem, I'll use pdftex for now; and I will wait for the improvements of tagpdf.
Thanks to both of you
I'm not sure if this is a useful example or not, but the web app PAVE uses itext...?
PAVE validates, autotags, and allows manual adding/removal of tags, alt text, reading order and metadata (like setting language and title).
Will support for PDF/UA-1 output be on the cards?
PDF/UA Reference Suite 1.1 – PDF Association
Technical Implementation Guide PDF/UA – AIIM
@MarkWahlsten - this thread is specifically about LaTeX output; itext looks like a non-LaTeX PDF generation library, so it's not really relevant here. (Note pandoc already has lots of non-LaTeX routes to PDF; see the different possible arguments for --pdf-engine. If I recall, some of these already support tagging more fully than the LaTeX route. But the LaTeX route gives the best output, especially if your document contains math, so we'd really like a solution here.)
On pdfa see also #3215 and #5608.
@u-fischer, @jgm I was finally able to solve this spaces-between-words problem by removing the microtype package from pandoc's template and using
\usepackage[luamode]{tagpdf}
\tagpdfsetup{
activate-all=true,
interwordspace=true
}
The microtype package causes some problems when interacting with tagpdf, but it works if you load microtype before tagpdf.
Thanks again.
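For later readers, a minimal sketch of the load order described above. It assumes LuaLaTeX and a tagpdf version that still accepts these options; since tagpdf is experimental, the option names may change.

```latex
% Sketch only: load microtype first, then tagpdf
% (the order reported to work in the comment above).
\usepackage{microtype}
\usepackage[luamode]{tagpdf}
\tagpdfsetup{
  activate-all=true,
  interwordspace=true
}
```

As in the earlier suggestion, this would be compiled with --pdf-engine=lualatex.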
We've added a section on PDF accessibility to the manual; it summarizes the currently available methods to produce tagged PDFs.
For a number of documents (those using the standard classes and a restricted set of packages), tagging is readily available, and with the June release the number will grow. Users who want to try it should have a current LaTeX, ideally use lualatex-dev, and start their document with \DocumentMetadata{testphase={phase-III}} or even (if they are daring) \DocumentMetadata{testphase={phase-III,math}}. Feedback can be given at https://github.com/latex3/latex2e/discussions/1010 or at the tagpdf GitHub https://github.com/u-fischer/tagpdf.
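A minimal test document along these lines might look as follows. This is only a sketch: the class, section title and body text are placeholders, and it assumes a current LaTeX installation with the tagging packages available. Note that \DocumentMetadata has to appear before \documentclass.

```latex
% Sketch of a minimal tagging test; compile with lualatex-dev (or lualatex).
% \DocumentMetadata must be the first thing in the file, before \documentclass.
\DocumentMetadata{testphase={phase-III}}
\documentclass{article}

\begin{document}

\section{Introduction}

Some words to check that the spaces between them survive text extraction.

\end{document}
```

Saving the resulting PDF as text, or running pdftotext on it, is a quick way to see whether the spaces between words are preserved.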
Just a quick report: after reading the prototype usage instructions I tried to generate a tagged PDF via pandoc and LaTeX and got good results on many documents.
Here's my setup, using Docker images for better reproducibility. Dockerfile:
FROM pandoc/latex:latest
RUN tlmgr install l3experimental pdfmanagement-testphase tagpdf
Built a Docker image with
docker build --tag pandoc/latex:tagging .
and then used it with
docker run --rm -t -v $PWD:/data -u $(id -u):$(id -g) pandoc/latex:tagging \
--pdf-engine=lualatex \
--to=tagged-pdf.lua \
...
The tagged-pdf.lua just adds the code snippet given in the article to the default template.
Template = [[\DocumentMetadata{
$if(lang)$
lang = $lang$,
$endif$
pdfversion = 2.0,
pdfstandard = ua-2,
pdfstandard = a-4f, %or a-4
testphase =
{phase-III,
title,
table,
math,
firstaid}
}
]] .. pandoc.template.default 'latex'
function Writer(doc, opts)
return pandoc.write(doc, 'latex', opts)
end
This seems to work nicely for many documents, but some docs fail with a message like
! Package tagpdf Error: there is no open structure on the stack.
Given the moderate success, I think it would be justified to add a snippet like that to the default template if the pdfa variable is set, similar to what we do for ConTeXt.
I think it would be justified to add a snippet like that to the default template if the pdfa variable is set, similar to what we do for ConTeXt.
Will this cause any problems if you don't have the latest latex3 tech and accessibility packages installed?
Setting the pdfa variable would then cause a LaTeX compilation failure. So it would have to be well-documented.
if the pdfa variable is set
I wouldn't connect tagging to a general pdfa variable. While there are PDF/A standards that also require a tagged PDF, archivable and accessible are nevertheless two different things. Also, the tagging code still errors in various cases, and it would be a pain for users who only want PDF/A-2b or similar to have to handle that.
I would suggest some new, dedicated variable, e.g. tagging. Enabling tagging should really be an explicit choice and not be sneaked into existing user code.
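To make the suggestion concrete, here is a rough sketch of how such an opt-in could look at the top of a pandoc LaTeX template. The variable name tagging is hypothetical (taken from the suggestion above) and this is not a description of pandoc's actual default template; the testphase value is the one mentioned earlier in this thread.

```latex
$-- Sketch only: "tagging" is a hypothetical, opt-in template variable.
$if(tagging)$
\DocumentMetadata{
$if(lang)$
  lang = $lang$,
$endif$
  testphase = {phase-III}
}
$endif$
\documentclass{article}
```

With a guard like this, documents that do not set the variable would compile exactly as before, and users without the current tagging packages would only see errors if they explicitly asked for tagging.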
--pdf-engine=lualatex
lualatex-dev will probably give better results. The system should be as current as possible, as this is work in progress.
Just to clarify: the pdfa variable currently has no effect on LaTeX, it just works with ConTeXt. But I agree that a different variable name might be better.
Yes please use another variable
Could I propose that the standard PDF output should be accessible by default, once the LaTeX project have finished their testing phase?
@tarleb
This seems to work nicely for many documents, but some docs fail with a message like
! Package tagpdf Error: there is no open structure on the stack.
If you have latex examples showing errors you can report them at the tagging-project github.
PDFs produced using latex are not accessible. We could introduce a command-line option that causes the latex writer to include annotations for math (perhaps using the unicode fallback or even raw tex), image alt text, and more: http://ctan.math.washington.edu/tex-archive/macros/latex/contrib/oberdiek/accsupp.pdf
This package also includes an option that makes spaces visible to copy and paste (often when you copy from a latex-compiled PDF, spaces disappear).
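As an illustration of the kind of annotation meant here, a small sketch using accsupp to attach replacement text to a formula. The wording of the ActualText is made up for the example, and how well a given screen reader honors it is an assumption that would need testing.

```latex
% Sketch only: attach a spoken/copyable replacement text to a formula.
\documentclass{article}
\usepackage{accsupp}

\begin{document}

The Pythagorean theorem states that
\BeginAccSupp{method=escape,ActualText={a squared plus b squared equals c squared}}%
$a^2 + b^2 = c^2$%
\EndAccSupp{}.

\end{document}
```

A writer option could emit wrappers like this around math and images, filling the text from a Unicode fallback, the raw TeX source, or the image's alt text.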
Structural elements (paragraphs, lists, etc.) need to be tagged, and reading order indicated.
See also: https://www.tug.org/twg/accessibility/ http://web.science.mq.edu.au/~ross/TaggedPDF/ https://tex.stackexchange.com/questions/124291/revisiting-producing-structured-pdfs-from-latex (with information about using ConTeXt)