latex3 / tagging-project

Issues related to the tagging project
https://latex3.github.io/tagging-project/
LaTeX Project Public License v1.3c

includepdf causes CPU usage to skyrocket (infinite loop?) #86

Closed DavidEGx closed 2 weeks ago

DavidEGx commented 2 weeks ago

I have this latex code:

\DocumentMetadata{
  lang        = en,
  pdfversion  = 2.0,
  pdfstandard = ua-2,
  testphase   = {phase-III, title, table, math, firstaid}
}

\documentclass[10pt,a4paper,notitlepage,twoside,openright]{report}
\usepackage{pdfpages}

\begin{document}

  \includepdf[pages={-},nup=1x1,frame=true]{/home/david/toinclude.pdf}

\end{document}

Notice /home/david/toinclude.pdf is a very simple document.

Then run:

$ xelatex file.tex
Package tagpdf Info: Finalizing the tagging structure:
(tagpdf)             Writing out ~13 structure objects
(tagpdf)             with ~10 'MC' leaf nodes.
(tagpdf)             Be patient if there are lots of objects!

Package tagpdf Info: writing ParentTree
Package tagpdf Info: writing IDTree
Package tagpdf Info: writing RoleMap
Package tagpdf Info: writing ClassMap
Package tagpdf Info: writing NameSpaces
Package tagpdf Info: writing StructElems
Package tagpdf Info: writing Root

It never seems to stop and my CPU goes to 100%.

If I remove phase-III and run it, it works just fine. But obviously no tags.


$ xelatex --version
XeTeX 3.141592653-2.6-0.999996 (TeX Live 2024)
kpathsea version 6.4.0
Copyright 2024 SIL International, Jonathan Kew and Khaled Hosny.
There is NO warranty.  Redistribution of this software is
covered by the terms of both the XeTeX copyright and
the Lesser GNU General Public License.
For more information about these matters, see the file
named COPYING and the XeTeX source.
Primary author of XeTeX: Jonathan Kew.
Compiled with ICU version 74.2; using 74.2
Compiled with zlib version 1.3.1; using 1.3.1
Compiled with FreeType2 version 2.13.2; using 2.13.2
Compiled with Graphite2 version 1.3.14; using 1.3.14
Compiled with HarfBuzz version 8.3.0; using 8.3.0
Compiled with libpng version 1.6.43; using 1.6.43
Compiled with pplib version v2.2
Compiled with fontconfig version 2.13.0; using 2.13.1
$ tlmgr info pdfpages
package:     pdfpages
category:    Package
shortdesc:   Include PDF documents in LaTeX
longdesc:    This package simplifies the inclusion of external multi-page PDF documents in LaTeX documents. Pages may be freely selected and similar to psnup it is possible to put several logical pages onto each sheet of paper. Furthermore a lot of hypertext features like hyperlinks and article threads are provided. The package supports pdfTeX (pdfLaTeX) and VTeX. With VTeX it is even possible to use this package to insert PostScript files, in addition to PDF files.
installed:   Yes
revision:    71386
sizes:       src: 185k, doc: 361k, run: 101k
relocatable: No
cat-version: 0.6a
cat-license: lppl1.3c
cat-topics:  graphics-incl pdf-feat
collection:  collection-latexrecommended

davidcarlisle commented 2 weeks ago

A portable version (which also works with lualatex, which is generally preferred for tagging) is:

\DocumentMetadata{
  lang        = en,
  pdfversion  = 2.0,
  pdfstandard = ua-2,
  testphase   = {phase-III, title, table, math, firstaid}
}

\documentclass[10pt,a4paper,notitlepage,twoside,openright]{report}
\usepackage{pdfpages}

\begin{document}

  \includepdf[pages={-},nup=1x1,frame=true]{example-image-a4-numbered.pdf}

\end{document}

It does seem to loop with xelatex on the first run (it works with xelatex if luatex has previously written an aux file).

u-fischer commented 2 weeks ago

You shouldn't use xelatex for tagging. It can't handle real space chars properly. Use lualatex. But besides this, I can reproduce the bug too; something probably goes wrong in writing the position of the graphic. But it works with includegraphics, so I wonder what pdfpages is doing here.

u-fischer commented 2 weeks ago

I fixed the bug in tagpdf which led to the loop.

But besides that, \includepdf is problematic. First, it calls \includegraphics more than once, so e.g. a simple \includepdf{example-image.pdf} leads to seven figure structures:

image

I can get rid of one of them by adapting the page count command:

\usepackage{l3graphics}
\ExplSyntaxOn
\makeatletter
% Redefine the pdfpages page count command to use l3graphics
\def\AM@getpagecount{\graphics_get_pagecount:nN{\AM@currentdocname}\AM@pagecount}
\makeatother
\ExplSyntaxOff

But for the others, the help of the pdfpages maintainer is needed ...

The second problem is that if you include a larger document with text in it, you should consider what that means for accessibility: such a document has no structure, it is only a number of larger pictures.
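
As a rough illustration of the point above, a single page of an external PDF can be included with \includegraphics, so that it becomes one figure with a text alternative. This is only a sketch: the reduced testphase list, the use of the alt key with the tagging code, and the placeholder alt text are assumptions, not something confirmed in this thread.

```latex
\DocumentMetadata{
  lang        = en,
  pdfversion  = 2.0,
  pdfstandard = ua-2,
  testphase   = {phase-III, firstaid}
}
\documentclass{article}
\usepackage{graphicx}

\begin{document}
% One page of the external PDF, included as a single figure.
% Assumption: the alt key is picked up by the tagging code loaded above.
% The alt text is a placeholder and must describe the real page content.
\includegraphics[page=1, width=\textwidth, alt={Placeholder description of the included page}]{example-image-a4-numbered.pdf}
\end{document}
```

The page is still only a picture for assistive technology; any text inside it stays unstructured.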

DavidEGx commented 2 weeks ago

Thanks for the answers and the work.

I have a nice pile of legacy code that generates PDFs in a myriad of ways. Dunno if replacing xelatex with lualatex is feasible.

Not all the generated PDFs use includepdf, so we'd be making some progress here. Even if we don't make it all accessible in the first go, we can iterate later. What we cannot have is servers melting down 😁.

Anyway, couple of questions:

  1. When will this fix be available as part of TeX Live? (No clue how all the latex tools are glued together).
  2. What tools do you use to inspect the tags? I only found ngpdf.com, but that looks like a different one.

u-fischer commented 2 weeks ago

I have a nice pile of legacy code that generates PDFs in a myriad of ways. Dunno if replacing xelatex with lualatex is feasible.

Try it out. Normally it should not be a problem, and xelatex is really not suited for tagging.

This here e.g. is some simple text with real space chars in lualatex:

image

and here the same with xelatex:

image

I will try to make a tagpdf update this week.

I used Adobe Acrobat Pro to check the tags. You can also use PDF-XChange, or the newest PAC 2024: https://pac.pdf-accessibility.org/de/herunterladen.

u-fischer commented 2 weeks ago

I uploaded the fix to ctan.

DavidEGx commented 2 weeks ago

I have a nice pile of legacy code that generates PDFs in a myriad of ways. Dunno if replacing xelatex with lualatex is feasible.

Try it out. Normally it should not be a problem, and xelatex is really not suited for tagging.

Thanks. I'll give it a go.

I used Adobe Acrobat Pro to check the tags. You can also use PDF-XChange, or the newest PAC 2024: https://pac.pdf-accessibility.org/de/herunterladen.

I was thinking of Linux. PAC 2024 seems to work OK(ish) with wine:

$ WINEPREFIX=~/.wine32 winetricks dotnet48
$ WINEPREFIX=~/.wine32 wine PAC.exe

However, I get "PDF Header not found" when opening a PDF generated by myself: image

Changed pdfversion to 1.7, generated the file again, and then it looks good: image

Tried PAC 2024.2.1 BETA and now the 2.0 opens: image

Then it looks fine.

I wonder if I should stick to 1.7 because it is more widely supported or go to 2.0 because it introduces enhancements for accessibility. 🤔

u-fischer commented 2 weeks ago

Tried PAC 2024.2.1 BETA and now the 2.0 opens:

Yes, after some pushing they have now just started to add support for PDF 2.0. (But they do not get all tests correct.)

I wonder if I should stick to 1.7 because it is more widely supported or go to 2.0 because it introduces enhancements for accessibility

We push and promote 2.0 as it is really needed if you have math in your document (and also for some other things). So the more PDF 2.0 files are around (and people complaining if tools do not handle them correctly), the better imho. So I would produce PDF 2.0 unless someone/something forces you to fall back to 1.7.
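
For the case where a tool does force the fallback, a minimal sketch of the metadata block with PDF 1.7 could look like this. It reuses the keys from the example at the top of the thread; dropping pdfstandard is an assumption made here because PDF/UA-2 builds on PDF 2.0, and the article class and body text are placeholders.

```latex
% Sketch only: fallback metadata for tools that cannot read PDF 2.0.
% Assumption: pdfstandard is left out, because ua-2 builds on PDF 2.0.
\DocumentMetadata{
  lang       = en,
  pdfversion = 1.7,
  testphase  = {phase-III, title, table, math, firstaid}
}
\documentclass{article}

\begin{document}
Some text.
\end{document}
```

Only the pdfversion value needs to change back to 2.0 once the downstream tools can handle it.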

DavidEGx commented 1 week ago

You shouldn't use xelatex for tagging. It can't handle real space chars properly. Use lualatex.

Tried to switch to lualatex... Found some issues, the main one at the moment is that we use:

\usepackage{xeCJK}

That doesn't seem to work with lualatex. This answer recommends using luatexja-fontspec instead. But that seems to clash with other packages we are using (tabularx) 😥.

Can you explain a bit more about why we shouldn't use xelatex? Is this a "you shouldn't use it, but it is OK-ish" or more like a "you must absolutely avoid xelatex"?

u-fischer commented 1 week ago

Can you explain a bit more about why we shouldn't use xelatex? Is this a "you shouldn't use it, but it is OK-ish"

AsIwrotewithXeLaTeXonecan'tcurrentlyinsertrealspacechars,sofromtheperspectiveofaccessibilitytherearenowordspaces.Decideyourselfifyouwanttoinflictthisonyourusers.

Besides the problem of the spaces: regarding tagging, xelatex is quite similar to pdflatex. You have to insert literals/specials everywhere and keep track of the state with labels. That is much less flexible than lualatex, where one can use attributes and callbacks to change stuff after the typesetting.

I'm not aware of a clash of luatexja with tabularx, but this can probably be resolved. Make a minimal example that demonstrates the issue and ask e.g. on tex.stackexchange.

FrankMittelbach commented 1 week ago

Can you explain a bit more about why we shouldn't use xelatex? Is this a "you shouldn't use it, but it is OK-ish"

AsIwrotewithXeLaTeXonecan'tcurrentlyinsertrealspacechars,sofromtheperspectiveofaccessibilitytherearenowordspaces.Decideyourselfifyouwanttoinflictthisonyourusers.

Not just "inflict": it simply means you can't produce a valid PDF/UA file, can you? Having explicit spaces is a requirement.

DavidEGx commented 1 week ago

Thanks, I guess I was confused because in the pdf I saw the spaces. But I see the spaces within the tags are broken.

I'm not aware of a clash of luatexja with tabularx, but this can probably be resolved. Make a minimal example that demonstrates the issue and ask e.g. on tex.stackexchange.

It actually looks related to tagging, maybe a new issue here?

\DocumentMetadata{
  lang        = en,
  pdfversion  = 2.0,
  pdfstandard = ua-2,
  testphase   =
   {phase-III,
    table,
    math,
    firstaid}
}
\documentclass[10pt,a4paper]{report}
\usepackage{luatexja-fontspec}
\usepackage{tabularx}

\begin{document}
  \begin{tabularx}{\textwidth}{|X|X|}
  hello & hola
  \end{tabularx}
\end{document}

$ lualatex latexdoc.tex
(./latexdoc.aux) (/opt/texlive/2024/texmf-dist/tex/latex/base/ts1cmr.fd)
Info: mathml file latexdoc-mathml does not exist
! Argument of \__math_grab_dollar:w has an extra }.
<inserted text> 
\par 
l.18   \end{tabularx}

? 

Any of these fixes it:

(I can probably carry on removing "math" myself)

u-fischer commented 1 week ago

Removing the math will avoid the error, but the tagging of the table is broken nevertheless. (You get warnings like Package tagpdf Warning: Parent-Child 'P/pdf2' --> 'TR/pdf2'. in the log and that means something is not right.)

Basically, luatexja is currently not compatible, as it overwrites internal tabular commands and so removes the tagging code. They should either remove the patches or adapt them to the new kernel code. I will open a new issue to track that.