latex3 / tagging-project

Issues related to the tagging project
https://latex3.github.io/tagging-project/
LaTeX Project Public License v1.3c

Process to debug & patch non-standard documentclass tagging #21

Closed dpantele closed 11 months ago

dpantele commented 11 months ago

The current code works marvelously for papers using article etc. As warned, it does not work well for non-standard classes such as aa or revtex4. The warnings and problems I see are mostly about unclosed floats.

Two examples:

For aa, there is an example, https://arxiv.org/abs/2201.00151, which also includes PostScript graphics, so I tried xelatex:

% a.tex
\DocumentMetadata{testphase={phase-III,math},pdfversion=2.0,pdfstandard=A-4}
\AddToHook{env/document/end}{
\clearpage
\typeout{tagpdf-mc-data:begin}
\ShowTagging{mc-data}
\typeout{tagpdf-mc-data:end}
}
\AddToHook{env/document/before}{
\usepackage{tagpdf-debug}
\usepackage[depth=10]{bookmark}
\usepackage[crop=off]{auto-pst-pdf}
}
\input{Populations4.tex}

Compiled with:

xelatex -interaction=nonstopmode -output-directory="out" a.tex

There are many warnings like

Package tagpdf Warning: The structure Sect can not be closed.
(tagpdf)                It is not equal to the current structure float on the
(tagpdf)                main stack

and as a result we don't get a StructTree that Acrobat Reader can read.

I am attaching the log for the same file compiled with lualatex: a.log

For revtex, we can even get an infinite loop in the StructTree, with https://arxiv.org/abs/0908.1147:

% a.tex
\DocumentMetadata{testphase={phase-III,math},pdfversion=2.0,pdfstandard=A-4}
\AddToHook{env/document/end}{
\clearpage
\typeout{tagpdf-mc-data:begin}
\ShowTagging{mc-data}
\typeout{tagpdf-mc-data:end}
}
\AddToHook{env/document/before}{
\usepackage{tagpdf-debug}
\usepackage[depth=10]{bookmark}
}
\input{FeGapsFinal.tex}

Luatex log: a (1).log, which again contains warnings:

Package tagpdf Warning: Parent-Child 'P/pdf2' --> 'text-unit/latex'.
(tagpdf)                Relation is not allowed (struct 257, /text --> struct
(tagpdf)                278) on line 704

...

Package tagpdf Warning: There are still open structures on the stack!
(tagpdf)                The stack contains
(tagpdf)                {text}{P},{text-unit}{Part},{text}{P},{text-unit}{Part}
,{float}{Aside},{float}{Aside},{float}{Aside},{float}{Aside},{text}{P},{text-un
it}{Part},{text-unit}{Part},{Document}{Document},{Root}{StructTreeRoot}.
(tagpdf)                The structures are automatically closed,
(tagpdf)                but their nesting can be wrong.

Given that there are many unclosed floats, it looks like there is again a problem with closing the float struct.

What is the best way to debug/patch that?

FrankMittelbach commented 11 months ago

I guess this is better handled as an issue on the tagging-project repo, so I'm going to move it there.

u-fischer commented 11 months ago

Sorry, but please do not link to complicated external sources (and actually I can't even access this one). If you find a problem, make a small but complete example and post it here.
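As a rough idea of what "small but complete" could look like for the float warnings, here is a minimal sketch. It is not taken from the linked papers: the section, text and placeholder rule are made up for illustration, and the \DocumentMetadata line is copied from the files above. Whether it reproduces a given warning depends on the class that is swapped in.

```latex
% mwe.tex: hypothetical minimal example, not derived from the linked sources
\DocumentMetadata{testphase={phase-III,math},pdfversion=2.0,pdfstandard=A-4}
\documentclass{article} % replace with the class under test
\usepackage{tagpdf-debug} % same debugging package as in the files above
\begin{document}
\section{A section}
Some text before the float.
\begin{figure}
  \rule{3cm}{2cm}% placeholder box instead of an external graphic
  \caption{A caption}
\end{figure}
Some text after the float.
\end{document}
```

If the warning disappears with article and comes back with the other class, that already narrows the problem down to class-specific code.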

Besides this: classes like aa or revtex4-2 currently can't be used. They redefine too many relevant LaTeX commands, and this can't be repaired with a few patches. In my opinion these classes can only be made tagging-aware if their maintainers are involved and willing to adapt them.

As a side remark: I do not recommend xelatex for tagging. It can't produce real space chars. Use lualatex or pdflatex.

dpantele commented 11 months ago

I can try to extract an example, but first I would like to understand whether there is a middle ground: being able to analyze an accessible PDF and explore the produced structure would be really nice.

First of all, it seems that content tagging (generating MCID) works pretty well without any patches. Is that expected?

For the structure, a lot is produced right now, so even just making sure that the structure is balanced would be worth it. If we miss sections for a particular style, it should not be a huge deal. Do we still need support from the class authors to resolve errors like "There are still open structures on the stack!" and "The structure Sect can not be closed"?

u-fischer commented 11 months ago

> First of all, it seems that content tagging (generating MCID) works pretty well without any patches. Is that expected?

With lualatex yes, you normally always get sane MCIDs; with pdflatex no, a bad class can produce invalid PDFs here (through wrong nesting or a missing EMC).

> For the structure, a lot is produced right now

Well, it depends. E.g. the aa class you mentioned redefines \enddocument and so loses all relevant hooks. For tagpdf this means that it can't write any structure at all into the PDF. This class completely breaks tagging.
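One way to check whether a class still runs the end-of-document hooks is to add a marker to a few of them and look for the markers in the log. The following is only a sketch: the HOOKCHECK marker text is made up, and which of these hooks tagpdf actually relies on is not stated here, so this only shows whether the hooks fire at all.

```latex
% hookcheck.tex: hypothetical test file
\DocumentMetadata{testphase={phase-III,math},pdfversion=2.0}
\documentclass{article} % replace with the class under test
% Each hook writes one marker line to the terminal and the log.
\AddToHook{enddocument}{\typeout{HOOKCHECK: enddocument ran}}
\AddToHook{enddocument/afterlastpage}{\typeout{HOOKCHECK: enddocument/afterlastpage ran}}
\AddToHook{enddocument/end}{\typeout{HOOKCHECK: enddocument/end ran}}
\begin{document}
Text.
\end{document}
```

With a standard class all three marker lines should appear in the log. If a class replaces \enddocument with its own definition that no longer calls the hooks, the corresponding lines will be missing.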

With revtex your chances of getting some sensible tagging are a bit better. But to correct the errors without breaking the visual output of the class, you need someone who knows the class code and is willing to spend some time on it.

dpantele commented 11 months ago

I see, thanks for the confirmation.

So the TL;DR is that the process should be as follows: if we can get a small, consistent snippet that fails with a standard class, we report it here and try to come up with a hotfix or upstream patch. For non-standard classes I guess we could also report it, but there is less chance of this being resolved.

What is the status for the latex (with the intermediate dvi format), which is the only 'proper' way to compile e.g. the https://arxiv.org/abs/2201.00151 sample?

u-fischer commented 11 months ago

> For non-standard classes I guess we could also report it, but there is less chance of this being resolved.

You can naturally try to patch such a class, or write a replacement. But be aware that most errors and problems are caused by class-specific code, so one really must look into the class and understand what it is doing. We are naturally willing to help, but we can't fix all classes floating around. If you find problems, also report them to the maintainers so that they know that their classes are not compatible.
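When looking into such a class, it can help to log the structure stack around the construct that misbehaves. The sketch below assumes the struct-stack key of \ShowTagging as described in the tagpdf documentation; the document content is invented for illustration.

```latex
% stackcheck.tex: hypothetical test file
\DocumentMetadata{testphase={phase-III,math},pdfversion=2.0}
\documentclass{article} % replace with the class under test
\begin{document}
\section{A section}
Text before the float.
\ShowTagging{struct-stack=log}% write the stack before the float to the log
\begin{figure}
  \rule{3cm}{2cm}% placeholder box instead of an external graphic
  \caption{A caption}
\end{figure}
\ShowTagging{struct-stack=log}% write the stack after the float to the log
Text after the float.
\end{document}
```

If the float structure is closed properly, one would expect the second log entry to show no leftover float entry compared to the first. An extra float entry on the stack points at the class's float code as the place to look.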

> What is the status for the latex (with the intermediate dvi format),

You shouldn't try to produce tagged PDF with latex+dvips. The tagging itself will probably work (though it is not really tested much), but on this route it is not possible to insert real space chars between words, and that is not good for accessibility. If the only reason for dvips is EPS files, you can use pdftex instead.
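For the EPS case, a minimal sketch with pdflatex could look like this. The file name figure.eps is hypothetical. It assumes a reasonably recent TeX Live, where graphicx converts EPS files to PDF on the fly when running under pdftex (this needs the restricted shell escape, which is normally enabled by default). Files with a .ps extension are not covered by this sketch.

```latex
% epsfigure.tex: hypothetical example, compile with pdflatex
\DocumentMetadata{testphase={phase-III,math},pdfversion=2.0}
\documentclass{article}
\usepackage{graphicx} % under pdftex this is assumed to handle the EPS conversion
\begin{document}
\begin{figure}
  \includegraphics{figure.eps}% hypothetical EPS file next to the source
  \caption{A figure included from an EPS file}
\end{figure}
\end{document}
```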

dpantele commented 11 months ago

The document I mentioned above uses .ps includes, and it seems that the only way to get some structure for it is to go through the intermediate dvi, because the auto-pst-pdf package is very fragile. In any case, that's just a single example, and the point about the spacing issue makes sense.

Thanks for all the explanations!