latex3 / tagpdf

Tagging support code for LaTeX

Character `‡` causes problems with text editors when trace gives very big log file #94

Closed jfbu closed 4 months ago

jfbu commented 8 months ago

I apologize in advance, but it took me some time to understand the "issue", which admittedly is one only for me and, even worse, only when I am looking at trace logs.

tagpdf-data.dtx contains

\prop_const_from_keyval:Nn \c_@@_role_rules_prop
 {
    0..n = 1,
    0..1 = 2,
    1    = 3, %StructTreeRoot, not really needed
    [a]  = 4, %ruby
    [b]  = 5, %warichu
    c    = 6, % WP ??
    ‡    = 7, % Part,Div,NonStruct -> "check parent"
    ∅*   = 8, % or negative by default?
    ∅    = -1,
 }

So when one uses \tracingmacros with pdflatex, one may end up with this non-ascii character in the log file (which is nothing exceptional per se, of course). See at the bottom of this message the latex3/tagging-project#55 example file, with tracing of \listoffigures added.

Maybe a hard linewrap is inserted by pdftex like here:

[Screenshot, 2024-01-11: the log file in Emacs, with the bytes of `‡` shown as octal sequences and separated by an end of line]

then (this is not the problem yet) Emacs will display the file using raw-text-unix (the single bytes are shown as octal sequences above).

However, in real life, traces may be and often are huge, like 10 or 20 megabytes. If the traced part enclosed some text going through \UTFviii@three@octets from non-ascii letters, the octal 342 byte would show up early. But in a test file using no such things, where the tracing encloses \tableofcontents or \listoffigures, the isolated byte comes only from a possible split at end of line due to the maximal line length of 79 chars. E.g. with the test file below, on second compilation, the screenshot shows the EOL-separated bytes at lines 8820-8821. So the Emacs heuristics do not see it (it seems) and Emacs opens the file assigning it the "UTF-8" encoding. Then, when one tries to save the Emacs buffer, it complains that utf-8-unix is not able to encode the file.
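The effect of such a hard wrap can be reproduced outside TeX. The following Python sketch is only an illustration: it assumes that the wrap happens after a fixed count of 79 bytes, regardless of character boundaries, and the helper names `wrap_bytes` and `is_valid_utf8` are made up for this example.

```python
# Sketch: a byte-based line wrap can cut a multi-byte UTF-8 character in two.
# Assumption: the wrap is applied after a fixed number of bytes (79 here),
# without regard to character boundaries.

DAGGER = "‡".encode("utf-8")  # three bytes: 0xE2 0x80 0xA1 (octal 342 200 241)


def wrap_bytes(data: bytes, width: int = 79) -> bytes:
    """Insert a newline after every `width` bytes of `data`."""
    chunks = [data[i:i + width] for i in range(0, len(data), width)]
    return b"\n".join(chunks)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as UTF-8 without error."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# 78 ASCII bytes, then the double dagger: its first byte lands in column 79,
# so the wrap falls between 0xE2 and 0x80.
line = b"x" * 78 + DAGGER + b" = 7"
wrapped = wrap_bytes(line)

print(is_valid_utf8(line))     # the unwrapped line is valid UTF-8
print(is_valid_utf8(wrapped))  # the wrapped text is not
```

The unwrapped line decodes fine, while the wrapped one does not, because the newline sits between the lead byte and its continuation bytes.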

So my question here is whether it is possible to use characters in the ascii range for the task done by the double dagger here. Only for convenience if I continue to stare bewildered at hundreds of thousands of lines of latex code... and want to copy paste some parts to save them as separate files for comparison later...

... the work-around for me is simply to add

\def\foo{‡}
\tracingmacros1
\foo
\tracingmacros0

near top of file, so Emacs will see isolated bytes (not quite at top because there are many package loading lines) soon enough and not think by mistake the log file is utf-8 encoded.

I have no idea if the double dagger comes from some internals of the PDF 1.7 spec or is purely a convenience for tagpdf code style.

Test file (compile twice with pdflatex, and then use Emacs (what else?) to visit the log file)

\DocumentMetadata{
 uncompress,
 pdfversion=1.7,
 lang=en-US,
 testphase=phase-III
}

\documentclass{article}

\begin{document}

\tracingmacros1
\listoffigures
\tracingmacros0

\section{foo}
\begin{figure}[htbp]
  \centering
  hello
  \caption{hello}
\end{figure}
\end{document}

u-fischer commented 8 months ago

> I have no idea if the double dagger comes from some internals of the PDF 1.7 spec or is purely a convenience for tagpdf code style.

The symbol is used in the PDF 2.0 spec, in the table which shows the parent-child rules. I could use something else, but using the same symbol makes it easier to compare my csv-table and the table in the spec.

But as you said, it is not exceptional nowadays to have UTF-8 in the log; I bet you get a lot of it if there is some French text in your document. So I don't think it makes much sense for me to avoid its use in tagpdf. Instead, you should tell Emacs to always treat log files as byte files (no idea how, but my editor WinEdt does that, and if it can do it, Emacs should be able to do it too).

jfbu commented 8 months ago

> > I have no idea if the double dagger comes from some internals of the PDF 1.7 spec or is purely a convenience for tagpdf code style.
>
> The symbol is used in the PDF 2.0 spec, in the table which shows the parent-child rules. I could use something else, but using the same symbol makes it easier to compare my csv-table and the table in the spec.

Understood. I see there are many, many occurrences in the parent-child csv.

> But as you said, it is not exceptional nowadays to have UTF-8 in the log; I bet you get a lot of it if there is some French text in your document. So I don't think it makes much sense for me to avoid its use in tagpdf.

The three bytes of the ‡ can show up at arbitrary places in the log, so it is not only a matter of the expansion of the three control sequences where it is located originally. Agreed, UTF-8 in the log is not exceptional; the problem here is invalid UTF-8 due to EOL insertion, which is less likely to arise from merely using sentences in French.

But I agree there is no strong case at all for avoiding it in tagpdf. I hesitated before opening the ticket, but did so after hitting this about a dozen times over a period of days, each time wondering what was happening. It will not happen for reasonably sized log files, as Emacs will consider the file byte-encoded after seeing an isolated byte near the top of the log file (perhaps within 1000 lines? not sure). Maybe there is a bug in Emacs, in that once you scroll to that part of the file with the badly split bytes, it could reassess what the file encoding is.
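To check how far down a given log the first broken sequence actually sits, one could scan it with a few lines of Python. This is only a sketch: the function name `first_invalid_utf8` is made up for this example, and the sample bytes stand in for a real log.

```python
# Sketch: report where a byte string first stops being valid UTF-8.
# The sample data is artificial; a real log would be read in binary mode.


def first_invalid_utf8(data: bytes):
    """Return (line_number, byte_offset) of the first invalid byte, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that could not be decoded.
        line_number = data.count(b"\n", 0, err.start) + 1
        return line_number, err.start
    return None


# A two-line sample where the end of line separates the bytes of a "‡".
sample = b"ok\n" + b"a" * 78 + b"\xe2\n\x80\xa1 = 7\n"
print(first_invalid_utf8(sample))
print(first_invalid_utf8("‡ = 7\n".encode("utf-8")))
```

For a real file, something like `first_invalid_utf8(open("test.log", "rb").read())` would do, where `test.log` is a placeholder name. A large line number in the result would be consistent with an editor that only samples the beginning of the file guessing UTF-8.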

> Instead, you should tell Emacs to always treat log files as byte files (no idea how, but my editor WinEdt does that, and if it can do it, Emacs should be able to do it too).

I trust this is possible, but then when I use xelatex/lualatex the log will not be seen as nicely as it could be.

Sorry for some noise here.