Closed jfbu closed 4 months ago
I have no idea if the double-dagger comes from some internals of the PDF 1.7 spec or is purely a convenience for tagpdf code style.
The symbol is used in the PDF 2.0 spec, in the table which shows the parent-child rules. I could use something else, but using the same symbol makes it easier to compare my csv-table and the table in the spec.
But as you said, it is not exceptional nowadays to have UTF-8 in the log; I bet you get a lot if there is some French text in your document. So I don't think that it makes much sense if I avoid its use in tagpdf. Instead you should tell Emacs to always treat log files as byte files (no idea how, but my editor WinEdt does that, and if it can do it, Emacs should be able to do that too).
I have no idea if the double-dagger comes from some internals of the PDF 1.7 spec or is purely a convenience for tagpdf code style.
The symbol is used in the PDF 2.0 spec, in the table which shows the parent-child rules. I could use something else, but using the same symbol makes it easier to compare my csv-table and the table in the spec.
Understood, I see there are many, many occurrences in the parent-child CSV.
But as you said, it is not exceptional nowadays to have UTF-8 in the log; I bet you get a lot if there is some French text in your document. So I don't think that it makes much sense if I avoid its use in tagpdf.
The three bytes of the ‡ can show up at arbitrary places in the log, so it is not only a matter of the expansion of the three control sequences where it is originally located. Agreed, UTF-8 in the log is not exceptional; the problem here is invalid UTF-8 due to EOL insertion, which is less likely to arise from only using sentences in French.
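To make the EOL problem concrete, here is a minimal Python sketch (not anything tagpdf or pdftex ships). It shows that the three UTF-8 bytes of ‡ decode fine when kept together, but stop being valid UTF-8 once a newline lands between them. The names `intact` and `split` and the surrounding text "Part" are made up for illustration.

```python
# Sketch: why a hard line break inside a multi-byte character breaks UTF-8.
DAGGER = "\u2021"                 # the double dagger used in the parent-child table
encoded = DAGGER.encode("utf-8")  # three bytes: octal 342 200 241


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical log fragments: one with the character intact, one where a
# line break was inserted after the first of its three bytes.
intact = b"Part" + encoded + b"\n"
split = b"Part" + encoded[:1] + b"\n" + encoded[1:] + b"\n"

print([oct(b) for b in encoded])  # ['0o342', '0o200', '0o241']
print(is_valid_utf8(intact))      # True
print(is_valid_utf8(split))       # False
```

An editor that decides on UTF-8 from the start of a large file only runs into the `split` case when it later reaches, or tries to save, that part of the buffer.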
But I agree there is no strong case at all for avoiding it in tagpdf. I hesitated to open the ticket, but did so after running into this about a dozen times over a few days, each time wondering what was happening. It will not happen for reasonably sized log files, as Emacs will treat the file as raw bytes once it sees an isolated byte near the top of the log file (perhaps within the first 1000 lines? not sure). Maybe there is a bug in Emacs, in that once you scroll to that part of the file with the badly split bytes it could reassess what the file encoding was.
Instead you should tell Emacs to always treat log files as byte files (no idea how, but my editor WinEdt does that, and if it can do it, Emacs should be able to do that too).
I trust this is possible, but then when I use xelatex/lualatex the log will not be seen as nicely as it could be.
Sorry for some noise here.
I apologize in advance. But it took me some time to understand the "issue", which admittedly is only one for me, and even worse, only when involved in looking at trace logs.
tagpdf-data.dtx contains
So when one does \tracingmacros with pdflatex one may end up with this non-ASCII character in the log file (which is nothing exceptional per se of course). See at the bottom of this message the latex3/tagging-project#55 example file with tracing of \listoffigures added.
Maybe a hard line wrap is inserted by pdftex, like here:
Then (this is not the problem yet) Emacs will display the file using raw-text-unix (the single bytes are shown as octal sequences above).
However, in real life, traces may be, and often are, huge, like 10 or 20 megabytes. If the traced code encloses some text using \UTFviii@three@octets from non-ASCII letters, the octal 342 byte will show up soon. But in a test file using no such things, with the tracing enclosing \tableofcontents or \listoffigures, it will come only from the possible splitting at end of line due to the maximal line length of 79 characters; e.g. with the test file below, on the second compilation, the screenshot shows the EOL-separated bytes at lines 8820-8821. So the Emacs heuristics do not see it (it seems), and Emacs opens the file assigning it the "UTF-8" encoding. Then when one tries to save the Emacs buffer it will complain about utf-8-unix not being able to encode the file.
So my question here is whether it is possible to use characters in the ASCII range for the task done by the double dagger here. Only for convenience if I continue to stare bewildered at hundreds of thousands of lines of latex code... and want to copy paste some parts to save them as separate files for comparison later...
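For locating such places in a huge trace without relying on the editor, a small Python sketch along these lines might help. It is only an illustration under assumptions: it reads the log as raw bytes and reports the lines whose last bytes are the beginning of a multi-byte UTF-8 sequence that the line break cut off. The function names and the sample bytes are hypothetical.

```python
# Sketch: report log lines that end in the middle of a UTF-8 sequence.

def expected_length(lead: int) -> int:
    """Length of the UTF-8 sequence announced by a lead byte."""
    if lead >= 0xF0:
        return 4
    if lead >= 0xE0:
        return 3
    if lead >= 0xC0:
        return 2
    return 1


def truncated_tail(line: bytes) -> bool:
    """True if the line ends with an incomplete multi-byte sequence."""
    i = len(line) - 1
    steps = 0
    # Walk back over at most three trailing continuation bytes (0x80-0xBF).
    while i >= 0 and steps < 3 and 0x80 <= line[i] <= 0xBF:
        i -= 1
        steps += 1
    if i < 0 or line[i] < 0xC0:
        return False              # no lead byte at the end of this line
    present = len(line) - i       # lead byte plus continuation bytes found
    return present < expected_length(line[i])


def find_split_lines(data: bytes) -> list[int]:
    """1-based numbers of the lines where a sequence is cut by the EOL."""
    lines = data.split(b"\n")
    return [n for n, line in enumerate(lines, start=1)
            if truncated_tail(line.rstrip(b"\r"))]


# Made-up sample: line 2 ends after the first byte of the double dagger.
sample = b"first line\n...Part\xe2\n\x80\xa1 rest\nlast \xe2\x80\xa1\n"
print(find_split_lines(sample))   # [2]
```

On a real log one would pass the result of reading the file in binary mode, e.g. `open(path, "rb").read()` with `path` being whatever the log file is called.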
... the work-around for me is simply to add
near top of file, so Emacs will see isolated bytes (not quite at top because there are many package loading lines) soon enough and not think by mistake the log file is utf-8 encoded.
I have no idea if the double-dagger comes from some internals of the PDF 1.7 spec or is purely a convenience for tagpdf code style.
Test file (compile twice with pdflatex, and then use Emacs (what else?) to visit the log file)