ho-tex / pdftexcmds

pdftexcmds package for LaTeX
LaTeX Project Public License v1.3c

Dropping byte handling of UTF-8 strings breaks pdfx #3

Closed amyspark closed 8 months ago

amyspark commented 4 years ago

Dear Mr Carlisle,

In 14caf12a8f4aa0e40a2de32eab2a19da8f7a29a0, byte handling was removed from LuaTeX's string-escaping primitive. This breaks pdfx: it now outputs mojibake in place of extended-ASCII characters:

\documentclass{article}
%\usepackage[pdfencoding=auto,pdfa,pdfversion=1.4]{hyperref}
\usepackage[a-1b]{pdfx}
\listfiles

\begin{document}
    \section{holá}
\end{document}
Screenshot 2019-12-06 at 17:41:33 (mojibake in the PDF output)

I reported this bug to the maintainer of pdfx initially, and he figured out the issue was in your package.

Could you have a look at this?

u-fischer commented 4 years ago

\pdf@escapestring has been changed so that it gives something sensible with luatex and non-ASCII input. It looks as if the pdfescape package needs to be adapted to these changes.

But pdfx shouldn't set pdfencoding=auto for luatex. The recommended encoding (and the default) is unicode.

\documentclass{article}
\usepackage{hyperref}
\hypersetup{pdfencoding=auto} %breaks ...
\begin{document}
\section{holá}
\end{document}
ozross commented 4 years ago

Thanks Ulrike. I'll look at this.

ozross commented 4 years ago

OK. There is definitely a slight issue here, involving some discrepancy between the results from different engines.

With the example as given:

\documentclass{article}
\usepackage{hyperref}
\hypersetup{pdfencoding=auto}
\begin{document}
\section{holá}
\end{document}

using pdftexcmds.sty 2019/11/24 v0.31 the contents of the .out file become, from pdfLaTeX:

\BOOKMARK [1][-]{section.1}{hol\341}{}% 1

from XeLaTeX:

\BOOKMARK [1][-]{section.1}{holá}{}% 1

from LuaLaTeX:

\BOOKMARK [1][-]{section.1}{\376\377\000h\000o\000l\000\341}{}% 1

In all three cases, a second run with the same engine shows the bookmark correctly, as holá; but if you run pdfTeX or XeTeX first and then do a run with LuaTeX, you get mojibake for the bookmark. OK; that situation is unlikely in practice, except when experimenting.
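For reference, here is a byte-level reading of the three bookmark strings (my annotation, assuming standard PDF string conventions; octal escapes as written by hyperref):

```latex
% pdfLaTeX:  hol\341                 -> "hol" + 0xE1, i.e. á in PDFDocEncoding
% XeLaTeX:   holá                    -> literal UTF-8 bytes written to the file
% LuaLaTeX:  \376\377\000h\000o\000l\000\341
%                                    -> 0xFE 0xFF (UTF-16BE byte-order mark),
%                                       then one 16-bit code unit per character
%                                       (U+0068 U+006F U+006C U+00E1)
```

That mismatch is why mixing engines across runs produces mojibake: each engine writes the .out file in a different encoding and then re-reads whatever the previous run left behind.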

However, with pdftexcmds.sty 2018/09/10 v0.29 the LuaLaTeX result (i.e., the .out file contents) is identical to pdfLaTeX's, and mostly shows correctly in the bookmark. Only when a XeTeX run is followed by a LuaTeX run is there mojibake.

Is the above all as is intended?

Now suppose we just use \usepackage{hyperref} without any mention of pdfencoding. It seems even harder to get mojibake then (only XeTeX followed by either pdfTeX or LuaTeX). The default is unicode, right?

So what is the real purpose of using pdfencoding=auto? In what sense is it 'automatic'? Why would one ever need it, unless perhaps when working with old files in (non-UTF) legacy encodings?

u-fischer commented 4 years ago

So what is the real purpose of using pdfencoding=auto ?

pdfencoding=auto analyzes the string and then uses either pdfdoc or unicode in the bookmarks.

Try (with lualatex)

\documentclass{article}
\usepackage{hyperref}
\hypersetup{pdfencoding=auto}
\begin{document}
\section{holá}
\section{holáα}
\end{document}

This gives

\BOOKMARK [1][-]{section.1}{hol\303\241}{}% 1
\BOOKMARK [1][-]{section.2}{\376\377\000h\000o\000l\000\341\003\261}{}% 2
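Reading those two strings byte by byte (my annotation, assuming standard PDF string escapes):

```latex
% section.1: hol\303\241 -> "hol" + 0xC3 0xA1, the UTF-8 bytes of á;
%            every code point fits in an 8-bit string, so no BOM is emitted
% section.2: \376\377 ... -> UTF-16BE with BOM 0xFE 0xFF, forced by α (U+03B1),
%            which has no 8-bit representation:
%            \000h \000o \000l \000\341 (á = U+00E1) \003\261 (α = U+03B1)
```

This is the per-string analysis that auto performs: it falls back to the wider encoding only when a character requires it.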

Why would one ever need it, unless perhaps working with old files having (non-UTF) legacy encodings?

It makes the PDF a bit smaller, but besides that it makes the code complicated, and we are considering removing the option (and pdfdoc encoding) and sticking to unicode.
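If one wants to sidestep the heuristic already, the encoding can be pinned explicitly (a minimal sketch; per the comments above, unicode is also the default):

```latex
\documentclass{article}
\usepackage{hyperref}
\hypersetup{pdfencoding=unicode} % always UTF-16BE bookmarks, no per-string analysis
\begin{document}
\section{holá}
\section{holáα}
\end{document}
```

With this setting both bookmarks are written in the same BOM-prefixed form, so the .out file no longer mixes encodings.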

ozross commented 4 years ago

Interesting example. The bookmarks work fine, but only pdfLaTeX (TeX Live 2019) gives a message that the α character is not set up for use (i.e., no encoding support, as well as no font to show it). Both XeLaTeX and LuaLaTeX just accept the input quietly despite not showing the character. This has nothing to do with hyperref, but shouldn't LaTeX be more vocal?

davidcarlisle commented 4 years ago

@ozross LaTeX cannot trap missing characters (well, I suppose it could in Lua), but luatex and xetex both act like classic TeX and report missing characters as an essentially non-configurable message in the log. So for luatex you get a log line of

Missing character: There is no α (U+03B1) in font [lmroman12-bold]:+tlig;!

and in xetex

Missing character: There is no α in font [lmroman12-bold]:mapping=tex-text;!

You see essentially the same from classic tex for missing characters.

ozross commented 4 years ago

Yes, they are messages in the .log file, but not in the Console window (which is less noisy, at least on macOS). Such messages can thus very easily go unnoticed.

In pdfLaTeX, on the other hand, the message comes from code in inputenc.sty:

./book-bake.tex:9: Package inputenc Error: Unicode character α (U+03B1)
(inputenc)                not set up for use with LaTeX.

See the inputenc package documentation for explanation.
Type  H <return>  for immediate help.
 ...                                              

l.9 \section{holáα}

? 

which really puts it right in your face. It'd be nice to have something similar (though it need not be as dramatic) in the other engines.

davidcarlisle commented 4 years ago

No, that is completely different; try just using a latin1 file with á in plain pdftex and you'll see the same behaviour.

The error you show from inputenc is not font related at all, and you will get it whether or not the font has the character; it relates to the lack of a mapping from the active characters to the font setup.

You cannot make every character active in xetex/luatex, as then you would only be able to have ASCII command names (and it would be much slower).

ozross commented 4 years ago

Yes, I understand that it is happening at a different stage of processing, and due to a different characterisation of what is a character. With the move to handling full unicode input, it seems that conceptually there ought to be a way of alerting a user to the fact that some non-whitespace characters from the input have not produced visible output on the PDF page (despite having passed all the way through the typesetting machinery). It doesn't have to be feedback as each such character is encountered – TeX does that already. Maybe just a single warning or error message at the end is enough. An extra message from the engine itself might be the best way to do it, if LaTeX processing cannot detect the appropriate tokens or conditions?

Opening up the side panel to get this (see image) just seems rather strange:

Screen Shot 2019-12-10 at 12:42:44 pm (log messages visible only in the side panel)
davidcarlisle commented 4 years ago

It would need to be an engine feature; xetex is just following classic TeX here and sending the message to the log. In practice the console/log distinction is not as important as it used to be, as most people use a TeX IDE, and most of those hide the console and show a filtered view of the log instead, so you'd need to put in a request to TeXworks etc. to include the missing-character information in the warning summary.

One change coming up (if not there already) is that luaotfload will (I think) show missing-character glyphs rather than nothing in the typeset output.

u-fischer commented 4 years ago

You can use \tracinglostchars=2; then the messages also appear in the terminal.
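A minimal test file for this (my sketch, to be run with lualatex or xelatex):

```latex
\documentclass{article}
\tracinglostchars=2 % 1: "Missing character" goes to the .log only; 2: also to the terminal
\begin{document}
\section{holáα} % α is absent from the default Latin Modern font, so the
                % warning is now echoed on screen instead of hiding in the log
\end{document}
```

(If I recall correctly, recent engines also accept a value of 3, which promotes the message to an error, but 2 is enough for the visibility problem discussed here.)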

Missing glyphs do show up (in xelatex and a current lualatex), but how depends on the font. Some have a dedicated .notdef glyph:

(image: the font's .notdef glyph rendered in place of the missing character)

but the Gyre fonts (including Latin Modern) show only a space:

(image: only a blank space where the glyph is missing)

ozross commented 4 years ago

So the difference between 1 and 2 for \tracinglostchars affects where the message is written. Very simple, but I'd not have guessed it. Thanks again, Ulrike. And the missing-character symbol certainly helps; but as you say, it's not always visible.

u-fischer commented 8 months ago

Closing, as the bookmarks look fine with a current lualatex.