Closed: amyspark closed this issue 8 months ago.
\pdf@escapestring has been changed so that it now gives something sensible with LuaTeX and non-ASCII input. It looks as if the pdfescape package needs to be adapted to these changes.
But pdfx shouldn't set pdfencoding=auto for LuaTeX. The recommended encoding (and the default) is unicode.
\documentclass{article}
\usepackage{hyperref}
\hypersetup{pdfencoding=auto} %breaks ...
\begin{document}
\section{holá}
\end{document}
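If the goal is simply correct bookmarks under LuaLaTeX, a minimal sketch following the recommendation above is to keep the default unicode encoding (or set it explicitly, as below) rather than auto:

```latex
\documentclass{article}
\usepackage{hyperref}
% unicode is already the default with LuaTeX; setting it explicitly
% avoids the auto heuristic that this issue is about
\hypersetup{pdfencoding=unicode}
\begin{document}
\section{holá}
\end{document}
```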
Thanks Ulrike. I'll look at this.
OK. There is definitely a slight issue here, involving some discrepancy between the results from different engines.
With the example as given:
\documentclass{article}
\usepackage{hyperref}
\hypersetup{pdfencoding=auto}
\begin{document}
\section{holá}
\end{document}
using pdftexcmds.sty 2019/11/24 v0.31, the contents of the .out file become:

from pdfLaTeX:
\BOOKMARK [1][-]{section.1}{hol\341}{}% 1
from XeLaTeX:
\BOOKMARK [1][-]{section.1}{holá}{}% 1
from LuaLaTeX:
\BOOKMARK [1][-]{section.1}{\376\377\000h\000o\000l\000\341}{}% 1
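For orientation, the three escape forms can be read informally as follows (my annotation, assuming the standard PDF string encodings; not tool output):

```latex
% \341                        -> "á" as a single PDFDocEncoding byte (octal 341 = 0xE1)
% holá                        -> the literal UTF-8 bytes, written untouched by XeTeX
% \376\377\000h...\000\341    -> UTF-16BE, signalled by the \376\377 (0xFEFF) byte-order mark
```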
In all three cases, a second run with the same engine shows the bookmark correctly, as holá; but if you run with pdfTeX or XeTeX first and then do a run with LuaTeX, you get mojibake for the bookmark. OK; that situation is unlikely in practice, unless you are experimenting.
However, with pdftexcmds.sty 2018/09/10 v0.29, the LuaLaTeX result (i.e., the .out file contents) is identical to the pdfLaTeX one, and mostly shows correctly in the bookmark. Only if a XeTeX run is followed by a LuaTeX run is there mojibake.
Is the above all as is intended?
Now suppose we just use \usepackage{hyperref} without any mention of pdfencoding. It seems that it is even harder to get mojibake (only XeTeX followed by either pdfTeX or LuaTeX). The default is unicode, right?
So what is the real purpose of using pdfencoding=auto? In what sense is it `automatic'? Why would one ever need it, unless perhaps working with old files having (non-UTF) legacy encodings?
So what is the real purpose of using pdfencoding=auto?
pdfencoding=auto analyzes the string and then uses either pdfdoc or unicode in the bookmarks.
Try (with lualatex)
\documentclass{article}
\usepackage{hyperref}
\hypersetup{pdfencoding=auto}
\begin{document}
\section{holá}
\section{holáα}
\end{document}
This gives
\BOOKMARK [1][-]{section.1}{hol\303\241}{}% 1
\BOOKMARK [1][-]{section.2}{\376\377\000h\000o\000l\000\341\003\261}{}% 2
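An informal byte-level reading of these two lines (my annotation, not tool output): the first bookmark stays an 8-bit string, while the second switches to UTF-16BE because α has no 8-bit representation:

```latex
% section 1: hol\303\241
%            8-bit string; \303\241 = the UTF-8 bytes 0xC3 0xA1 of "á"
% section 2: \376\377\000h\000o\000l\000\341\003\261
%            UTF-16BE (BOM \376\377 = 0xFEFF); \000\341 = U+00E1 (á), \003\261 = U+03B1 (α)
```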
Why would one ever need it, unless perhaps working with old files having (non-UTF) legacy encodings?
It makes the PDF a bit smaller, but beyond this it makes the code complicated, and we are considering removing the option (and pdfdoc encoding) and sticking to unicode.
Interesting example. The bookmarks work fine, but only pdfLaTeX (TeX Live 2019) gives a message that the α character is not set up for use (i.e., no encoding support, as well as no font to show it). Both XeLaTeX and LuaLaTeX just accept the input quietly despite not showing the character. This has nothing to do with hyperref, but shouldn't LaTeX be more vocal?
@ozross LaTeX cannot trap missing characters (well, I suppose it could in Lua), but LuaTeX and XeTeX both act as classic TeX and make missing characters an essentially non-configurable message in the log; so for LuaTeX you get a log line of
Missing character: There is no α (U+03B1) in font [lmroman12-bold]:+tlig;!
and in xetex
Missing character: There is no α in font [lmroman12-bold]:mapping=tex-text;!
You see essentially the same from classic tex for missing characters.
Yes, they are messages in the .log file, but not in the Console window (which is less noisy, at least on macOS). Thus such messages can very easily go unnoticed.
In pdfLaTeX, on the other hand, the message comes from coding in inputenc.sty:
./book-bake.tex:9: Package inputenc Error: Unicode character α (U+03B1)
(inputenc) not set up for use with LaTeX.
See the inputenc package documentation for explanation.
Type H <return> for immediate help.
...
l.9 \section{holáα}
?
which really puts it into your face. It'd be nice to have something similar (though need not be as dramatic) in the other engines.
No, that is completely different; try just using a latin1 file with á in plain pdfTeX and see the same behaviour.
The error you show from inputenc is not font related at all, and you will get it whether or not the font has the character; it relates to the lack of a mapping from the active characters to the font setup.
You cannot make every character active in XeTeX/LuaTeX, as then you would only be able to have ASCII command names (and it would be much slower).
Yes, I understand that it is happening at a different stage of processing, and due to a different characterisation of what is a character. With the move to handling full unicode input, it seems that conceptually there ought to be a way of alerting a user to the fact that some non-whitespace characters from the input have not produced visible output on the PDF page (despite having passed all the way through the typesetting machinery). It doesn't have to be feedback as each such character is encountered – TeX does that already. Maybe just a single warning or error message at the end is enough. An extra message from the engine itself might be the best way to do it, if LaTeX processing cannot detect the appropriate tokens or conditions?
Opening up the side panel to get this (see image) just seems rather strange:
It would need to be an engine feature; XeTeX is just following classic TeX here and sending to the log. In practice the console/log distinction is not as important as it used to be, as most people use a TeX IDE, and most of them hide the console and then show a filtered view of the log. So you'd need to put in a request to TeXworks etc. to include the missing-character information in the warning summary.
One change coming up (if not there already) is that luaotfload will (I think) show missing-character glyphs rather than nothing in the typeset output.
You can use \tracinglostchars=2; then the messages also appear in the terminal.
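As a sketch, with LuaLaTeX (Latin Modern has no α, so the warning fires; the exact message wording is as in the log excerpts above):

```latex
\documentclass{article}
\tracinglostchars=2 % 2 sends "Missing character" to the terminal as well as the log
\begin{document}
\section{holáα} % α triggers: Missing character: There is no α (U+03B1) in font ...
\end{document}
```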
Missing glyphs do show up (in XeLaTeX and a current LuaLaTeX), but how depends on the font. Some have a dedicated /.notdef glyph:
but the Gyre fonts (including Latin Modern) show only a space:
So the difference between 1 and 2 for \tracinglostchars affects where the message is written. Very simple; but I'd not have guessed it. Thanks again Ulrike.
And the missing character symbol certainly helps; but as you say, it's not always visible.
Closing, as the bookmarks look fine with a current LuaLaTeX.
Dear Mr Carlisle,
In 14caf12a8f4aa0e40a2de32eab2a19da8f7a29a0, byte handling was removed from the string-escaping primitive of LuaTeX. This breaks pdfx: it now outputs mojibake in place of extended ASCII characters. I reported this bug to the maintainer of pdfx initially, and he figured out the issue was in your package. Could you have a look at this?