latex3 / pdfresources

LaTeX PDF resource management
LaTeX Project Public License v1.3c

Japanese characters in metadata on (u)pLaTeX #18

Open aminophen opened 3 years ago

aminophen commented 3 years ago

I'm not sure this is the right place, sorry.

After loading \RequirePackage{pdfmanagement-testphase}, adding Japanese characters into PDF metadata does not work at all on (u)pLaTeX + dvipdfmx.

upLaTeX + dvipdfmx

Nowadays we use the following syntax.

\documentclass[dvipdfmx]{ujarticle}
\usepackage{hyperref}
\usepackage{pxjahyper}
\hypersetup{pdftitle={日本語}}
\begin{document}
\section{はじめに}
あいうえお。
\end{document}

If we add

\RequirePackage{pdfmanagement-testphase}
\DeclareDocumentMetadata
  {
    backend = dvipdfmx
  }

then an error happens:

! LaTeX3 Error: Invalid UTF-8 string: missing continuation byte (x3).

It seems that "l3str-convert" does not support Japanese characters. (Maybe similar to https://github.com/latex3/latex3/issues/939; you should consider better handling of Japanese tokens; just passing them as-is, literally untouched, is fine for us.)

pLaTeX + dvipdfmx

Nowadays we use the following syntax.

\documentclass[dvipdfmx]{jarticle}
\usepackage{hyperref}
\usepackage{pxjahyper}
\hypersetup{pdftitle={日本語}}
\begin{document}
\section{はじめに}
あいうえお。
\end{document}

If we add

\RequirePackage{pdfmanagement-testphase}
\DeclareDocumentMetadata
  {
    backend = dvipdfmx
  }

then an error happens:

! Package pxjahyper Error: The hyperref 'unicode' mode is not supported
(pxjahyper)                on the pTeX engine.

The error itself is natural enough, since the pLaTeX engine cannot handle Unicode at all. Therefore, "hyperref" should not enable "unicode" mode on pLaTeX.

FYI: What does the "pxjahyper" package do?

When using (u)pLaTeX, we need to give the correct "encoding conversion rule" to dvipdfmx by using the pdf:tounicode special. The rule is provided by a ToUnicode CMap developed by Adobe or other contributors (for upLaTeX "UTF8-UTF16" or "UTF8-UCS2", for pLaTeX "EUC-UCS2" or "90ms-RKSJ-UCS2"), but we need to select the correct one depending on "which engine is running (pLaTeX or upLaTeX)" and "which encoding is used (mostly Shift-JIS on win32 / EUC-JP on Unix)". The "pxjahyper" package does this automatically.
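
For example, the selection boils down to a single special. A minimal sketch (assuming a UTF-8 source on upLaTeX + dvipdfmx, using one of the CMap names listed above):

\special{pdf:tounicode UTF8-UTF16}

Roughly speaking, after this special dvipdfmx decodes the strings it finds in later pdf: specials (document info, bookmarks, and so on) through that CMap.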

aminophen commented 3 years ago

Handling Japanese characters on (u)pTeX is a big deal, due to complicated traditional encoding rules. It is almost impossible to support such complicated conversion rules properly, even for us Japanese, without knowing the historical reasons. So, we recommend you "just leave it to us, without any conversion." A halfway conversion makes things more complicated, so please pass it to us literally!

u-fischer commented 3 years ago

I'm not sure this is the right place, sorry.

It's fine, one can always move if needed.

It seems that "l3str-convert" does not support Japanese characters.

Yes, I guess it is this problem, so we will have to solve that first.

The error itself is natural enough, since the pLaTeX engine cannot handle Unicode at all. Therefore, "hyperref" should not enable "unicode" mode on pLaTeX.

Well, the driver used by hyperref when the pdfmanagement is used is the "generic" one, which means it doesn't contain any engine tests. And yes, it forces the unicode option. But the option mostly declares how hyperref writes out strings to the pdf or dvi, so I'm not sure why ptex cares about it.

Side question: which input encoding is used with ptex? utf8 or something else?

aminophen commented 3 years ago

But the option mostly declares how hyperref writes out strings to the pdf or dvi, so I'm not sure why ptex cares about it.

The check is done by the pxjahyper package (not by pTeX), as the package knows it cannot work as expected when the strings are written out in unicode.

Side question: which input encoding is used with ptex? utf8 or something else?

It depends on the user ;-) I think many users are writing in UTF-8 these days, but in some situations people still choose to write in Shift-JIS or EUC-JP for historical reasons. The pTeX engine accepts all of these encodings via the -kanji=<enc> runtime option (e.g. -kanji=sjis, -kanji=euc or -kanji=utf8).

josephwright commented 3 years ago

@u-fischer Move to the latex3 repo?

u-fischer commented 3 years ago

The check is done by the pxjahyper package (not by pTeX), as the package knows it cannot work as expected when the strings are written out in unicode.

But the strings are not written out "in unicode". I mean, everything in the PDF is ascii. hyperref writes out strings like (\376\377\000g\000r\000\374\000\337\000e). They may look curious, but they are ascii, so why does pTeX choke on them?

u-fischer commented 3 years ago

@josephwright I think there is already an issue about str-convert in latex3, so this can stay here for now.

FrankMittelbach commented 3 years ago

But the strings are not written out "in unicode". I mean, everything in the PDF is ascii. hyperref writes out strings like (\376\377\000g\000r\000\374\000\337\000e). They may look curious, but they are ascii, so why does pTeX choke on them?

Because they are still Unicode, just written in octal, aren't they? The point is (I may be mistaken) that if pTeX doesn't support Unicode but Shift-JIS or EUC-JP, it probably translates to that on input, i.e. it is like inputenc translating to some LICR internally and then working with that. But your output isn't Shift-JIS, it is Unicode, and so it chokes ... my rough guess

u-fischer commented 3 years ago

@FrankMittelbach but I'm writing that out in specials, so only the driver sees it. And this is dvips or something like that, isn't it?

FrankMittelbach commented 3 years ago

maybe because of this?

When using (u)pLaTeX, we need to give the correct "encoding conversion rule" to dvipdfmx by using the pdf:tounicode special. The rule is provided by a ToUnicode CMap developed by Adobe or other contributors (for upLaTeX "UTF8-UTF16" or "UTF8-UCS2", for pLaTeX "EUC-UCS2" or "90ms-RKSJ-UCS2"), but we need to select the correct one depending on "which engine is running (pLaTeX or upLaTeX)" and "which encoding is used (mostly Shift-JIS on win32 / EUC-JP on Unix)". The "pxjahyper" package does this automatically.

the Japanese char ends up in your octals and so the ToUnicode mapping stops working? (just guessing)

davidcarlisle commented 3 years ago

@u-fischer, @aminophen will correct me if I'm wrong, but I think the old model for pTeX was that the specials contained Shift-JIS or whatever in the special, and the dvi driver did the conversion to Unicode in the final stage. Given that \ is Yen in Shift-JIS, I would guess things go badly wrong if \-quoted octal utf-8 gets interpreted that way.

aminophen commented 3 years ago

the old model for pTeX was that the specials contained Shift-JIS or whatever in the special, and the dvi driver did the conversion to Unicode in the final stage.

Correct.

Given that \ is Yen in Shift-JIS, I would guess things go badly wrong if \-quoted octal utf-8 gets interpreted that way.

"\ looks like Yen in shift-jis" is unrelated. The problem here is that "how the octal should be decoded." (using shift-jis? or euc-jp? or utf-8?). The encoding is pre-determined by (u)pTeX engine (not by a macro layer), so you cannot disable such a encoding conversion. Instead, you have to know "how (u)pTeX engine will encode it" and keep consistency with it.

u-fischer commented 3 years ago

@aminophen

I think I can remove the unicode settings from the generic driver, as pdftex now uses unicode by default anyway. This will avoid one of the errors (! Package pxjahyper Error: The hyperref 'unicode' mode is not supported).

It will not solve everything, as the driver doesn't always use hyperref commands for the conversion, so one will have to check whether there are more wrong conversions somewhere, but this first requires that l3str-convert works correctly with Japanese.

aminophen commented 3 years ago

Unfortunately, Japanese devs found out that "it is almost impossible to support Japanese characters within the current behavior of pTeX, as long as l3str-convert uses \tl_to_str:n (= \detokenize) first."

As described in the TUGboat article "Distinguishing 8-bit characters and Japanese characters in (u)pTeX" by H. Kitagawa, pTeX confuses Latin and Japanese character tokens during "stringization". This "stringization" occurs in \meaning, \message, \detokenize and related primitives, so it becomes impossible to distinguish between Latin and Japanese characters after a token has been processed by \detokenize. The proposed improvement of pTeX is quite reasonable, but the patches are so huge that they are not fully tested yet ...
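
A conceptual sketch of that first step (plain expl3, for illustration only; this is not the actual l3str-convert code):

% the argument is flattened to a string (detokenized) before any
% encoding-specific processing takes place
\tl_set:Nx \l_tmpa_tl { \tl_to_str:n { 日本語 } }
% after this point pTeX cannot reliably tell whether the resulting bytes
% came from Japanese character tokens or from 8-bit Latin character tokens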

The current behavior is considered unnatural today; however, at the time pTeX was designed and developed, Unicode 1.0 didn't even exist and 8-bit character input was rare, so the problem had never been exposed for years. Changing the behavior at this stage requires a design change and lots of testing, so it will take much time.

u-fischer commented 3 years ago

Unfortunately, Japanese devs found out that "it is almost impossible to support Japanese characters within the current behavior of pTeX, as long as l3str-convert uses \tl_to_str:n (= \detokenize) first."

The question is if and how one can sanitize the input first. That means if you get an input like

\tl_set:Nx \l_tmpa_tl {\text_purify:n{abcはじめにgrüße}}

can you convert \l_tmpa_tl into a form that can safely be fed to str_convert?

As described in the TUGboat article "Distinguishing 8-bit characters and Japanese characters in (u)pTeX"

Well, I would say that while one can look for macro solutions, engine changes are really needed here ;-) But we quite understand that large changes aren't done fast.

aminophen commented 3 years ago

The question is if and how one can sanitize the input first. That means if you get an input like \tl_set:Nx \l_tmpa_tl {\text_purify:n{abcはじめにgrüße}} can you convert \l_tmpa_tl into a form that can safely be fed to str_convert?

I don't think one can sanitize such an input easily, sorry.

Theoretically it may be possible to delete all Japanese characters first and then restore them, as a "temporary workaround" to cope with the current (u)pTeX behavior. However, such a manipulation would require a deep understanding of how (u)pTeX treats Japanese character tokens. We (Japanese developers) are the only ones who can provide that information, but our human resources for that are limited and would overlap with those needed to examine, improve and fully test the (u)pTeX engine. Therefore, we'd like to concentrate on the improvement of the (u)pTeX engine, rather than devising a temporary workaround.

All we can hope is that you do not make the l3str-convert code the default, at least for (u)pLaTeX, until we manage to release an "improved" version of (u)pTeX in TL2022 or so.

u-fischer commented 3 years ago

All we can hope is that you do not make the l3str-convert code the default,

I don't think that there is a large problem, even if a pTeX user wants to try the pdfmanagement code. It uses the l3str-convert code in various places, but mostly for text for which a sensible user would use ascii, like filenames or destination names.

bookmarks still use the hyperref commands, and if we implement something new here it wouldn't be difficult to add an option to fall back to the older code.

So from a practical point of view, what remains is your example at the start: handling pdftitle and pdfauthor.

Can you provide some code which shows how you would write something into the Info dictionary with the primitive \pdfinfo?

aminophen commented 3 years ago

I don't think that there is a large problem, even if a pTeX user wants to try the pdfmanagement code. It uses the l3str-convert code in various places, but mostly for text for which a sensible user would use ascii, like filenames or destination names.

It's true that a sensible user would use ascii for filenames or destinations, but we don't yet know all the places where l3 is already used or will be used in the future, so I cannot tell whether your point is safe enough.

bookmarks still use the hyperref commands, and if we implement something new here it wouldn't be difficult to add an option to fall back to the older code.

Something like \usepackage[strconvert=2e]{hyperref} would suffice; if you could provide a way to simply fall back to the old code, then we can extend pxjahyper to enable the strconvert=2e option for pTeX. In this scenario, you will not need any knowledge about Japanese tokens, as the old code does no harm for us.

u-fischer commented 3 years ago

It's true that a sensible user would use ascii for filenames or destinations, but we don't yet know all the places where l3 is already used or will be used in the future, so I cannot tell whether your point is safe enough.

I only meant that it is safe enough for the near future. I don't think that it is safe long term: non-ascii is increasingly used in places where traditionally only ascii was used, like file names, command names, label names, URLs, verbatim content like code listings and so on. And that means that solutions that work fine if you only want to print non-ascii are no longer sufficient, and this doesn't apply only to pTeX. 8-bit file encodings, for example, can get problematic too.

Imho it is quite important that the updated ptex engines are made available as fast as possible so that tests can be done with them.

aminophen commented 3 years ago

I only meant that it is safe enough for the near future. I don't think that it is safe long term:

I know, but pTeX will not be able to fully support non-ascii by design, even after the "improvement" of the engine is done. Actually, it has been noted that pLaTeX cannot process some Latin documents due to an incompatible design feature of pTeX regarding how bytes are read. So, we don't hope for full support of non-ascii; all we need is that there is no regression compared to the current behavior.

OTOH, upTeX has the full potential, as it has an enhanced design for storing kcatcodes compared to pTeX.

it is quite important that the updated ptex engines are made available as fast as possible so that tests can be done with them.

We'll look into it. Fingers crossed...

zr-tex8r commented 3 years ago

(I am the author of the pxjahyper package.)

Indeed, we do think that l3str-convert (as well as other expl3 features) must support e-(u)pTeX in the future (unless we totally abandon e-pTeX and/or e-upTeX), and we have already started to act on that. The survey of critical issues is the first step.

Another thing to note: the "old way" of hyperref + pxjahyper works fine, but it heavily depends on hyperref's inputenc-fontenc conversion chain. That conversion handles non-CJK non-ASCII letters in the same way as pdfTeX. Thus I think that the most reasonable way to support e-(u)pTeX for the present would be to use \pdfstringdef (as is) as a fallback when (u)pLaTeX is used.

u-fischer commented 3 years ago

I would suggest that for (u)pLaTeX you do something like this for now if the pdfmanagement is detected:

\RequirePackage{pdfmanagement-testphase}
\DeclareDocumentMetadata{uncompress,backend = dvipdfmx}

\documentclass[dvipdfmx]{ujarticle}
\usepackage{hyperref}
\ExplSyntaxOn
\keys_define:nn { hyp / setup }
  {
    % redefine hyperref's pdftitle key: build the string with
    % \pdfstringdef (hyperref's classical LaTeX2e conversion) instead
    % of the l3str-convert based code, then store it in the Info dictionary
    pdftitle .code:n =
      {
        \tl_if_blank:nTF {#1}
          {
            \pdfmanagement_remove:nn {Info}{Title}
          }
          {
            \pdfstringdef\l_tmpa_tl{#1}
            \pdfmanagement_add:nnx {Info}{Title}{(\l_tmpa_tl)}
          }
      }
  }
\ExplSyntaxOff
\usepackage{pxjahyper}
\hypersetup{pdftitle={日本語}}
\begin{document}
\section{はじめに}
あいうえお。
\end{document}

and similar for the author, subject and keywords keys.
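
For example, a sketch for the pdfauthor key following exactly the same pattern (only Title is replaced by Author; the subject and keywords keys would be analogous):

\ExplSyntaxOn
\keys_define:nn { hyp / setup }
  {
    pdfauthor .code:n =
      {
        \tl_if_blank:nTF {#1}
          {
            \pdfmanagement_remove:nn {Info}{Author}
          }
          {
            \pdfstringdef\l_tmpa_tl{#1}
            \pdfmanagement_add:nnx {Info}{Author}{(\l_tmpa_tl)}
          }
      }
  }
\ExplSyntaxOff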