[ptexenc] Inconsistent error message

aminophen commented 5 years ago

See texjporg/platex#84

aminophen commented 5 years ago

the second "ſ" is converted to "^^c5^^bf".

This is reasonable enough, as "ſ" is "0xC5 0xBF" in UTF-8 byte sequence.

The first "ſ" is converted to "顛"

This is also reasonable enough, as "顛" comes from "0xC5BF" of EUC-JP.

However, I don't understand

! Package inputenc Error: Unicode character 顛 (U+C4CF)
(inputenc)                not set up for use with LaTeX.

why inputenc shows "U+C4CF".
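
The two byte-level readings described above can be checked outside TeX. This is a minimal sketch that assumes only Python's standard `utf-8` and `euc_jp` codecs; the variable names are illustrative. It reproduces the "ſ" to "顛" reinterpretation, but it does not explain where "U+C4CF" comes from.

```python
# "ſ" is U+017F (LATIN SMALL LETTER LONG S).
long_s = "\u017f"

# Its UTF-8 encoding is the two bytes 0xC5 0xBF.
utf8_bytes = long_s.encode("utf-8")

# Read as one EUC-JP double-byte character, the same two bytes are a kanji.
as_euc_jp = utf8_bytes.decode("euc_jp")

# The same bytes written in TeX's ^^ notation, one byte at a time.
caret_form = "".join(f"^^{b:02x}" for b in utf8_bytes)

print(" ".join(f"0x{b:02X}" for b in utf8_bytes))
print(caret_form)
print(f"decoded as EUC-JP: U+{ord(as_euc_jp):04X}")
```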

aminophen commented 5 years ago

When I read the comment from @JulienPalard, especially

e-pTeX 3.14159265-p3.8.1-180901-2.6 (utf8.euc) (TeX Live 2019/dev/Debian)
kpathsea version 6.3.1/dev
ptexenc version 1.3.7/dev
(from Debian Buster)

So, what did I get wrong?

It works with:
e-pTeX 3.14159265-p3.7.1-161114-2.6 (utf8.euc) (TeX Live 2017/Debian)
kpathsea version 6.2.3
ptexenc version 1.3.5
from Ubuntu bionic though.

my initial thought was that pTeX's behavior had changed due to #34; however, that turned out to be irrelevant. My guess is that @JulienPalard thought "it worked with TeX Live 2017" because LaTeX ignored the UTF-8 input instead of throwing an error. (FYI, \usepackage[utf8]{inputenc} became the default only in TL2018, according to latex3/latex2e#24.)

However, that does not answer my question: why does inputenc show "U+C4CF"?

JulienPalard commented 5 years ago

The first "ſ" is converted to "顛"

This is also reasonable enough, as "顛" comes from "0xC5BF" of EUC-JP.

I'm not sure how it's reasonable (though I may not have understood your sentence properly). I'm working on a document written in UTF-8 having both CJK characters AND ſ (LATIN SMALL LETTER LONG S) used as an example, along with a kelvin sign and some others.

For reference, it's the PDF version of https://docs.python.org/ja/3/howto/regex.html#compilation-flags, so it's LaTeX automatically generated by Sphinx.

aminophen commented 5 years ago

I'm working on a document written in UTF-8 having both CJK characters AND ſ (LATIN SMALL LETTER LONG S) used as an example, along with a kelvin sign and some others.

For reference, it's the PDF version of https://docs.python.org/ja/3/howto/regex.html#compilation-flags, so it's LaTeX automatically generated by Sphinx.

In practice, you can try uplatex instead of platex: upLaTeX (upTeX) supports Unicode characters natively and is more compatible with the inputenc package. By design, pLaTeX (pTeX) has only limited support for Latin characters.

aminophen commented 5 years ago

The first "ſ" is converted to "顛"

This is also reasonable enough, as "顛" comes from "0xC5BF" of EUC-JP.

I'm not sure how it's reasonable

What I meant by "reasonable" was the following: read charitably, such a conversion could be called a design decision, because the origin of "顛" is easy to guess from the behavior (it is "0xC5BF" in EUC-JP). Of course, I'm not sure whether this is actually intended.

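(The following was originally written in Japanese.)

The open questions now come down to these two:

(1) Why was the "ſ" that \message{ſ} should print to the terminal (the UTF-8 byte sequence "0xC5 0xBF" in the source) converted to the kanji "顛" ("0xC5BF" in EUC-JP)?
(2) Where does the "U+C4CF" in the error "Unicode character 顛 (U+C4CF)", raised when the inputenc package is used, come from?

As for (1), it is not the same phenomenon as the following one, which was fixed in pTeX 3.1.4, but it seems somehow related:

o Fixed: when a character code entered in ^^ notation matched the first byte of a kanji, pTeX tried to combine it with the following character into a kanji.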

h-kitagawa commented 5 years ago

(1) Why was the "ſ" that \message{ſ} should print to the terminal (the UTF-8 byte sequence "0xC5 0xBF" in the source) converted to the kanji "顛" ("0xC5BF" in EUC-JP)?

I checked which arguments the print and print_kanji functions, which are used to stringify (and output) tokens, were called with. The result was:

\message{ſ} % --> <c5><bf>顛
\message{顛} % --> [c5bf]顛

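In other words, print_char(0xc5); print_char(0xbf); is called in both cases.

In pTeX, print_char outputs values of 0x80 and above as they are, without converting them to the ^^c5 form (so that Japanese characters can be output). I wish we could cleanly tell apart a print_char call made to output a Japanese character from one that was not...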

t-tk commented 5 years ago

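I have not verified this, but the phenomenon is probably as follows:

The input 0xC5 0xBF (ſ in UTF-8) is converted to ^^c5^^bf by ptexenc.
Normally, in body text, ^^c5^^bf is converted to LATIN SMALL LETTER LONG S by \usepackage[utf8]{inputenc}.
Inside \message{}, it is output in the form print_char(0xc5); print_char(0xbf);, but ptexenc then converts it from EUC 0xC5BF (顛) to 顛 in UTF-8 before it is printed.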
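
The three steps above can be modeled outside TeX. The sketch below is a toy model, not ptexenc code: `terminal_filter` is a hypothetical stand-in for the output-side EUC→UTF-8 conversion, assumed here to pair up any two consecutive bytes that form a valid EUC-JP character. Under that assumption, the bytes emitted for \message{ſ} cannot be told apart from those emitted for \message{顛}.

```python
def terminal_filter(stream: bytes) -> str:
    """Hypothetical stand-in for the output-side EUC-JP -> UTF-8 conversion.

    Assumption: two consecutive bytes in 0xA1..0xFE are decoded as one EUC-JP
    character; any other byte >= 0x80 is shown in ^^xx notation.
    """
    out = []
    i = 0
    while i < len(stream):
        pair = stream[i:i + 2]
        if len(pair) == 2 and all(0xA1 <= x <= 0xFE for x in pair):
            try:
                out.append(pair.decode("euc_jp"))
                i += 2
                continue
            except UnicodeDecodeError:
                pass  # unassigned EUC-JP code; fall through to single-byte handling
        b = stream[i]
        out.append(chr(b) if b < 0x80 else f"^^{b:02x}")
        i += 1
    return "".join(out)


# Bytes passed to print_char for \message{ſ}: the UTF-8 bytes of U+017F.
from_long_s = "\u017f".encode("utf-8")
# Bytes passed to print_char for \message{顛}: the EUC-JP code 0xC5BF.
from_kanji = "顛".encode("euc_jp")

# Both byte streams are identical, so the filter renders them identically.
print(from_long_s == from_kanji)
print(terminal_filter(from_long_s) == terminal_filter(from_kanji))
```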

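One possible fix, off the top of my head: for an 8-bit byte sequence, call print_char(0xc5); print_char(0xbf); and do not perform the EUC→UTF-8 conversion in ptexenc. For a Japanese character in EUC, on the other hand, call print_kchar(0xc5bf); and do perform the EUC→UTF-8 conversion in ptexenc.

Another option: distinguish "print_char called for an 8-bit byte sequence" from "print_char called to output a Japanese character" with some kind of flag, and use it to control whether ptexenc performs the EUC→UTF-8 conversion.

The former is the more orthodox approach, but it would probably require more changes. The latter would require fewer changes, but it may be a cheap hack. I wonder whether it would work.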
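
A toy model of the flag idea, to make the trade-off concrete. Everything here is hypothetical: the `Output` class, the `kanji` keyword, and the helpers `message_bytes` and `message_kanji` are illustrations and not the pTeX or ptexenc API. The sketch only shows that recording the caller's intent alongside each byte is enough to keep the two cases apart.

```python
from typing import List, Tuple


class Output:
    """Collects (byte, is_kanji) pairs; the flag records the caller's intent."""

    def __init__(self) -> None:
        self.buf: List[Tuple[int, bool]] = []

    def print_char(self, byte: int, kanji: bool = False) -> None:
        self.buf.append((byte, kanji))

    def render(self) -> str:
        out = []
        i = 0
        while i < len(self.buf):
            byte, kanji = self.buf[i]
            if kanji:
                # Assumption: flagged bytes always arrive as a two-byte EUC-JP pair.
                pair = bytes([byte, self.buf[i + 1][0]])
                out.append(pair.decode("euc_jp"))
                i += 2
            else:
                # Unflagged bytes are never merged into a kanji.
                out.append(chr(byte) if byte < 0x80 else f"^^{byte:02x}")
                i += 1
        return "".join(out)


def message_bytes(raw: bytes) -> str:
    """Model of printing an 8-bit byte sequence (no conversion)."""
    out = Output()
    for b in raw:
        out.print_char(b)
    return out.render()


def message_kanji(euc_code: int) -> str:
    """Model of printing one Japanese character given its EUC-JP code."""
    out = Output()
    out.print_char(euc_code >> 8, kanji=True)
    out.print_char(euc_code & 0xFF, kanji=True)
    return out.render()


print(message_bytes(b"\xc5\xbf"))
print(message_kanji(0xC5BF) == "顛")
```

In this model the first proposal (a separate print_kchar) corresponds to `message_kanji` being its own entry point, while the second corresponds to the `kanji` keyword on `print_char`.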

JulienPalard commented 5 years ago

@aminophen Thanks for recommending uplatex; I had not heard of it before.

Is it possible to use it with Sphinx? I don't see it among the allowed values of latex_engine, and if I try it anyway I get an error:

! LaTeX Error: This file needs format `pLaTeX2e'
               but this is `LaTeX2e'.

aminophen commented 5 years ago

@JulienPalard I've never used Sphinx, but it seems uplatex is not currently supported, according to https://github.com/sphinx-doc/sphinx/issues/4186. You can join the discussion there, and you may find some information on how to add uplatex.

aminophen commented 2 years ago

See https://github.com/texjporg/tex-jp-build/issues/81; hopefully fixed in r61692.

texjporg / tex-jp-build

[ptexenc] Inconsistent error message #80