Closed by aminophen 2 years ago
the second "ſ" is converted to "^^c5^^bf".
This is reasonable enough, as "ſ" is "0xC5 0xBF" in UTF-8 byte sequence.
The first "ſ" is converted to "顛"
This is also reasonable enough, as "顛" comes from "0xC5BF" of EUC-JP.
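The byte-level coincidence described above can be checked outside TeX. The following is a minimal sketch in Python, not anything pTeX or ptexenc actually runs; the helper names `utf8_caret_notation` and `reinterpret_utf8_as_euc_jp` are made up for illustration, and it assumes Python's built-in `euc_jp` codec agrees with the table ptexenc uses for this code point.

```python
def utf8_caret_notation(text: str) -> str:
    """Render each UTF-8 byte of `text` in TeX's ^^xx notation (lowercase hex)."""
    return "".join(f"^^{byte:02x}" for byte in text.encode("utf-8"))


def reinterpret_utf8_as_euc_jp(text: str) -> str:
    """Take the UTF-8 bytes of `text` and decode them as if they were EUC-JP."""
    return text.encode("utf-8").decode("euc_jp")


long_s = "\u017f"  # ſ, LATIN SMALL LETTER LONG S

# The two UTF-8 bytes of ſ, written the way the second ſ was shown.
print(utf8_caret_notation(long_s))

# The same two bytes read as one EUC-JP double-byte character,
# which is how the first ſ was shown.
print(reinterpret_utf8_as_euc_jp(long_s))
```

So both outputs in the report come from the same two bytes; the only difference is whether they are treated as two separate 8-bit characters or as one EUC-JP kanji.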
However, I don't understand
! Package inputenc Error: Unicode character 顛 (U+C4CF)
(inputenc) not set up for use with LaTeX.
why inputenc shows "U+C4CF".
When I read the comment from @JulienPalard, especially
e-pTeX 3.14159265-p3.8.1-180901-2.6 (utf8.euc) (TeX Live 2019/dev/Debian) kpathsea version 6.3.1/dev ptexenc version 1.3.7/dev (from Debian Buster) So, what did I get wrong?
It works with: e-pTeX 3.14159265-p3.7.1-161114-2.6 (utf8.euc) (TeX Live 2017/Debian) kpathsea version 6.2.3 ptexenc version 1.3.5 from Ubuntu bionic though.
my initial thought was that pTeX's behavior had changed due to #34; however, that turned out to be irrelevant. My guess is that @JulienPalard thought "it worked with TeX Live 2017" because LaTeX ignored the UTF-8 input instead of throwing an error. (FYI, \usepackage[utf8]{inputenc} only became the default in TL2018, according to latex3/latex2e#24.)
However, that does not answer my question: why does inputenc show "U+C4CF"?
The first "ſ" is converted to "顛"
This is also reasonable enough, as "顛" comes from "0xC5BF" of EUC-JP.
I'm not sure how it's reasonable (though I may not have understood your sentence properly). I'm working on a document written in UTF-8 that contains both CJK characters and ſ (LATIN SMALL LETTER LONG S), used as an example, along with a Kelvin sign and some others.
For reference, it's the PDF version of https://docs.python.org/ja/3/howto/regex.html#compilation-flags, so the LaTeX is generated automatically by Sphinx.
In practice, you can try uplatex instead of platex; upLaTeX (upTeX) supports native Unicode characters and is more compatible with the inputenc package. By design, pLaTeX (pTeX) has limited support for Latin characters.
The first "ſ" is converted to "顛"
This is also reasonable enough, as "顛" comes from "0xC5BF" of EUC-JP.
I'm not sure how it's reasonable
What I meant by "reasonable" was the following: read charitably, such a conversion could be said to be by design, because the origin of "顛" can easily be guessed from the behavior ("0xC5BF" in EUC-JP). Of course, I'm not sure this is actually intended.
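Now, from here on I will write in Japanese.
The open questions have come down to the following two.
(1) Why was the "ſ" that \message{ſ} should print to the terminal (the UTF-8 byte sequence "0xC5 0xBF" in the source) converted to the kanji "顛" ("0xC5BF" in EUC-JP)?
Question (1) is not the same phenomenon as the one fixed in pTeX 3.1.4,
o Fixed: when a character code entered in ^^ notation corresponded to the first byte of a kanji, it was wrongly combined with the following character into a kanji.
but it has a somewhat similar smell.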
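(1) Why was the "ſ" that \message{ſ} should print to the terminal (the UTF-8 byte sequence "0xC5 0xBF" in the source) converted to the kanji "顛" ("0xC5BF" in EUC-JP)?
I checked which arguments the print and print_kanji functions, which are used to stringify (and output) tokens, were called with. The result was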
\message{ſ} % --> <c5><bf>顛
\message{顛} % --> [c5bf]顛
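so in both cases print_char(0xc5); print_char(0xbf); turned out to be called.
In pTeX, a value of 0x80 or above passed to print_char is output as-is rather than in the ^^c5 form (this is needed for Japanese character output). I wish there were a clean way to tell "print_char called for Japanese character output" apart from every other call...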
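I have not verified this, but what probably happens is that the output is produced as print_char(0xc5); print_char(0xbf); and ptexenc then converts EUC 0xC5BF (顛) to UTF-8 顛 before writing it out. One off-the-cuff idea for a fix:
For an 8-bit byte sequence, call print_char(0xc5); print_char(0xbf); and skip the EUC→UTF-8 conversion in ptexenc. For a Japanese character in EUC, on the other hand, call print_kchar(0xc5bf); and apply the EUC→UTF-8 conversion in ptexenc.
Another idea:
Distinguish "print_char called for an 8-bit byte sequence" from "print_char called for Japanese character output" with some kind of flag, and use it to control whether ptexenc applies the EUC→UTF-8 conversion.
The former is the more orthodox approach but would probably require more changes. The latter would probably require fewer changes but may be a cheap hack. Would either of them work?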
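To make the difference between the observed behavior and the first proposal concrete, here is a toy model in Python. It is only a sketch under stated assumptions: `ToyPrinter` and its methods are hypothetical stand-ins that borrow the names print_char and print_kchar from the discussion, Python's `euc_jp` codec stands in for the EUC→UTF-8 step in ptexenc, and none of this reflects the actual pTeX or ptexenc source.

```python
from typing import List, Tuple

# (raw bytes, True if the bytes were emitted as one Japanese character)
Chunk = Tuple[bytes, bool]


class ToyPrinter:
    """Hypothetical stand-in for the output path; not pTeX's real code."""

    def __init__(self) -> None:
        self.chunks: List[Chunk] = []

    def print_char(self, byte: int) -> None:
        # A plain 8-bit byte: recorded as "not kanji".
        self.chunks.append((bytes([byte]), False))

    def print_kchar(self, code: int) -> None:
        # A two-byte EUC-JP character such as 0xC5BF: recorded as "kanji".
        self.chunks.append((code.to_bytes(2, "big"), True))

    def terminal_without_flag(self) -> str:
        # Models the suspected current behavior: the origin of each byte is
        # forgotten and the whole stream is converted as EUC-JP.
        stream = b"".join(raw for raw, _ in self.chunks)
        return stream.decode("euc_jp")

    def terminal_with_flag(self) -> str:
        # Models the proposal: convert only chunks emitted as kanji and show
        # other bytes >= 0x80 in ^^xx notation (ASCII passes through as-is).
        out = []
        for raw, is_kanji in self.chunks:
            if is_kanji:
                out.append(raw.decode("euc_jp"))
            else:
                byte = raw[0]
                out.append(chr(byte) if byte < 0x80 else f"^^{byte:02x}")
        return "".join(out)


printer = ToyPrinter()
printer.print_char(0xC5)     # the two UTF-8 bytes of ſ
printer.print_char(0xBF)
printer.print_kchar(0xC5BF)  # a genuine 顛 in EUC-JP
print(printer.terminal_without_flag())
print(printer.terminal_with_flag())
```

In this model the unflagged path shows the kanji twice, while the flagged path keeps the two stray bytes separate from the real kanji. Whether the information is carried by a separate function or by a flag on print_char is exactly the trade-off between the two proposals above.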
@aminophen Thanks for recommending uplatex; I had not heard of it before.
Is it possible to use it with Sphinx? I don't see it among the allowed values of latex_engine, and if I try it anyway I get an error:
! LaTeX Error: This file needs format `pLaTeX2e'
but this is `LaTeX2e'.
@JulienPalard I've never used Sphinx, but it seems uplatex is not currently supported, according to https://github.com/sphinx-doc/sphinx/issues/4186. You can join the discussion there, and you may get some information on how to add uplatex support.
See https://github.com/texjporg/tex-jp-build/issues/81: hopefully fixed in r61692.
See texjporg/platex#84