t-tk / upmendex-package

Source/Document distribution of upmendex --- multilingual index processor

treatment by Unicode character types #8

Closed t-tk closed 3 years ago

t-tk commented 3 years ago

Conventional upmendex (v0.59 or earlier) classifies characters into several script categories using a hard-coded table based on Unicode Blocks, and switches the procedure for converting input strings into the strings passed to the ICU collator accordingly. This approach does not suit the many numbers, symbols, and punctuation marks in the very large Unicode character set: characters that are not listed in the hard-coded table are treated as unknown and are not supported by upmendex. Supporting them by hard-coding such a huge number of characters one by one is not realistic.

Therefore, I am now planning to introduce a check based on character type, that is, the Unicode General Category, together with the Unicode Script Property.

References:

- Unicode Blocks
- Unicode Script Property
- Unicode Script Property Data File Scripts.txt (latest)
- Unicode General Category Value
- ICU u_charType()
- ICU enum UCharCategory

| upmendex | charType | example | procedure |
|---|---|---|---|
| Latin | Lu, Ll, Lo etc. | ABCabcAaⓐⒷ | Direct |
| Greek | Lu, Ll, Lo etc. | ΑΒΓαβγ | Direct |
| Cyrillic | Lu, Ll, Lo etc. | АБВабв | Direct |
| Thai | Lo etc. | กขฃคฅฆ | Direct |
| Devanagari | Lo etc. | अइउऋऌए | Direct |
| Kana | Lo etc. | あいうアイウア㋐㌀ | Direct |
| Hangul | Lo etc. | 가나다ᄁ㉡㉰㉼ | Direct |
| Hanzi | Lo etc. | 花鳥風月 | Look up dictionary |
| Number | Nd: decimal digit number | 01212๑๒१२ | Direct |
| Number | No: other number | ¹₂③❹➄➏🄈⑻⒐ | Look up dictionary |
| Symbol | Sk: modifier symbol | ¨¯´˯˳꭛^`゛゜ | Direct |
| Symbol | Sm: math symbol | ÷⁺℘⅀↠▷♯ | Look up dictionary |
| Symbol | So: other symbol | ☃☎♥⚽☺ | Look up dictionary |
| Symbol | Sc: currency symbol | €$$¢¢££¥¥ | Look up dictionary |
| Symbol | Lm: modifier letter | ⸯ〱ーꞈ | Direct |
| Symbol | Pd: dash punctuation | ‐—― | Direct |
| Symbol | Ps: start punctuation | ‚⁅ | Direct |
| Symbol | Pe: end punctuation | | Direct |
| Symbol | Pc: connector punctuation | ‿⁀⁔ | Direct |
| Symbol | Po: other punctuation | ⁇⁈⁉¡¿†#%*§¶ | Direct |
| Symbol | Pi: initial punctuation | ⸂⸄ | Direct |
| Symbol | Pf: final punctuation | ⸃⸅ | Direct |
| Symbol | Mn: non spacing mark | ◌꙯ | Direct |
| Symbol | Me: enclosing mark | ҈ | Direct |
| Symbol | Mc: combining spacing mark | 𝅦 ◌𝅩 | Direct |
| Unknown | Cc: control character | | Ignore |
| Unknown | Cf: format character | | Direct |
| Symbol | Others. Not supported by upmendex | | Look up dictionary or switch by -f option |

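
The table above can be read as a dispatch on General Category. The following is a minimal sketch of that idea, not upmendex's actual code. It uses Python's standard `unicodedata.category()` as a stand-in for ICU `u_charType()`, and the names `procedure_for` and `PROCEDURE_BY_CATEGORY` are hypothetical. Letters (Lu, Ll, Lt, Lo) are only flagged as script-dependent, because their procedure in the table depends on the script (for example Hanzi versus Latin) and the Python standard library does not expose the Script property.

```python
import unicodedata

DIRECT = "Direct"
LOOKUP = "Look up dictionary"
IGNORE = "Ignore"
BY_SCRIPT = "Depends on script"  # hypothetical label: letters need the Script property
OTHER = "Look up dictionary or switch by -f option"

# Procedure per General Category, transcribed from the non-letter rows of the table.
PROCEDURE_BY_CATEGORY = {
    "Nd": DIRECT, "No": LOOKUP,
    "Sk": DIRECT, "Sm": LOOKUP, "So": LOOKUP, "Sc": LOOKUP,
    "Lm": DIRECT,
    "Pd": DIRECT, "Ps": DIRECT, "Pe": DIRECT, "Pc": DIRECT,
    "Po": DIRECT, "Pi": DIRECT, "Pf": DIRECT,
    "Mn": DIRECT, "Me": DIRECT, "Mc": DIRECT,
    "Cc": IGNORE, "Cf": DIRECT,
}


def procedure_for(ch: str) -> str:
    """Return the table's procedure for a single character (sketch only)."""
    category = unicodedata.category(ch)
    if category in ("Lu", "Ll", "Lt", "Lo"):
        # The table decides letters by script, which this sketch does not resolve.
        return BY_SCRIPT
    # Categories missing from the table fall into the "Others" row.
    return PROCEDURE_BY_CATEGORY.get(category, OTHER)


if __name__ == "__main__":
    for ch in "1③÷€☃—花":
        print(ch, unicodedata.category(ch), procedure_for(ch))
```

Driving the dispatch from the category rather than from a per-character table is what lets characters absent from any hard-coded list still receive a defined procedure.
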
t-tk commented 3 years ago

I have committed this to TeX Live svn r60856. https://github.com/TeX-Live/texlive-source/commit/15f827c5acfb2fcd13064b6b4d2fb7f6bca7c7e0

t-tk commented 1 year ago

Related topic: https://okumuralab.org/tex/mod/forum/discuss.php?d=3512