Conventional upmendex (v0.59 or earlier) classifies characters into several script categories using a hard-coded table based on Unicode Blocks, and switches the procedure that converts input strings into the strings passed to the ICU collator according to that category.
This approach does not handle some numbers, symbols, and punctuation marks properly across the very wide Unicode character set:
characters that are not listed in the hard-coded table are treated as unknown and are not supported by upmendex.
However, supporting those characters by hard-coding a huge number of them one by one is not realistic.
Therefore, I am now planning to introduce a check based on character type (the Unicode General Category) together with the Unicode Script Property.
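As a rough illustration of classifying by General Category rather than by a hand-maintained table, the sketch below maps each character to a coarse type from the first letter of its General Category value. It is only a sketch: it uses Python's standard unicodedata module as a stand-in for ICU u_charType(), and the function name coarse_type and the type labels are hypothetical, not taken from upmendex.

```python
import unicodedata

# Hypothetical coarse types keyed by the major class (first letter)
# of the Unicode General Category, e.g. "Lu" -> "L", "Nd" -> "N".
_MAJOR_CLASS = {
    "L": "letter",
    "M": "mark",
    "N": "number",
    "P": "punctuation",
    "S": "symbol",
    "Z": "separator",
}


def coarse_type(ch: str) -> str:
    """Return a coarse character type derived from the General Category."""
    major = unicodedata.category(ch)[0]
    # "C*" categories (control, format, unassigned, ...) fall back to "other".
    return _MAJOR_CLASS.get(major, "other")
```

With this kind of lookup, a character such as U+2167 (ROMAN NUMERAL EIGHT, General Category Nl) is recognized as a number without having to be listed individually.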
Direct: Input strings are passed directly to the ICU collator.
Look up dictionary: Dictionaries are looked up first. If the string is not found, it is passed directly to the ICU collator.
Look up dictionary or switch by -f option: Dictionaries are looked up first. If the string is not found, the command-line -f option determines whether it is passed directly to the ICU collator or the unsupported characters are ignored.
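The three procedures above can be sketched as a single dispatch function. This is a minimal sketch under stated assumptions, not the upmendex implementation: the names Procedure and convert, the plain mapping used as a dictionary, and the force flag standing in for the -f option are all hypothetical. It also assumes that giving -f means "pass directly to the collator" and omitting it means "ignore", which the description above does not spell out.

```python
from enum import Enum
from typing import Mapping, Optional


class Procedure(Enum):
    DIRECT = "direct"
    LOOKUP = "look up dictionary"
    LOOKUP_OR_FORCE = "look up dictionary or switch by -f option"


def convert(text: str, procedure: Procedure,
            dictionary: Mapping[str, str], force: bool) -> Optional[str]:
    """Return the string to hand to the collator, or None if it is ignored."""
    if procedure is Procedure.DIRECT:
        return text  # no dictionary lookup at all

    reading = dictionary.get(text)
    if reading is not None:
        return reading  # dictionary hit: use the registered reading

    if procedure is Procedure.LOOKUP:
        return text  # not found: fall back to the raw input

    # LOOKUP_OR_FORCE and not found: the -f flag decides (assumed semantics).
    return text if force else None
```

For example, convert("漢字", Procedure.LOOKUP_OR_FORCE, {}, force=False) returns None under these assumptions, while the same call with force=True returns the input unchanged.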
References: Unicode Blocks; Unicode Script Property; Unicode Script Property Data File Script.txt latest; Unicode General Category Value; ICU u_charType(); ICU enum UCharCategory