k-takata / Onigmo

Onigmo is a regular expressions library forked from Oniguruma.

Support Unicode script extensions #155

Open 747 opened 3 years ago

747 commented 3 years ago

Is there any plan to support the script extensions (scx) property, which allows a character to belong to more than one script? It is already available in many dynamic languages such as Perl, PHP, Python, and (recently) JavaScript, and it would make script matching much more useful on real-world text.

For example, in JS after ES2018:

// match by script (= Ruby /[\p{Hani}\p{Hira}\p{Kana}]+/)
"ア行〜タ行のデータ".match(/[\p{sc=Hani}\p{sc=Hira}\p{sc=Kana}]+/gu);
// => [ "ア行", "タ行のデ", "タ" ]

// match by script_extensions
"ア行〜タ行のデータ".match(/[\p{scx=Hani}\p{scx=Hira}\p{scx=Kana}]+/gu);
// => [ "ア行〜タ行のデータ" ]
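The gap between the two results comes down to a couple of characters whose Script property is Common but whose Script_Extensions list includes the Japanese scripts. A minimal sketch that makes this visible, assuming an engine with ES2018 Unicode property escapes (the helper name `scriptMembership` is made up for illustration, and the wave dash is written as `\u301C` to avoid confusion with the fullwidth tilde):

```javascript
// Hypothetical helper: for each character, report whether it matches the
// Han/Hiragana/Katakana class via Script (sc) and via Script_Extensions (scx).
function scriptMembership(text) {
  // No g/y flags, so test() keeps no state between calls.
  const sc = /[\p{sc=Hani}\p{sc=Hira}\p{sc=Kana}]/u;
  const scx = /[\p{scx=Hani}\p{scx=Hira}\p{scx=Kana}]/u;
  return Array.from(text, (ch) => ({ ch, sc: sc.test(ch), scx: scx.test(ch) }));
}

// Same sample string as above; \u301C is the wave dash between the two halves.
const sample = "ア行\u301Cタ行のデータ";

// Characters that are matched only through Script_Extensions.
const onlyScx = scriptMembership(sample)
  .filter((r) => r.scx && !r.sc)
  .map((r) => r.ch);

console.log(onlyScx); // the wave dash (U+301C) and the long vowel mark (U+30FC)
```

These two characters are exactly where the sc-based match breaks the string apart.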

While it is not a silver bullet, given how complicated Unicode is, it would avoid most of the common pitfalls in Unicode script matching. Manually reproducing the equivalent of an scx property with the plain script property often results in a non-trivial expression:

# implement \p{scx=Hira} equivalent
/[\p{Hira}、-〃〈-】〓-〟〰-〵〷〼〽\u3099-゜゠・ー﹅﹆。-・ー゙゚]/

Sorry if this has already been discussed somewhere; I couldn't find a relevant issue in this repository.