messense / jieba-rs

The Jieba Chinese Word Segmentation Implemented in Rust
MIT License
756 stars, 47 forks

Come up with a way to handle extended grapheme clusters #24

Open MnO2 opened 5 years ago

MnO2 commented 5 years ago

Examples here: https://developer.apple.com/swift/blog/?id=30

"abcde\u{0301}\u{1100}\u{1161}\u{AC00}" should not be segmented into "abcde" and "\u{0301}\u{1100}\u{1161}\u{AC00}". The combining acute accent U+0301 belongs to the preceding "e", so "e\u{0301}" should stay together.
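To make the failure mode concrete, here is a minimal std-only sketch. It does not call jieba-rs; `split_after_chars` is a hypothetical helper that cuts a string after a given number of `char`s, which is roughly what any segmentation that only looks at scalar values can end up doing.

```rust
/// Hypothetical helper (not part of jieba-rs): splits `s` after its first
/// `n` Unicode scalar values, with no awareness of grapheme clusters.
fn split_after_chars(s: &str, n: usize) -> (&str, &str) {
    // Byte offset of the (n+1)-th char, or the end of the string if it is shorter.
    let idx = s.char_indices().nth(n).map(|(i, _)| i).unwrap_or(s.len());
    s.split_at(idx)
}
```

Calling `split_after_chars("abcde\u{0301}\u{1100}\u{1161}\u{AC00}", 5)` yields "abcde" on the left, and the right half starts with the orphaned U+0301. The cut is a valid `char` boundary, so Rust accepts it, yet it falls in the middle of the user-perceived character "é".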

MnO2 commented 5 years ago

https://unicode-rs.github.io/unicode-segmentation/unicode_segmentation/struct.Graphemes.html The Graphemes iterator from unicode-segmentation could be considered. However, it is only required if the behaviour of re_han causes SplitMatch to produce incorrect segmentation.
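As a rough illustration of what cluster-aware splitting means, the std-only sketch below keeps combining diacritical marks (U+0300 to U+036F only) attached to the character before them. This is a deliberate simplification and not the UAX #29 algorithm that unicode-segmentation implements; the function names are hypothetical and nothing here reflects how jieba-rs works internally.

```rust
/// Assumption for this sketch: only the Combining Diacritical Marks block
/// (U+0300..=U+036F) is treated as "attaches to the previous char".
fn is_combining_diacritic(c: char) -> bool {
    ('\u{0300}'..='\u{036F}').contains(&c)
}

/// Hypothetical helper: splits `s` into pieces such that a combining
/// diacritic is never separated from the character it follows.
fn group_combining_marks(s: &str) -> Vec<&str> {
    let mut clusters = Vec::new();
    let mut start = 0;
    for (i, c) in s.char_indices() {
        // Start a new piece at every char that is not a combining mark.
        if i > start && !is_combining_diacritic(c) {
            clusters.push(&s[start..i]);
            start = i;
        }
    }
    if start < s.len() {
        clusters.push(&s[start..]);
    }
    clusters
}
```

On the string from the issue this keeps "e\u{0301}" as one piece, but it still separates the conjoining jamo U+1100 and U+1161, which form a single extended grapheme cluster under UAX #29. Handling cases like that is the reason to reach for a full implementation such as unicode-segmentation rather than a hand-rolled range check.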