messense / jieba-rs

The Jieba Chinese Word Segmentation Implemented in Rust
MIT License

The current implementation only considers the BMP for Han; we should consider the ranges defined by Unicode 10.0 #20

Closed (MnO2 closed this issue 5 years ago)

MnO2 commented 5 years ago

https://en.wikipedia.org/wiki/CJK_Unified_Ideographs

I checked, and the Python implementation makes the same wrong assumption. There is a long-pending issue about it there that has seen no progress.
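To make the request concrete, here is a rough sketch of a Han check that also covers the supplementary-plane extensions. It is not the code from the crate or from any pull request; the names `HAN_RANGES` and `is_han` are hypothetical, and the ranges are the CJK Unified Ideographs block boundaries as I understand them for Unicode 10.0 (the base block plus Extensions A through F). Compatibility ideographs and radicals are deliberately left out.

```rust
/// Inclusive block ranges for CJK Unified Ideographs as of Unicode 10.0.
/// Illustrative only: this table is an assumption for the sketch, not the
/// table used by jieba-rs.
const HAN_RANGES: [(u32, u32); 7] = [
    (0x4E00, 0x9FFF),   // CJK Unified Ideographs (BMP)
    (0x3400, 0x4DBF),   // Extension A (BMP)
    (0x20000, 0x2A6DF), // Extension B (supplementary plane)
    (0x2A700, 0x2B73F), // Extension C
    (0x2B740, 0x2B81F), // Extension D
    (0x2B820, 0x2CEAF), // Extension E
    (0x2CEB0, 0x2EBEF), // Extension F (added in Unicode 10.0)
];

/// Returns true if `c` falls inside any of the blocks listed above.
/// A Rust `char` is a full Unicode scalar value, so code points beyond
/// U+FFFF need no surrogate handling here.
pub fn is_han(c: char) -> bool {
    let cp = c as u32;
    HAN_RANGES.iter().any(|&(lo, hi)| lo <= cp && cp <= hi)
}
```

A check limited to the BMP would reject a character such as `'\u{20000}'` (the first code point of Extension B), while `is_han` above accepts it. If the segmenter matches Han runs with a regex instead, the same ranges could be expressed as a character class.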

MnO2 commented 5 years ago

Pull Request: https://github.com/messense/jieba-rs/pull/25