Open kirawi opened 3 months ago
There is also the ICU segmenter, but it may be more than desired. It also lacks the ability to load non-contiguous text. https://docs.rs/icu_segmenter/latest/icu_segmenter/
We could also fork unicode-segmentation and add support for ropey directly.
I was curious to see how difficult it would be, so I poked at the ICU word segmenter. Getting it to compile turned out to be mostly straightforward. This was just a hack job and could be cleaned up. https://github.com/unicode-org/icu4x/compare/main...RossSmyth:icu4x:Ropey
I did not test anything but it does build.
With the above it should work something like (all untested):
```rust
use icu::segmenter::WordSegmenter;
use ropey::Rope;

let segmenter = WordSegmenter::new_auto();
let rope = Rope::from_str("Hello World");

let breakpoints: Vec<usize> =
    segmenter.segment_str(rope.chars()).collect();
assert_eq!(&breakpoints, &[0, 5, 6, 11]);

let breakpoints: Vec<usize> =
    segmenter.segment_str(rope.chars_at(6)).collect();
assert_eq!(&breakpoints, &[0, 5]);
```
I don't think we want to pull in icu because of how many dependencies it brings in.
@pascalkuthe @the-mikedavis I took a look at creating a fork, and it looks feasible to have the segmentation code accept a trait instead of only a string. Do we care about making it generic, or would it be fine to forgo that idea and simply switch from `str` to `RopeSlice`?
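To make the generic option concrete, here is a rough sketch of what "a trait instead of just a string" could look like. Everything in it is hypothetical: `TextSource`, `chunks`, and `class_breaks` are made-up names, a slice of `&str` stands in for a rope's chunks so the example has no dependencies, and the boundary rule (break wherever whitespace and non-whitespace meet) is a deliberately simplified placeholder, not the UAX #29 rules that unicode-segmentation implements.

```rust
/// Hypothetical abstraction over text that may be stored in several pieces.
trait TextSource {
    /// Yields the text as consecutive chunks, in order.
    fn chunks(&self) -> Box<dyn Iterator<Item = &str> + '_>;
}

/// Contiguous text is a single chunk.
impl TextSource for str {
    fn chunks(&self) -> Box<dyn Iterator<Item = &str> + '_> {
        Box::new(std::iter::once(self))
    }
}

/// Stand-in for a rope: pieces that are not contiguous in memory.
impl TextSource for [&str] {
    fn chunks(&self) -> Box<dyn Iterator<Item = &str> + '_> {
        Box::new(self.iter().copied())
    }
}

/// Returns byte offsets (relative to the whole text) at which the
/// whitespace/non-whitespace class changes, plus the start and end.
/// Simplified rule for illustration only; this is not UAX #29.
fn class_breaks<T: TextSource + ?Sized>(text: &T) -> Vec<usize> {
    let mut breaks = Vec::new();
    let mut offset = 0; // byte offset of the current char in the whole text
    let mut prev: Option<bool> = None; // class of the previous char, if any

    for chunk in text.chunks() {
        for ch in chunk.chars() {
            let is_space = ch.is_whitespace();
            if prev != Some(is_space) {
                breaks.push(offset);
            }
            prev = Some(is_space);
            offset += ch.len_utf8();
        }
    }
    if offset > 0 {
        breaks.push(offset); // end of text
    }
    breaks
}
```

Because the state (`prev`, `offset`) is carried across chunks, the result does not depend on where the chunk boundaries fall, which is the property a rope-aware segmenter would need. Switching directly to `RopeSlice` would amount to dropping the trait and hard-coding the second impl.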
I think it would be good to have better support for languages where the concepts of sentence and word boundaries differ. Implementing word-based movements equivalent to the current ones would be helpful. For example, in あいうアイウ, あいう and アイウ are expected to be treated as separate words. Implementing this would likely require implementing the word segmentation algorithm in Helix itself, since `unicode-segmentation` does not offer a way to run it on non-contiguous text.
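As an illustration of the あいうアイウ case, here is a minimal sketch that splits words wherever the character class changes. It is an assumption-laden toy, not a proposed implementation: `CharClass`, `classify`, and `split_words` are hypothetical names, the classes are taken from the Hiragana (U+3040..U+309F) and Katakana (U+30A0..U+30FF) Unicode blocks only, and it is not the UAX #29 algorithm. It takes any `char` iterator, so something like `rope.chars()` could in principle be passed in without making the text contiguous.

```rust
/// Coarse character classes used to decide where one word ends.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum CharClass {
    Whitespace,
    Hiragana,
    Katakana,
    Other,
}

/// Classifies by Unicode block only (an assumption made for brevity).
fn classify(ch: char) -> CharClass {
    match ch {
        c if c.is_whitespace() => CharClass::Whitespace,
        '\u{3040}'..='\u{309F}' => CharClass::Hiragana,
        '\u{30A0}'..='\u{30FF}' => CharClass::Katakana,
        _ => CharClass::Other,
    }
}

/// Groups consecutive characters of the same class into words and
/// drops whitespace. Works on any char iterator, contiguous or not.
fn split_words(chars: impl Iterator<Item = char>) -> Vec<String> {
    let mut words: Vec<String> = Vec::new();
    let mut prev: Option<CharClass> = None;

    for ch in chars {
        let class = classify(ch);
        if class == CharClass::Whitespace {
            // Whitespace ends the current word and is not kept.
            prev = Some(class);
            continue;
        }
        if prev == Some(class) {
            // Same class as the previous char: extend the current word.
            if let Some(word) = words.last_mut() {
                word.push(ch);
            }
        } else {
            // Class changed (or first char): start a new word.
            words.push(ch.to_string());
        }
        prev = Some(class);
    }
    words
}
```

Feeding it the chunks "あい", "うアイ", "ウ" as one flattened `char` stream yields the two words あいう and アイウ, regardless of how the text was chunked. A real implementation would need proper script data and handling for characters shared between scripts, such as the prolonged sound mark ー.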