helix-editor / helix

A post-modern modal text editor.
https://helix-editor.com
Mozilla Public License 2.0

Implement movement based on unicode segmentation #11423

Open kirawi opened 3 months ago

kirawi commented 3 months ago

I think it would be good to have better support for languages in which the concepts of sentence and word boundaries differ. Implementing word-based movements equivalent to the current ones would be helpful. For example, in あいうアイウ, the expected behavior is that あいう and アイウ are treated as separate words. Implementing this would likely require implementing the word segmentation algorithm in Helix itself, since unicode-segmentation does not offer a way to run it on non-contiguous text.
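
To make the non-contiguous requirement concrete, here is a minimal sketch that finds word boundaries across a sequence of string chunks, such as an iterator like ropey's `Rope::chunks` would yield. It is not the UAX #29 algorithm: it only breaks where a coarse character class changes, and `Class`, `classify`, and `word_boundaries` are hypothetical names made up for illustration.

```rust
/// Coarse character classes; a simplified stand-in for the Unicode
/// word-break properties (assumption: script ranges are enough here).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Class {
    Hiragana,
    Katakana,
    Whitespace,
    Other,
}

/// Classify a char by its Unicode block (Hiragana, Katakana) or whitespace.
fn classify(c: char) -> Class {
    match c {
        '\u{3041}'..='\u{309F}' => Class::Hiragana,
        '\u{30A0}'..='\u{30FF}' => Class::Katakana,
        c if c.is_whitespace() => Class::Whitespace,
        _ => Class::Other,
    }
}

/// Returns the char offsets where a run of one class starts, plus the end
/// of the text. State is carried across chunks, so a chunk edge in the
/// middle of a word does not produce a boundary.
fn word_boundaries<'a>(chunks: impl IntoIterator<Item = &'a str>) -> Vec<usize> {
    let mut boundaries = vec![0];
    let mut prev: Option<Class> = None;
    let mut offset = 0;
    for chunk in chunks {
        for c in chunk.chars() {
            let class = classify(c);
            if let Some(p) = prev {
                if p != class {
                    boundaries.push(offset);
                }
            }
            prev = Some(class);
            offset += 1;
        }
    }
    if offset > 0 {
        boundaries.push(offset);
    }
    boundaries
}
```

Calling `word_boundaries(["あい", "うアイ", "ウ"])` yields `[0, 3, 6]`, i.e. あいう and アイウ come out as separate words even though the chunk edges fall inside both of them. A real implementation would replace `classify` and the comparison with the full word-break rule table.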

RossSmyth commented 3 months ago

There is also the ICU segmenter, but it may be more than desired. It also lacks the ability to load non-contiguous text. https://docs.rs/icu_segmenter/latest/icu_segmenter/

kirawi commented 3 months ago

We could also fork unicode-segmentation and add support for ropey directly.

RossSmyth commented 3 months ago

I was curious to see how difficult it would be, so I poked at the ICU word segmenter. It turned out to be mostly straightforward to make it compile. It can be cleaned up; this was just a hack job. https://github.com/unicode-org/icu4x/compare/main...RossSmyth:icu4x:Ropey

I did not test anything but it does build.

With the above, it should work something like this (all untested):

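// Assumes the patched ICU4X branch linked above; upstream `segment_str`
// takes a `&str`, not a char iterator.
use icu::segmenter::WordSegmenter;
use ropey::Rope;

let segmenter = WordSegmenter::new_auto();
let rope = Rope::from_str("Hello World");

// Segment the whole rope: breaks at the start, around the space, and at the end.
let breakpoints: Vec<usize> =
    segmenter.segment_str(rope.chars()).collect();
assert_eq!(&breakpoints, &[0, 5, 6, 11]);

// Segment from char index 6 ("World"): offsets are relative to that position.
let breakpoints: Vec<usize> =
    segmenter.segment_str(rope.chars_at(6)).collect();
assert_eq!(&breakpoints, &[0, 5]);

kirawi commented 3 months ago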

I don't think we want to pull in icu because of how many dependencies it brings in.

kirawi commented 3 months ago

@pascalkuthe @the-mikedavis I took a look at creating a fork, and it looks feasible to have segmentation operate on a trait instead of just a str. Do we care about making it generic, or would it be fine to forgo that idea and simply switch from str to RopeSlice?
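
For reference, the generic option could look roughly like the sketch below. Everything in it is hypothetical: `SegmentInput`, `Chunked` (a stand-in for RopeSlice so the example needs no dependencies), and `whitespace_breaks` (a trivial whitespace splitter standing in for the real segmentation logic) are not taken from unicode-segmentation or from any fork.

```rust
/// Hypothetical abstraction: any text that can hand out its chars in order.
trait SegmentInput {
    fn chars(&self) -> Box<dyn Iterator<Item = char> + '_>;
}

/// Contiguous text: delegate to the inherent `str::chars`.
impl SegmentInput for str {
    fn chars(&self) -> Box<dyn Iterator<Item = char> + '_> {
        Box::new(str::chars(self))
    }
}

/// Stand-in for a rope slice: text stored as separate chunks.
struct Chunked(Vec<String>);

impl SegmentInput for Chunked {
    fn chars(&self) -> Box<dyn Iterator<Item = char> + '_> {
        Box::new(self.0.iter().flat_map(|chunk| chunk.chars()))
    }
}

/// Placeholder for the segmenter: reports the char offsets where text
/// switches between whitespace and non-whitespace, plus the end offset.
fn whitespace_breaks<T: SegmentInput + ?Sized>(text: &T) -> Vec<usize> {
    let mut breaks = Vec::new();
    let mut prev_ws: Option<bool> = None;
    let mut len = 0;
    for (i, c) in text.chars().enumerate() {
        let ws = c.is_whitespace();
        if prev_ws != Some(ws) {
            breaks.push(i);
        }
        prev_ws = Some(ws);
        len = i + 1;
    }
    if len > 0 {
        breaks.push(len);
    }
    breaks
}
```

With this shape, `whitespace_breaks("Hello World")` and `whitespace_breaks(&Chunked(vec!["Hel".into(), "lo Wor".into(), "ld".into()]))` both return `[0, 5, 6, 11]`. The non-generic alternative would drop the trait and take a RopeSlice directly, which is less code but ties the fork to ropey.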