Open morzel85 opened 1 year ago
We have been piggybacking on the `unicode-segmentation` crate for most of our word splitting. This has the striking benefit of handling all the non-ASCII complexity flawlessly. But UAX29 has some weird extra rules that can make dots part of a word, e.g. for `e.g.`
You are right that we are not following the defaults for `:set iskeyword` correctly here.
I think we would welcome a change to the word definition to better reflect what vim is doing for the vi mode, and could accept a separate set of rules for emacs mode as requested in #570.
I've had a look at this as I keep getting caught out by `dw` and `cw` not doing what I expect in nushell :grin:. I'm totally new to the reedline code base and am relatively new to Rust so bear with me.
The way `unicode-segmentation` splits words is super nuanced. It might be a shame to lose the "correct" handling of unicode words that it does. It also doesn't look like there's any way to influence `unicode-segmentation`'s word splitting behaviour from the outside.
I can see a few possible solutions to fixing word splitting in reedline to better match what Vi and Emacs do:
1. Keep using `unicode-segmentation` for word splitting but then add an extra step which splits the words it generates further by looking through them for additional characters we want to split on. This is a bit clunky but at least we wouldn't be throwing away all the subtle handling of unicode that `unicode-segmentation` does. This might work because `unicode-segmentation`'s concept of words tends to be broader than what we want in reedline.
- ... `unicode-segmentation` and add support for customising word splitting.

I'm tempted to try option 1 first as it's the least invasive but I'm open to other ideas / dissenting opinions.
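A rough sketch of what the extra step in option 1 could look like, assuming the segments have already been produced by something like `split_word_bounds()` from `unicode-segmentation`. The names `CharClass`, `classify`, `split_on_class_change` and `refine_segments` are made up for illustration and are not part of reedline. The classifier is only an approximation of Vim's default word definition, and it works per `char`, so combining marks would end up in the `Other` class; a real implementation would probably want to classify whole grapheme clusters instead.

```rust
/// Character classes loosely modelled on Vim's default idea of a word.
/// (Hypothetical helper type, not taken from reedline.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CharClass {
    /// Letters, digits and underscore.
    Keyword,
    /// Any whitespace.
    Whitespace,
    /// Punctuation and everything else.
    Other,
}

/// Assumption: "letter or digit" is approximated with `char::is_alphanumeric`.
fn classify(c: char) -> CharClass {
    if c.is_alphanumeric() || c == '_' {
        CharClass::Keyword
    } else if c.is_whitespace() {
        CharClass::Whitespace
    } else {
        CharClass::Other
    }
}

/// Splits one already-segmented word wherever the character class changes.
fn split_on_class_change(segment: &str) -> Vec<&str> {
    let mut parts = Vec::new();
    let mut start = 0;
    let mut previous: Option<CharClass> = None;

    for (idx, c) in segment.char_indices() {
        let class = classify(c);
        if let Some(prev) = previous {
            if prev != class {
                // Class changed: close the current part at this byte offset.
                parts.push(&segment[start..idx]);
                start = idx;
            }
        }
        previous = Some(class);
    }
    if start < segment.len() {
        parts.push(&segment[start..]);
    }
    parts
}

/// Post-processes a sequence of segments (e.g. the output of a UAX29 word
/// splitter) into the narrower words we want for vi-style motions.
fn refine_segments<'a>(segments: impl IntoIterator<Item = &'a str>) -> Vec<&'a str> {
    segments
        .into_iter()
        .flat_map(|segment| split_on_class_change(segment))
        .collect()
}
```

For example, if the first pass hands over `["foo.bar:baz", " ", "qux"]` (assumed input, not actual crate output), `refine_segments` would return `["foo", ".", "bar", ":", "baz", " ", "qux"]`.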
Thanks for looking into it!
Yeah 1.) seems pretty reasonable as a starting point. The vim documentation feels a bit sparse to parse out the rules for a 2.) that would behave sensibly with more complex Unicode characters/graphemes.
With a bit of effort this should be doable as well, but `unicode-segmentation` has certainly done some of the heavy lifting already. We certainly don't need to do all the look-up-table work `unicode-segmentation` has done for ultimate performance.
Platform: Ubuntu 20.04
Terminal software: gnome-terminal
Version: Reedline from Nushell 0.77.0
The word boundary (used in `b` and `e`) allows too many characters. In VI mode a word should consist of a "sequence of letters, digits and underscores" according to the VIM docs; this is also how classic Readline works.
Steps to reproduce
Use the `b` and `e` motions and notice that the dot and colon are treated as parts of a word
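To make the expected behaviour concrete, here is a minimal sketch of where `e` and `b` would land if a word were limited to letters, digits and underscores, with runs of other non-blank characters forming their own words. This is not reedline's implementation: `MotionClass`, `motion_class`, `word_end` and `word_start` are hypothetical names, positions are indices into a `char` slice rather than grapheme clusters, and "letter or digit" is approximated with `char::is_alphanumeric`.

```rust
/// Hypothetical character classes for vi-style word motions.
#[derive(Clone, Copy, PartialEq, Eq)]
enum MotionClass {
    Keyword, // letters, digits, underscore
    Blank,   // whitespace
    Other,   // any other non-blank character
}

fn motion_class(c: char) -> MotionClass {
    if c.is_alphanumeric() || c == '_' {
        MotionClass::Keyword
    } else if c.is_whitespace() {
        MotionClass::Blank
    } else {
        MotionClass::Other
    }
}

/// Index the cursor would move to for `e` (end of word), starting at `pos`.
fn word_end(chars: &[char], pos: usize) -> usize {
    if chars.is_empty() {
        return 0;
    }
    let last = chars.len() - 1;
    // Always move at least one character forward, unless already at the end.
    let mut i = (pos + 1).min(last);
    // Skip blanks between words.
    while i < last && motion_class(chars[i]) == MotionClass::Blank {
        i += 1;
    }
    // Extend to the last character of the run sharing the same class.
    let class = motion_class(chars[i]);
    while i < last && motion_class(chars[i + 1]) == class {
        i += 1;
    }
    i
}

/// Index the cursor would move to for `b` (start of word), starting at `pos`.
fn word_start(chars: &[char], pos: usize) -> usize {
    if chars.is_empty() || pos == 0 {
        return 0;
    }
    // Always move at least one character back.
    let mut i = pos.min(chars.len()) - 1;
    // Skip blanks between words.
    while i > 0 && motion_class(chars[i]) == MotionClass::Blank {
        i -= 1;
    }
    // Extend to the first character of the run sharing the same class.
    let class = motion_class(chars[i]);
    while i > 0 && motion_class(chars[i - 1]) == class {
        i -= 1;
    }
    i
}
```

With the text `foo.bar:baz` and the cursor on the first `f`, repeated `e` would stop at indices 2, 3, 6, 7 and 10 under this definition, so the dot and the colon each count as a word of their own instead of being glued to the neighbouring letters.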