nushell / reedline

A feature-rich line editor - powering Nushell
https://docs.rs/reedline/
MIT License

Incorrect word boundary in VI mode #563

Open morzel85 opened 1 year ago

morzel85 commented 1 year ago

Platform: Ubuntu 20.04
Terminal software: gnome-terminal
Version: Reedline from Nushell 0.77.0

The word boundary (used in b and e) allows too many characters.

In VI mode a word "consists of a sequence of letters, digits and underscores" according to the Vim docs; this is also how classic Readline works.
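
To make the expected behaviour concrete, here is a minimal Rust sketch of that word definition. It is an illustration only, not reedline's code: the names `CharClass`, `classify` and `vi_words` are hypothetical, and treating every Unicode alphanumeric character as a word character is a simplifying assumption rather than Vim's exact `iskeyword` handling.

```rust
/// Simplified character classes for Vim-style word motions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CharClass {
    Blank,
    Keyword,     // letters, digits, underscore (assumption: Unicode-aware)
    Punctuation, // any other non-blank character
}

fn classify(c: char) -> CharClass {
    if c.is_whitespace() {
        CharClass::Blank
    } else if c.is_alphanumeric() || c == '_' {
        CharClass::Keyword
    } else {
        CharClass::Punctuation
    }
}

/// Split a line into maximal runs of a single non-blank class.
fn vi_words(line: &str) -> Vec<&str> {
    let mut words = Vec::new();
    let mut start = 0;
    let mut prev: Option<CharClass> = None;
    for (idx, c) in line.char_indices() {
        let class = classify(c);
        if let Some(p) = prev {
            if p != class {
                // A class change ends the current run; blanks are dropped.
                if p != CharClass::Blank {
                    words.push(&line[start..idx]);
                }
                start = idx;
            }
        }
        prev = Some(class);
    }
    // Flush the final run, if the line did not end in blanks.
    if let Some(p) = prev {
        if p != CharClass::Blank {
            words.push(&line[start..]);
        }
    }
    words
}

fn main() {
    println!("{:?}", vi_words("abc.def")); // ["abc", ".", "def"]
    println!("{:?}", vi_words("abc:def ghi_1")); // ["abc", ":", "def", "ghi_1"]
}
```

Under this definition `b` and `e` would stop at the dot or colon instead of jumping over the whole `abc.def`.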

Steps to reproduce

  1. Run Reedline in VI mode.
  2. Type abc.def or abc:def
  3. Use the b and e motions and notice that the dot and colon are treated as part of the word.

sholderbach commented 1 year ago

We have been piggybacking on the unicode-segmentation crate for most of our word splitting. This has the striking benefit of handling all the non-ASCII complexity flawlessly. But UAX #29 has some odd extra rules that can make a dot part of a word (for example in "e.g.").

You are right that we are not following the defaults for :set iskeyword correctly here. I think we would welcome a change to the word definition that better reflects what Vim does for the vi mode, and we could accept a separate set of rules for emacs mode, as requested in #570.

mjs commented 11 months ago

I've had a look at this as I keep getting caught out by dw and cw not doing what I expect in nushell :grin:. I'm totally new to the reedline code base and am relatively new to Rust so bear with me.

The way unicode-segmentation splits words is super nuanced. It might be a shame to lose the "correct" handling of unicode words that it does. It also doesn't look like there's any way to influence unicode-segmentation's word splitting behaviour from the outside.

I can see a few possible solutions to fixing word splitting in reedline to better match what Vi and Emacs do:

  1. Continue using unicode-segmentation for word splitting but then add an extra step which splits the words it generates further by looking through them for additional characters we want to split on. This is a bit clunky but at least we wouldn't be throwing away all the subtle handling of unicode that unicode-segmentation does. This might work because unicode-segmentation's concept of words tends to be broader than what we want in reedline.
  2. Implement independent word splitting in reedline by walking through the line's graphemes and applying our own rules. Treat all/most characters beyond ASCII as word characters. This should work fine in many cases but there's probably all sorts of unicode edge cases we'd be ignoring that would cause odd behaviour for people not working in straight ASCII.
  3. Vendor/fork unicode-segmentation and add support for customising word splitting.
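
A rough Rust sketch of what the extra step in option 1 might look like. It is not reedline code: `is_word_char`, `refine_segment` and `refine` are hypothetical names, and the input is a hand-written stand-in for the segments a UAX #29 splitter would return, so the example does not depend on the unicode-segmentation crate.

```rust
/// Assumption: alphanumeric characters (Unicode-aware) and '_' stay inside a word.
fn is_word_char(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Second pass: split one coarse segment wherever the character kind
/// (word character vs. anything else) changes.
fn refine_segment(segment: &str) -> Vec<&str> {
    let mut parts = Vec::new();
    let mut start = 0;
    let mut prev_kind: Option<bool> = None;
    for (idx, c) in segment.char_indices() {
        let kind = is_word_char(c);
        if let Some(p) = prev_kind {
            if p != kind {
                parts.push(&segment[start..idx]);
                start = idx;
            }
        }
        prev_kind = Some(kind);
    }
    // Flush the trailing run; an empty segment yields no parts.
    if start < segment.len() {
        parts.push(&segment[start..]);
    }
    parts
}

/// Apply the second pass to every segment produced by the first pass.
fn refine<'a>(coarse: &[&'a str]) -> Vec<&'a str> {
    coarse
        .iter()
        .copied()
        .flat_map(|seg| refine_segment(seg))
        .collect()
}

fn main() {
    // Stand-in for first-pass output: UAX #29 keeps each of these as one word.
    let coarse = ["abc.def", "abc:def", "ghi_1"];
    println!("{:?}", refine(&coarse));
    // ["abc", ".", "def", "abc", ":", "def", "ghi_1"]
}
```

The second pass only ever splits segments further, which fits the observation that unicode-segmentation's words tend to be broader than what reedline wants.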

I'm tempted to try option 1 first as it's the least invasive but I'm open to other ideas / dissenting opinions.

sholderbach commented 11 months ago

Thanks for looking into it!

Yeah, option 1 seems pretty reasonable as a starting point. The Vim documentation feels too sparse to derive rules for option 2 that would behave sensibly with more complex Unicode characters/graphemes. With a bit of effort this should be doable as well, but unicode-segmentation has certainly done some of the heavy lifting already. We certainly don't need to do all the look-up-table work unicode-segmentation has done for ultimate performance.