Multibyte surrounds broken again

Mango0x45 commented 4 months ago

EDIT: Changed example to use surrounds that are more visually-distinct from each other

Checklist

[X] Have you tried updating the plugin to the latest version?
[X] Have you checked the Breaking Changes issue?
[X] Have you read through :h nvim-surround to see if there might be any relevant information there?

Neovim Version

NVIM v0.10.0

Plugin Version

Tagged (Stable)

Minimal Configuration

require('nvim-surround').setup {
    surrounds = {
        ['“'] = {
            add = { '„', '“' },

        },
    }
}

Sample Buffer

* is the cursor position

Äuß*erlich

Keystroke Sequence

ysaw“

Expected behavior

„Äußerlich“

Actual behavior

“Äußerlich“

Additional context

It may be hard to see, but it seems that as of recent if I have a custom surround where the left- and right quotation marks are different (in my case U+2018 LEFT SINGLE QUOTATION MARK and U+2019 RIGHT SINGLE QUOTATION MARK), then my custom surrounds are no longer used, instead the text object is surrounded by whatever key I pressed.

In other words when I surround a word with ysaw<RIGHT QUOTE> I expect a left-quote on the left and a right-quote on the right (as my configuration dictates) but I instead get a right-quote on both ends.

Mango0x45 commented 4 months ago

Upon further investigation I feel this might be related to #326 (cc @bew). I see now that I was tagged in that PR but I didn’t get a notification about it (thanks GitHub).

kylechui commented 4 months ago

Gah! I just realized that the unit test is basically just redoing the default invalid_key_behavior; sorry for the breakage :sob:

bew commented 4 months ago

😬 I can take some time in a few hours to try and debug this

It would really help if you can share a way to reproduce the issue without using a german keyboard layout 😬

Mango0x45 commented 4 months ago

It would really help if you can share a way to reproduce the issue without using a german keyboard layout 😬

Vim has Unicode input (<C-u>201E/<C-u>201C), and you can also use Vim digraphs. The digraphs in my example are :9 and "6.

Of course there is also copy+paste :)

kylechui commented 4 months ago

Inspecting the byte stream that Vim is using to store the surround in the config table reveals:

  ["<e2><80><fe>X<9c>"] = {
    add = <function 55>,
    delete = <function 56>
  }

From my understanding the quote character should just be <e2><80><9c>, so I'm not sure what the <fe>X is doing there :thinking:

kylechui commented 4 months ago

I have diagnosed the issue to be replace_term_keycodes translating intermediary bytes in longer multi-byte characters; working on a fix.

bew commented 4 months ago

From my understanding the quote character should just be <e2><80><9c>, so I'm not sure what the <fe>X is doing there 🤔

@kylechui I saw a number of <fe>X when a key was double-escaped as you can see in my OP here https://github.com/neovim/neovim/issues/29034, maybe another case of this?

Maybe it's because in the config you take the quote that is actually already the raw termcode, and you pass it again to the termcode replacement. But it's unclear when to pass the input key coming from the config, as that quote should not be escaped, but <M-]> should..

kylechui commented 4 months ago

From my understanding the quote character should just be <e2><80><9c>, so I'm not sure what the <fe>X is doing there 🤔

@kylechui I saw a number of <fe>X when a key was double-escaped as you can see in my OP here neovim/neovim#29034, maybe another case of this?

I think in general, nvim_replace_termcodes will parse each byte of its input and replace termcodes with their corresponding byte sequences. It just so happens to be that for some Unicode characters (which are multi-byte), one of the bytes represents some termcode and is getting replaced, which is why the <fe>X is appearing in the middle of the string.

bew commented 4 months ago

So in the config it's unclear when you already have termcodes (the quote) or you should replace the termcode (e.g. <M-]>), a workaround could be to detect with a Lua regex like %b<> if there are any <something> to replace and only replace if so?

kylechui commented 4 months ago

So in the config it's unclear when you already have termcodes (the quote) or you should replace the termcode (e.g. <M-]>), a workaround could be to detect with a Lua regex like %b<> if there are any <something> to replace and only replace if so?

According to the UTF-8 format we can just check the first byte (see buffers.get_first_byte for where I have already done something similar), and return the unmodified string if it is a multi-byte character, otherwise assume it is multiple characters and parse it as a termcode.

kylechui commented 4 months ago

Hey @Mango0x45 can you try #333 ? I had to comment out the unit test since for some reason it's not detecting buffer changes. I did try it manually by loading the command into a macro register and playing it. Please let me know how it goes (in particular, add, delete, and change)

Mango0x45 commented 3 months ago

I’ll take a look!

kylechui commented 3 months ago

@Mango0x45 Tagged as 2.1.11; hopefully this solves the issue once and for all! Please let me know if you notice any more issues and thanks for reporting!

kylechui / nvim-surround