vim / vim

The official Vim repository
https://www.vim.org
Vim License
UTF8 normalisation of single byte code points stored in multiple bytes #2824

Open neerolyte opened 6 years ago

neerolyte commented 6 years ago

Vim appears to normalise overlong (multi-byte) UTF-8 encodings of single-byte code points when displaying them, e.g.:

$ echo $'\xc1\xb2\xc1\xaf\xc1\xaf\xc1\xb4 = root?' | view -

image

I'm concerned that this could be used to trick users into thinking that the content they're viewing in Vim is something other than what the file actually contains, and that this could be exploited maliciously.

I can reproduce this on several versions of Vim, across OSes, and in both graphical and CLI modes.

Copying the text out of Vim and passing it back through something like xxd confirms that Vim's display (and not my terminal) has normalised it to the single-byte versions of the characters, whereas every other editor I tested seems aware that these should be treated as invalid characters.
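As a point of comparison, here is a small Python sketch of how a strict UTF-8 decoder treats the same bytes. It is only an illustration of the "treated as invalid" behaviour described above, not a description of what any particular editor does internally; the helper name `strict_decode_fails` is made up for this example.

```python
# The bytes from the reproduction command: four overlong two-byte
# sequences followed by plain ASCII.
data = b"\xc1\xb2\xc1\xaf\xc1\xaf\xc1\xb4 = root?"


def strict_decode_fails(raw: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder rejects the input."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# A strict decoder refuses the overlong sequences outright.
rejected = strict_decode_fails(data)

# With errors="replace", the invalid bytes become U+FFFD instead of
# being silently turned into the ASCII letters "root".
lenient = data.decode("utf-8", errors="replace")

print(rejected)
print(lenient)
```

Either way the reader gets a visible signal that the leading bytes are not ordinary ASCII text.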

Emacs displays it like:

image

Which seems pretty useful for being aware there's something dodgy going on.

But even this would be ok:

image

bfredl commented 6 years ago

Related issue: #2089. I have a patch for this: https://github.com/vim/vim/commit/c6f93d6f6ce23a24970bcbb90b72f7cf6f5a352c. Does its behavior look sane to you?

tonymec commented 6 years ago

Vim has always (AFAIR) normalized overlong UTF-8 sequences on display. AFAIK it also accepts codepoints U+110000 to U+7FFFFF as well as the last two codepoints of each plane (U+nFFFE and U+nFFFF), all of which are invalid and "will never be used" according to the Unicode Consortium. Is that a problem? By the arithmetic of UTF-8, 0xC1 0xB2 (if it were valid) could be nothing other than lowercase r, and the same goes (to be extreme) for 0xFC 0x80 0x80 0x80 0x81 0xB2.
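The arithmetic referred to above can be sketched in Python as follows. This is a deliberately permissive decoder written for illustration only (it is not Vim's code, and the names `naive_decode`, `min_length` and `is_overlong` are invented here): it takes the payload bits of the lead byte, appends six bits from each continuation byte, and never checks whether a shorter encoding exists.

```python
def naive_decode(seq: bytes) -> int:
    """Decode one UTF-8 style sequence (1 to 6 bytes) without any
    overlong or range checks, returning the resulting code point."""
    lead = seq[0]
    if lead < 0x80:
        return lead  # plain ASCII byte
    # Count the leading 1 bits of the lead byte: that is the sequence length.
    length = 0
    mask = 0x80
    while lead & mask:
        length += 1
        mask >>= 1
    if length < 2 or length > 6 or len(seq) != length:
        raise ValueError("malformed sequence")
    value = lead & (mask - 1)  # payload bits of the lead byte
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("expected a continuation byte")
        value = (value << 6) | (b & 0x3F)  # append 6 payload bits
    return value


def min_length(cp: int) -> int:
    """Shortest encoding length for a code point in the original 6-byte scheme."""
    for length, limit in enumerate((0x80, 0x800, 0x10000, 0x200000, 0x4000000), 1):
        if cp < limit:
            return length
    return 6


def is_overlong(seq: bytes) -> bool:
    """True if the sequence uses more bytes than the code point needs."""
    return len(seq) > min_length(naive_decode(seq))


pairs = [b"\xc1\xb2", b"\xc1\xaf", b"\xc1\xaf", b"\xc1\xb4"]
word = "".join(chr(naive_decode(p)) for p in pairs)
print(word)
print(hex(naive_decode(b"\xfc\x80\x80\x80\x81\xb2")))
print(all(is_overlong(p) for p in pairs))
```

Both the two-byte and the six-byte forms yield 0x72, so the decoded value is unambiguous; what a strict decoder adds is the `is_overlong` style check that rejects such sequences instead of accepting them.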

tonymec commented 6 years ago

BTW, Emacs's representation is incorrect. Octal 362 357 357 364, i.e. hex F2 EF EF F4, corresponds to òïïô or, in UTF-8, 0xC3 0xB2 0xC3 0xAF 0xC3 0xAF 0xC3 0xB4. What looks insane to me is Emacs's behaviour.
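For anyone who wants to check the conversion, a quick Python sketch (assuming the octal values are read as Latin-1 / U+00xx code points, as the comment above does):

```python
# Octal values as quoted from the Emacs display.
octal_values = "362 357 357 364".split()

# Octal -> integer code points.
code_points = [int(o, 8) for o in octal_values]
hex_form = " ".join(f"{cp:02X}" for cp in code_points)

# Read as Latin-1 characters, then re-encode those characters as valid UTF-8.
text = bytes(code_points).decode("latin-1")
utf8_form = " ".join(f"0x{b:02X}" for b in text.encode("utf-8"))

print(hex_form)
print(text)
print(utf8_form)
```

Note that the valid UTF-8 encoding starts each pair with 0xC3, not the 0xC1 found in the original file.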

neerolyte commented 6 years ago

@bfredl I think you're right that the patch in #2089 would resolve this well enough; with it in place the bytes at least show up as clearly illegal characters (I haven't done an extensive review of the math in the code, though).

@tonymec You're right, the numbers Emacs has chosen to display are bizarre.