martanne / vis

A vi-like editor based on Plan 9's structural regular expressions

Byte order marker (BOM) is displayed as empty cell #1095

Open njhanley opened 1 year ago

njhanley commented 1 year ago

The byte order marker (BOM) is the use of a zero width no-break space (ZWNBSP) character (U+FEFF) at the start of a file to indicate the encoding byte order in UTF-16/32. While not useful in UTF-8, it is legal and occasionally used as a signature to indicate UTF-8 encoding.
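To make the byte-level picture concrete, here is a minimal Python sketch showing how U+FEFF is encoded in UTF-8 and UTF-16. The helper name `starts_with_utf8_bom` is made up for illustration and is not part of vis.

```python
import codecs

ZWNBSP = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE, used as the BOM at the start of a file


def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if data begins with the UTF-8 encoding of U+FEFF."""
    return data.startswith(codecs.BOM_UTF8)


# The same code point yields a different byte sequence in each encoding,
# which is what lets a reader detect the byte order in UTF-16.
utf8_bom = ZWNBSP.encode("utf-8")
utf16_le_bom = ZWNBSP.encode("utf-16-le")
utf16_be_bom = ZWNBSP.encode("utf-16-be")

print(utf8_bom.hex(" "))      # ef bb bf
print(utf16_le_bom.hex(" "))  # ff fe
print(utf16_be_bom.hex(" "))  # fe ff
```

In UTF-8 there is only one possible byte order, so the three bytes `ef bb bf` carry no ordering information and act purely as a signature.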

Consider this file: bom.txt. When opened in vis, the BOM is visible as a blank cell when it should be invisible. Interestingly, ZWNBSP is correctly displayed (or rather not displayed) when it appears elsewhere in the file.

https://unicode.org/faq/utf_bom.html#BOM

mcepl commented 1 year ago

With reference to https://github.com/martanne/vis/wiki/FAQ#how-should-i-edit-files-in-legacy-encodings I would suggest WONTFIX here. vis (in comparison to vim) doesn’t go into the business of dealing with encodings (and CRLF vs. LF); it is just a plain text editor. If anybody wants to get rid of the BOM, there are ways to do it. Also, if you are dealing with text files originating from that platform, you may well know that dos2unix removes the BOM as well.

Yes, a BOM in UTF-8 is an abomination of lesser platforms (so-called “operating systems”), which punish everybody else for their unfortunate decision to use a double-byte encoding for text. UTF-8 doesn’t need a BOM, but that whole business should be kept outside of vis, in my opinion.

njhanley commented 1 year ago

The issue isn't that vis should interpret or remove BOMs; it's that a ZWNBSP at the start of a file (a BOM) is currently rendered differently from a ZWNBSP elsewhere in the file. See zwnbsp.txt. The ZWNBSP between 'H' and 'e' is correctly rendered as invisible.

mcepl commented 1 year ago

Cannot reproduce here, with vis v0.8-git +curses +lua +tre +acl +selinux I get

[screenshot: screenshot-2023-05-06_22-05-1683406371]

rnpnr commented 1 year ago

That was the point. If you open bom.txt, vis consumes the cursor and the window renders incorrectly. In zwnbsp.txt the same bytes are present between 'H' and 'e', but vis correctly renders them as invisible and it doesn't affect the rest of the UI. You will have to use something like od to see the bytes, e.g.: od -t x1 bom.txt

I have noticed this problem before but usually I just press x and delete the character if the file has it at the start because I really don't care about the file being compatible with where it came from.
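For anyone without the attachments at hand, the comparison above can be approximated in Python. This is only a sketch: the file contents below are assumed ("Hello" plus a newline), since the actual bom.txt and zwnbsp.txt are not reproduced in this thread, and `hex_dump` is a hypothetical helper that mimics the byte column of `od -t x1`.

```python
def hex_dump(data: bytes) -> str:
    """Format bytes as space-separated two-digit hex, like the data columns of od -t x1."""
    return " ".join(f"{b:02x}" for b in data)


# Assumed stand-ins for the attachments: the real file contents may differ.
bom_bytes = "\ufeffHello\n".encode("utf-8")     # ZWNBSP at the start (a BOM)
zwnbsp_bytes = "H\ufeffello\n".encode("utf-8")  # ZWNBSP between 'H' and 'e'

print(hex_dump(bom_bytes))     # ef bb bf 48 65 6c 6c 6f 0a
print(hex_dump(zwnbsp_bytes))  # 48 ef bb bf 65 6c 6c 6f 0a

# The same three bytes appear in both; only their offset differs.
bom_offset = bom_bytes.find(b"\xef\xbb\xbf")
middle_offset = zwnbsp_bytes.find(b"\xef\xbb\xbf")
print(bom_offset, middle_offset)  # 0 1
```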

njhanley commented 1 year ago

The same behavior can be seen with other zero width characters such as zero-width space (ZWSP) and word joiner (WJ).

zwsp-start.txt vs zwsp-middle.txt
wj-start.txt vs wj-middle.txt
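The three characters mentioned so far share the same Unicode general category, which may be why they behave alike. A small Python sketch using the standard `unicodedata` module lists their names, categories, and UTF-8 bytes; the `ZERO_WIDTH` table and `describe` helper are just illustrative names.

```python
import unicodedata

# The zero-width characters discussed in this issue.
ZERO_WIDTH = {
    "ZWSP": "\u200b",
    "WJ": "\u2060",
    "ZWNBSP": "\ufeff",
}


def describe(ch: str) -> tuple[str, str, str]:
    """Return (Unicode name, general category, UTF-8 bytes as hex) for one character."""
    return (unicodedata.name(ch), unicodedata.category(ch), ch.encode("utf-8").hex(" "))


for abbr, ch in ZERO_WIDTH.items():
    name, category, utf8_hex = describe(ch)
    print(f"{abbr:7} U+{ord(ch):04X} {category} {utf8_hex}  {name}")
```

All three are in category Cf (format), so a renderer that special-cases format characters would be expected to treat them the same way regardless of position.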

mcepl commented 1 year ago

I still believe that the principle matters: all shenanigans with incorrectly encoded files (and yes, a file with a BOM is an incorrectly encoded one) should stay outside of vis and by definition are NOT a vis problem.

rnpnr commented 1 year ago

I agree with the principle, but I also don't like that the UI gets garbled by files like bom.txt. I suspect that it's a one- or two-line fix to stop that from happening. If such a patch is presented, I would see no issue with including it.

mcepl commented 1 year ago

Sure, if it is so, then I guess, “SHOW ME THE PATCH!”. Also, what should happen with the content of the file? Should the BOM just be hidden but left untouched in the file, or should it really be eliminated?

rnpnr commented 1 year ago

Leave it untouched like what happens when the bytes appear in the middle of the file.

I'll look into it later if I have time, but I suspect what is happening is that vis is decrementing the index of where the next character is supposed to be drawn one cell too many when it's the first character in the line. Then everything is off by one for the rest of the window.
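As a reference for what a fix should achieve, the expected column placement can be modeled in a few lines of Python. This is not vis code and does not reproduce the suspected bug; it is a toy layout under simplifying assumptions (format and combining characters take zero cells, everything else takes one, and wide characters and tabs are ignored). The names `cell_width` and `layout` are hypothetical.

```python
import unicodedata


def cell_width(ch: str) -> int:
    """Toy width function: 0 cells for format/combining characters, 1 otherwise.

    Assumption: real terminals use wcwidth-style tables; this ignores wide
    characters, tabs, and control characters.
    """
    if unicodedata.category(ch) in ("Cf", "Mn", "Me"):
        return 0
    return 1


def layout(line: str) -> list[tuple[str, int]]:
    """Return (character, column) for every character that occupies a cell."""
    cells = []
    col = 0
    for ch in line:
        width = cell_width(ch)
        if width > 0:
            cells.append((ch, col))
        col += width  # zero-width characters never move the drawing position
    return cells


print(layout("\ufeffHello"))  # [('H', 0), ('e', 1), ('l', 2), ('l', 3), ('o', 4)]
print(layout("H\ufeffello"))  # [('H', 0), ('e', 1), ('l', 2), ('l', 3), ('o', 4)]
```

Under this model a zero-width character at column 0 and one in the middle of the line produce identical layouts, which matches the behavior requested in this issue: the bytes stay in the file, and the visible text is not shifted.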