Why are control characters treated as zero-width for strings?

unicode-rs / unicode-width

Displayed width of Unicode characters and strings according to UAX#11 rules.

From the docs:

fn width<'a>(&'a self) -> usize

Returns the string's displayed width in columns. Control characters are treated as having zero width.

(Ignore '\0' for the points below as it has special treatment.)

This seems inconsistent with the behaviour for individual chars, where the width of a control character is None. For consistency, I would expect exactly one of the following: (A) for a string, if any character has a width of None, the whole string has width None; or (B) control characters always have width Some(0).

IIUC, option (B) wasn't taken in order to stay consistent with wcwidth, which returns -1 for control characters. However, not taking option (A) can lead to unintuitive behaviour that goes unnoticed. E.g. if the string has LF/TAB/DEL in it, then you can get a width that doesn't make much sense.

Moreover, this violates an embedding law that one might expect to hold: width(format!("{}", c)) == width(c). As it stands, the comparison doesn't even type-check, because the string width is a usize while the char width is an Option<usize>.
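To make the difference concrete, here is a minimal sketch of the documented string behaviour next to option (A). It does not use the crate: char_width is a hypothetical toy stand-in (every non-control character counts as one column, so wide and combining characters are ignored), and the names str_width_lenient and str_width_strict are made up for illustration.

```rust
/// Toy stand-in for a per-character width lookup (hypothetical, NOT the
/// crate's table): '\0' is Some(0), other control characters are None,
/// and every other character is assumed to be one column wide.
fn char_width(c: char) -> Option<usize> {
    if c == '\0' {
        Some(0)
    } else if c.is_control() {
        None
    } else {
        Some(1)
    }
}

/// The documented string behaviour: control characters count as zero columns.
fn str_width_lenient(s: &str) -> usize {
    s.chars().map(|c| char_width(c).unwrap_or(0)).sum()
}

/// Option (A): a single character without a width makes the whole string
/// width-less. Summing into Option<usize> stops at the first None.
fn str_width_strict(s: &str) -> Option<usize> {
    s.chars().map(char_width).sum()
}
```

Under these assumptions, str_width_lenient("a\tb") is 2, which silently ignores the tab, while str_width_strict("a\tb") is None. With the strict variant the embedding law type-checks, and str_width_strict(&c.to_string()) == char_width(c) holds for this toy lookup.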

What is the reasoning behind the current behaviour?

P.S. I'm not asking for the library's behaviour to be changed. I'm writing a Haskell implementation and ran into this while looking at the test cases. My library follows (A) because it seemed like the right choice, so I wanted to know why you didn't pick (A).

What is the reasoning behind the current behaviour?

I may be able to answer this for you. From https://github.com/jquast/wcwidth/issues/54#issuecomment-1858569488

I just want to also add that this cannot be fixed in the wcwidth() and wcswidth() functions, as they intend to exactly match function signature and behavior of the POSIX functions.

The reason that C0 and C1 control characters return -1, is that the intended application, a terminal emulator especially, should handle these characters in a stream and remove them from the string before passing on to wcswidth. Especially items like \n, \b, and \t. They become complicated, it depends on the current position of the cursor, and also terminal settings, for example \b can wrap to previous row if it is located at column 0, and the number of spaces incurred by '\t' are dependent on the tab stop setting and the current cursor position. C1 characters like '\x1b' may begin a terminal escape sequence, and that too should be processed before sending to wcswidth, etc.
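As a rough illustration of why these characters have no fixed width, here is a sketch of a cursor-tracking pass in a heavily simplified terminal model. Everything in it is an assumption for the sake of the example: fixed tab stops every tab_stop columns, no line wrapping (so a backspace at column 0 stays put instead of moving to the previous row), newline returning to column 0, escape sequences not parsed, and every printable character one column wide. The function names are hypothetical.

```rust
/// Advance the cursor column for one character in a simplified terminal model.
/// Assumes `tab_stop > 0`, no wrapping, and 1-column printable characters.
fn advance(col: usize, c: char, tab_stop: usize) -> usize {
    match c {
        // Tab: jump to the next tab stop, so its width depends on `col`.
        '\t' => (col / tab_stop + 1) * tab_stop,
        // Backspace: move left one column, stopping at column 0 (no wrap).
        '\x08' => col.saturating_sub(1),
        // Newline / carriage return: assumed to return to column 0.
        '\n' | '\r' => 0,
        // Any other control character is ignored in this sketch.
        c if c.is_control() => col,
        // Printable character: assumed to occupy exactly one column.
        _ => col + 1,
    }
}

/// Column the cursor ends up in after writing `s`, starting from column 0.
fn final_column(s: &str, tab_stop: usize) -> usize {
    s.chars().fold(0, |col, c| advance(col, c, tab_stop))
}
```

With tab_stop set to 4, both "a\tb" and "abc\tb" end at column 5 in this model: the same tab contributes three columns in the first string and one in the second, which is why a context-free width for '\t' cannot be right in general.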

Why are control characters treated as zero-width for strings? #6