unicode-rs / unicode-width

Displayed width of Unicode characters and strings according to UAX#11 rules.
https://unicode-rs.github.io/unicode-width
Other
217 stars 27 forks source link

Should line breaks be width 0 or 1? #60

Closed Manishearth closed 1 month ago

Manishearth commented 5 months ago

See https://github.com/unicode-rs/unicode-width/issues/55#issuecomment-2155112434

From Unicode:

control characters are not given the Default_Ignorable_Code_Point property. To avoid security problems, such characters [...], when not interpreted and not displayable by normal rendering, should be displayed in fallback rendering with a fallback glyph, so that there is a visible indication of their presence in the text.

Newline control characters (and other break characters) are typically "displayable by normal rendering", however they do not have a rendering that has anything that could be termed as a "width", so there's no single answer as to what this crate should provide here according to the spec.

Applications using unicode-width ought to be using a higher level protocol to split text on newlines (and other similar things) before feeding it to this crate (and perhaps we should document that). It isn't really possible for this crate to provide such a protocol itself since there are nuances to how different contexts handle newlines (Windows vs Unix being the most obvious one, but wrapping behavior also participates).

However, we should pick some consistent, sensible default for what newlines are treated as. My guess is that we should treat them as width 0 (for both CR and LF, and any other line breaks that exist).

h/t/ @decathorpe for noticing

cc @Jules-Bertholet

Jules-Bertholet commented 5 months ago

The case for treating as width 1 is that, in a context where for whatever reason the text is forced to be displayed on a single line, most software will replace line breaks with spaces.

decathorpe commented 5 months ago

For reference, I noticed this behaviour change when investigating test failures caused by the v0.1.12 → v0.1.13 update in the bwrap crate: https://github.com/micl2e2/bwrap

For example, this test is now failing: https://github.com/micl2e2/bwrap/blob/master/tests/maybrk.rs#L64

As far as I can tell, the test tries line-wrapping this string into 3 columns:

hel
lo 
wor
ld

The expected result seems to be that there is no change when line-wrapping this to width 3 (since all lines already are width 3), but since newline characters now contribute an additional +1 to the length, the result is completely mangled.


EDIT: Maybe this logic for line wrapping was fishy all along, but it worked with how unicode-width determined the displayed widths. When counting \n as having width of 1, the logic would need to be adapted to have a special case for an existing \n character immediately after a candidate line-wrapping point.

m-hilgendorf commented 4 months ago

Coming here with the same issue, the update from 0.1.12 -> 0.1.13 broke code that relied on newline characters having width = 0. It should at least have been a minor version update, since the API contract was broken.

in a context where for whatever reason the text is forced to be displayed on a single line, most software will replace line breaks with spaces.

Then surely it's a different string with different width, no?

Manishearth commented 4 months ago

The case for treating as width 1 is that, in a context where for whatever reason the text is forced to be displayed on a single line, most software will replace line breaks with spaces.

That's exceedingly rare, though.


Personally I think the case for the math getting easier especially near existing line break opportunities is compelling enough for it to be 0.

@m-hilgendorf

Then surely it's a different string with different width, no?

No, this is about rendering. When an emoji gets represented as a bunch of tofu that is not a "different string", that is just a rendering choice, and this crate cannot predict all rendering choices. It just makes a best effort.

@decathorpe

EDIT: Maybe this logic for line wrapping was fishy all along, but it worked with how unicode-width determined the displayed widths. When counting \n as having width of 1, the logic would need to be adapted to have a special case for an existing \n character immediately after a candidate line-wrapping point.

Yes, I think this logic is not using this crate correctly and just happened to work. However I think that this kind of pattern should mostly work pleasantly, ideally.

Jules-Bertholet commented 4 months ago

That's exceedingly rare, though.

Not really, no? Most any sort of search box will behave like this, for example.

That being said, if the breakage really is widespread, it may be worth reverting to width 0 for 0.1.x and only doing it for 0.2.x.

m-hilgendorf commented 4 months ago

My point is about code structure and not rendering, if you have let s: String and call s.replace('\n', ' ') you can always choose which value you call .width() upon to get the desired answer (either treating control characters as zero width, or treating them as non-zero after replacement). By removing the distinction a caller has to deal with it with a special case.

Concretely, I have to change code from

let width = string.width();

to

// Not exactly this, but you get the idea
let width = if string == "\n" { 0 } else { string.width() };

As a result of the change. And this is in line wrapping logic, like you mention.

Manishearth commented 4 months ago

Not really, no? Most any sort of search box will behave like this, for example.

Not a terminal, though? And that's an actual text stream transform that happens: it is turned into a space, which makes it the "different string" thing that @m-hilgendorf talks about.

Manishearth commented 4 months ago

@m-hilgendorf I assumed @Jules-Bertholet was alluding to the cases where some software will render an unwrapped linebreak as a space or a control character. Rare but happens, that is about rendering. However he's actually talking about cases where the text stream is replaced, where I agree that that should be treated as a replacement.

Jules-Bertholet commented 4 months ago

Not a terminal, though?

Is this library intended to be terminal-specific? If it is, there are many other things we should be doing differently. (And in any case, a TUI could perfectly well have such an interface)

And that's an actual text stream transform that happens: it is turned into a space

That's one possible implementation, but not the only one. Merely changing how the line-breaking character is rendered would also be perfectly reasonable. (Rendering it as zero-width would not be reasonable, however.)

As for line-wrapping, AIUI the extra logic needed to handle newlines is just an extension of the work needed to handle end-of-line spaces. For example, wrapping hello world into lines of maximum width 5 should result in hello/world, not hello/worl/d; the space character immediately following the break point is discarded, just as a newline at that position would be.

Manishearth commented 4 months ago

Is this library intended to be terminal-specific?

Not exactly, but it's more used that way.

That's one possible implementation, but not the only one.

Yes, but this behavior is most commonly seen in the wild as a text stream replacement. The choice of implementation has other implications, it's not purely internal.

I have seen the "treat linebreak as a 1-width character" behavior before without it being a text stream replacement but it's incredibly rare IME.

Jules-Bertholet commented 4 months ago

I have seen the "treat linebreak as a 1-width character" behavior before without it being a text stream replacement but it's incredibly rare IME.

In my testing, GTK and QT both seem to do it this way, so not that rare I would think?

Manishearth commented 4 months ago

Are they actually doing that or performing a text stream replacement on pasting?

Jules-Bertholet commented 4 months ago

Are they actually doing that or performing a text stream replacement on pasting?

The text stream is not replaced, no—copying back out of the field preserves the newline. (GTK shows a visible arrow glyph, QT a blank space)

Manishearth commented 4 months ago

Hmm, interesting. That does present a legitimate dilemma here.

decathorpe commented 4 months ago

Are they actually doing that or performing a text stream replacement on pasting?

The text stream is not replaced, no—copying back out of the field preserves the newline. (GTK shows a visible arrow glyph, QT a blank space)

This replacement only happens in single-line input fields though, right? So the line break needs to be replaced with something, or things will look broken. That's not really a fair comparison to the line-wrapping use case IMO.

Jules-Bertholet commented 4 months ago

One more point in favor of line breaks being width 1, is that nearly all applications will display them as having advance width when they are selected.


It's true that, for the purpose of determining line widths for line wrapping, line break characters should be treated as having width 0, not 1, and therefore line-wrapping logic needs to special-case them if we keep the present assignments. But, as I mention above, they are not the only kind of character that line-wrapping logic might need to special case. I mention word-separating whitespace above; the soft hyphen is another example. Programs performing line-wrapping will differ in how exactly they handle these characters, just as they will differ in which line-breaks they recognize (out of \n \r\n \r etc); unicode-width's current behavior gives the right fallback for when the application does not define any special handling.

Manishearth commented 4 months ago

That's relatively convincing.

Perhaps we should document this explicitly; that applications shouldn't feed this crate strings with newlines unless they are ok with them being treated as being on the same line with width 1.

decathorpe commented 4 months ago

Ok, so what should crates that do line-wrapping do instead? Replace newlines with spaces before wrapping at a specified width, and treat input string as a single-line string? What about line-wrapping methods that preserve existing line breaks? Both of these will likely require more complicated logic and / or data copying than before.

Manishearth commented 4 months ago

@decathorpe process text one line at a time, splitting into lines first. I wouldn't actually replace anything, the idea would be to never ask this crate for computing the width of anything with a newline. The statefullness of this crate does not extend past newlines so if you still need to add them up afterwards you can. I don't think this would be an extra allocation/perf cost/copying, it will complicate the logic a little bit.

Manishearth commented 4 months ago

But as far as I can tell for line wrapping you need to pay attention to the newlines anyway