Closed Manishearth closed 2 months ago
The case for treating as width 1 is that, in a context where for whatever reason the text is forced to be displayed on a single line, most software will replace line breaks with spaces.
For reference, I noticed this behaviour change when investigating test failures caused by the v0.1.12 → v0.1.13 update in the bwrap
crate: https://github.com/micl2e2/bwrap
For example, this test is now failing: https://github.com/micl2e2/bwrap/blob/master/tests/maybrk.rs#L64
As far as I can tell, the test tries line-wrapping this string into 3 columns:
hel
lo
wor
ld
The expected result seems to be that there is no change when line-wrapping this to width 3 (since all lines already are width 3), but since newline characters now contribute an additional +1 to the length, the result is completely mangled.
EDIT: Maybe this logic for line wrapping was fishy all along, but it worked with how unicode-width determined the displayed widths. When counting \n
as having width of 1, the logic would need to be adapted to have a special case for an existing \n
character immediately after a candidate line-wrapping point.
Coming here with the same issue, the update from 0.1.12 -> 0.1.13 broke code that relied on newline characters having width = 0. It should at least have been a minor version update, since the API contract was broken.
in a context where for whatever reason the text is forced to be displayed on a single line, most software will replace line breaks with spaces.
Then surely it's a different string with different width, no?
The case for treating as width 1 is that, in a context where for whatever reason the text is forced to be displayed on a single line, most software will replace line breaks with spaces.
That's exceedingly rare, though.
Personally I think the case for the math getting easier especially near existing line break opportunities is compelling enough for it to be 0.
@m-hilgendorf
Then surely it's a different string with different width, no?
No, this is about rendering. When an emoji gets represented as a bunch of tofu that is not a "different string", that is just a rendering choice, and this crate cannot predict all rendering choices. It just makes a best effort.
@decathorpe
EDIT: Maybe this logic for line wrapping was fishy all along, but it worked with how unicode-width determined the displayed widths. When counting \n as having width of 1, the logic would need to be adapted to have a special case for an existing \n character immediately after a candidate line-wrapping point.
Yes, I think this logic is not using this crate correctly and just happened to work. However I think that this kind of pattern should mostly work pleasantly, ideally.
That's exceedingly rare, though.
Not really, no? Most any sort of search box will behave like this, for example.
That being said, if the breakage really is widespread, it may be worth reverting to width 0 for 0.1.x and only doing it for 0.2.x.
My point is about code structure and not rendering, if you have let s: String
and call s.replace('\n', ' ')
you can always choose which value you call .width()
upon to get the desired answer (either treating control characters as zero width, or treating them as non-zero after replacement). By removing the distinction a caller has to deal with it with a special case.
Concretely, I have to change code from
let width = string.width();
to
// Not exactly this, but you get the idea
let width = if string == "\n" { 0 } else { string.width() };
As a result of the change. And this is in line wrapping logic, like you mention.
Not really, no? Most any sort of search box will behave like this, for example.
Not a terminal, though? And that's an actual text stream transform that happens: it is turned into a space, which makes it the "different string" thing that @m-hilgendorf talks about.
@m-hilgendorf I assumed @Jules-Bertholet was alluding to the cases where some software will render an unwrapped linebreak as a space or a control character. Rare but happens, that is about rendering. However he's actually talking about cases where the text stream is replaced, where I agree that that should be treated as a replacement.
Not a terminal, though?
Is this library intended to be terminal-specific? If it is, there are many other things we should be doing differently. (And in any case, a TUI could perfectly well have such an interface)
And that's an actual text stream transform that happens: it is turned into a space
That's one possible implementation, but not the only one. Merely changing how the line-breaking character is rendered would also be perfectly reasonable. (Rendering it as zero-width would not be reasonable, however.)
As for line-wrapping, AIUI the extra logic needed to handle newlines is just an extension of the work needed to handle end-of-line spaces. For example, wrapping hello world
into lines of maximum width 5 should result in hello
/world
, not hello
/worl
/d
; the space character immediately following the break point is discarded, just as a newline at that position would be.
Is this library intended to be terminal-specific?
Not exactly, but it's more used that way.
That's one possible implementation, but not the only one.
Yes, but this behavior is most commonly seen in the wild as a text stream replacement. The choice of implementation has other implications, it's not purely internal.
I have seen the "treat linebreak as a 1-width character" behavior before without it being a text stream replacement but it's incredibly rare IME.
I have seen the "treat linebreak as a 1-width character" behavior before without it being a text stream replacement but it's incredibly rare IME.
In my testing, GTK and QT both seem to do it this way, so not that rare I would think?
Are they actually doing that or performing a text stream replacement on pasting?
Are they actually doing that or performing a text stream replacement on pasting?
The text stream is not replaced, no—copying back out of the field preserves the newline. (GTK shows a visible arrow glyph, QT a blank space)
Hmm, interesting. That does present a legitimate dilemma here.
Are they actually doing that or performing a text stream replacement on pasting?
The text stream is not replaced, no—copying back out of the field preserves the newline. (GTK shows a visible arrow glyph, QT a blank space)
This replacement only happens in single-line input fields though, right? So the line break needs to be replaced with something, or things will look broken. That's not really a fair comparison to the line-wrapping use case IMO.
One more point in favor of line breaks being width 1, is that nearly all applications will display them as having advance width when they are selected.
It's true that, for the purpose of determining line widths for line wrapping, line break characters should be treated as having width 0, not 1, and therefore line-wrapping logic needs to special-case them if we keep the present assignments. But, as I mention above, they are not the only kind of character that line-wrapping logic might need to special case. I mention word-separating whitespace above; the soft hyphen is another example. Programs performing line-wrapping will differ in how exactly they handle these characters, just as they will differ in which line-breaks they recognize (out of \n
\r\n
\r
etc); unicode-width's current behavior gives the right fallback for when the application does not define any special handling.
That's relatively convincing.
Perhaps we should document this explicitly; that applications shouldn't feed this crate strings with newlines unless they are ok with them being treated as being on the same line with width 1.
Ok, so what should crates that do line-wrapping do instead? Replace newlines with spaces before wrapping at a specified width, and treat input string as a single-line string? What about line-wrapping methods that preserve existing line breaks? Both of these will likely require more complicated logic and / or data copying than before.
@decathorpe process text one line at a time, splitting into lines first. I wouldn't actually replace anything, the idea would be to never ask this crate for computing the width of anything with a newline. The statefullness of this crate does not extend past newlines so if you still need to add them up afterwards you can. I don't think this would be an extra allocation/perf cost/copying, it will complicate the logic a little bit.
But as far as I can tell for line wrapping you need to pay attention to the newlines anyway
See https://github.com/unicode-rs/unicode-width/issues/55#issuecomment-2155112434
From Unicode:
Newline control characters (and other break characters) are typically "displayable by normal rendering", however they do not have a rendering that has anything that could be termed as a "width", so there's no single answer as to what this crate should provide here according to the spec.
Applications using
unicode-width
ought to be using a higher level protocol to split text on newlines (and other similar things) before feeding it to this crate (and perhaps we should document that). It isn't really possible for this crate to provide such a protocol itself since there are nuances to how different contexts handle newlines (Windows vs Unix being the most obvious one, but wrapping behavior also participates).However, we should pick some consistent, sensible default for what newlines are treated as. My guess is that we should treat them as width 0 (for both CR and LF, and any other line breaks that exist).
h/t/ @decathorpe for noticing
cc @Jules-Bertholet