Closed joshka closed 5 months ago
This is a behavior change and not considered breaking by this crate.
This is a behavior change
That's literally what semver suggests. You broke the implied contract that 0.1.12 should be compatible with 0.1.13 and in doing so broke our tests, which implies that downstream user's apps will be rendered differently.
"should be compatible with" has an extremely broad range of meanings. I'm careful to make sure that code will compile; but changing the code to have more accurate answers is not a breakage.
The Rust stdlib's semver policy (which is roughly what the ecosystem follows as well) talks about behavioral changes as a notion of contract. It really depends on the situation, and I am thoroughly unconvinced that this crate already has or needs such a strong stability contract.
This crate says it attempts to approximate the displayed width of characters. That's not a guarantee of stability; and we've (well, @Jules-Bertholet) been trying to improve the accuracy of these heuristics.
You are free to rely on such behavior in tests, but you do have to expect that tests will break some time due to this kind of thing. If you want to pin the behavior, pin the dependency.
There is an argument to be made for an more stable (not 100% stable: the unicode standard and data this relies on itself changes!) API that uses the east asian width properties only (@Jules-Bertholet: would this be easy to add again? I was somewhat uncomfortable about expanding the scope of the crate, but UAX 11 is pretty vague about this stuff anyway)
The Rust stdlib's semver policy (which is roughly what the ecosystem follows as well) talks about behavioral changes as a notion of contract. It really depends on the situation, and I am thoroughly unconvinced that this crate already has or needs such a strong stability contract.
from the link:
In general, APIs are expected to provide explicit contracts for their behavior via documentation, and behavior that is not part of this contract is permitted to change in minor revisions.
A changelog is the exact form of documentation that would have helped notify this change in behavior, but you've rejected adding one in https://github.com/unicode-rs/unicode-width/pull/56
the behavior change here is that the width of `"\u{1}" is now 1 instead of 0. It is however rendered as taking up no space as evidenced by the test output.
Working out what the change is caused by and why is next to impossible with the current approach (I have to look at every single commit that was made to work out relevance, and then backtrack to find a related PR). This is not ideal and there are easy ways that help consumers of this library work understand the changes more easily.
It is however rendered as taking up no space as evidenced by the test output.
This is highly highly dependent on the exact software stack used and is why this crate is explicitly documented as not necessarily matching actual rendered output.
Yes, the documentation could be clearer about how it approximates "displayed width", and why it is an ultimately futile effort to try and accurately predict display width from just codepoints. You've been lucky so far that your tests were dealing with cases that matched up.
If you want such an operation, this is not the crate for you, and furthermore, that operation does not exist.
(I've been meaning to write documentation on why that is the case for a while, but haven't had the time)
A changelog is the exact form of documentation that would have helped notify this change in behavior, but you've rejected adding one in https://github.com/unicode-rs/unicode-width/pull/56
I don't think that's what the RFC means by "documentation".
But seriously, this is absolutely not my experience with semver in the Rust community: this kind of behavior change is almost always fine. Of course making a function do something entirely different would be breaking, but changing functions to still continue to do what they are documented as doing. This function still does what it is documented as doing.
(I have to look at every single commit that was made to work out relevance, and then backtrack to find a related PR)
That's ... what you have to do with the changelog too? I rejected the changelog because it produces more or less the same output as scrolling through commits.
That's literally what semver suggests.
SemVer disagrees with the Rust community on what SemVer means..
To be clear - I'm using the Rust definition of semver, which treats 0.x.y the same way semver would treat x.y.0
https://doc.rust-lang.org/cargo/reference/resolver.html#semver-compatibility
This notion of “compatibility” is important because Cargo assumes it should be safe to update a dependency within a compatibility range without breaking the build.
Versions are considered compatible if their left-most non-zero major/minor/patch component is the same. ... This convention also applies to versions with leading zeros. For example, 0.1.0 and 0.1.2 are compatible, but 0.1.0 and 0.2.0 are not. Similarly, 0.0.1 and 0.0.2 are not compatible.
Sometimes a project may inadvertently publish a point release with a SemVer-breaking change. When users update with cargo update, they will pick up this new release, and then their build may break. In this situation, it is recommended that the project should yank the release, and either remove the SemVer-breaking change, or publish it as a new SemVer-major version increase.
...
If it looks like the third-party project is unable or unwilling to yank the release, then one option is to update your code to be compatible with the changes, and update the dependency requirement to set the minimum version to the new release.
This is the path that we are taking in https://github.com/ratatui-org/ratatui/pull/1171, but it's a poor one given the blast radius of this sort of breaking change. The changes here are:
The right thing to do is yank and publish 0.2, and document breaking changes like this. But as this issue is closed and we've worked around it for our particular use case, I'll leave pushing this issue forward to any of the 584 other direct dependencies of unicode-width that have actual breaks cause by this.
Our reliance on \0
and \u{1}
were just as being known zero-width characters to use in rendering tests that had to deal with how to render zero-width characters at various places in terminal UI. These particular characters changed width inbetween 0.1.12 and 0.1.13 for some reason.
Yeah, it changed because the Unicode standard recommends they be rendered unless they are interpretable in that context (which these days is almost never).
https://github.com/unicode-rs/unicode-width/pull/45#issue-2288103514
This crate implements Unicode standards.
There is an argument to be made for an more stable (not 100% stable: the unicode standard and data this relies on itself changes!) API that uses the east asian width properties only (@Jules-Bertholet: would this be easy to add again? I was somewhat uncomfortable about expanding the scope of the crate, but UAX 11 is pretty vague about this stuff anyway)
An API that uses the East_Asian_Width
property only would assign \u{1}
width 1 (and thus also break the offending Ratatui test); the change in the width of the control characters actually brought things closer to UAX 11. That being said if you want an API based on "simple" rules that don't change much, UnicodeWidthChar
is likely the best choice:
use unicode_width::UnicodeWidthChar;
pub fn dumb_width(s: &str) -> usize {
s.chars().map(|c| c.width().unwrap_or(0)).sum()
}
This crate implements Unicode standards.
This is the core of the issue. Terminal emulators assign semantics to the control characters that go beyond what Unicode specifies. This is totally correct and fine (extra semantics introduced by higher-level protocols is what the control characters are for). But it means that if your application interprets control characters according to such a higher-level protocol, you need to do that interpretation and processing before you pass your data to a Unicode library like unicode-width
.
@Jules-Bertholet since it's taken me so long to get around to writing this, would you be interested in adding documentation to the crate and readme that explain how and why this crate can only ever be an approximation, and then reference that in the individual APIs as a caveat when we mention "displayed width".
Given a lot of people interpret unicode-width
as "terminal width for this Unicode data (or however many cells it should take up in a cellular grid), or near enough" it seems like it would be germane to offer specific examples that address that exact use-case.
We have a pretty long issue that doesn't really have a resolution at https://github.com/ratatui-org/ratatui/issues/75 about what exactly to do with working out the character widths. It seems at least where the terminal is concerned there's no real solution that solves all the possible issues (and incompatible terminal implementations). As such until someone invests enough time in working through those issues, we're sticking with unicode-width for the purpose of working out where to draw cells (even if it's the wrong thing sometimes). I suspect this will be probably be forever the case.
I note there's also https://crates.io/crates/unicode-display-width, but even that has some problems that would take time to investigate to work out whether it's "better"
As such until someone invests enough time in working through those issues
yeah, people are working on this upstream at Unicode but progress is slow (https://www.unicode.org/L2/L2023/23194-text-terminal-wg-report.pdf), and it would still not result in an algorithm that solves all the problems since terminals are inconsistent (it would however give terminals better guidance on how to be consistnt).
we're sticking with unicode-width for the purpose of working out where to draw cells (even if it's the wrong thing sometimes). I suspect this will be probably be forever the case.
Sure, but please recognize that if this crate isn't precisely what you're looking for, this crate may do things that diverge from your expectations, and your tests may occasionally break.
It is definitely not the intent of this crate to provide some stable displayed width algorithm because such a thing does not exist.
Also - despite my comment about the solution being a poor one overall, it's definitely the right solution for Ratatui. Our tests should have used u+200B when dealing with zero-width character handling instead of \0 and u+0001, so despite the breakage, it was a breakage that caught an assumption of ours that was incorrect.
I can definitely see how that sort of thing makes sense to just release as a point release. In Ratatui's case have done something similar for a rendering change, but in doing so, we communicated the change in our changelog and breaking changes docs, and these flow through to the release notes of the release on GitHub, and from there also to any PRs raised by dependabot in downstream repos. These wouldn't be hidden away in git commits and PR comments like they are here.
documentation to the crate and readme that explain how and why this crate can only ever be an approximation, and then reference that in the individual APIs as a caveat when we mention "displayed width".
The README already has such a warning, which #53 modifies, but we should probably add something to the Rustdoc yes. And also some notes about terminal emulators
I note there's also https://crates.io/crates/unicode-display-width, but even that has some problems that would take time to investigate to work out whether it's "better"
Here are the important differences:
unicode-display-width
properly handles emoji skintone modifiers and ZWJ sequences, this crate doesn't yet (but #53 changes that)unicode-display-width
will treat spacing combining marks as non-advancing in most cases, while unicode-width
usually gives them non-zero width. For some scripts this makes unicode-display-width
unambiguously worse, for others both approaches are differently bad.unicode-display-width
doesn't handle ligatures that extend beyond a single grapheme cluster; unicode-width
currently handles one, and #53 adds a lot more (though it's impossible to be "perfect" here).unicode-display-width
will assign non-zero width to defective combining character sequences and nonstandard Korean jamo sequences, while unicode-width
usually assigns width 0 to these. unicode-display-width
's behavior is unambiguously more correct according to the text of the Unicode Standard. However, these are edge cases that should never occur in "properly formatted" text; and the unicode-width
behavior is more performant, and also leads to generally better results when calculating width character-by-character with UnicodeWidthChar
. (Terminal emulators will likely behave closer to the unicode-width
behavior)I note there's also https://crates.io/crates/unicode-display-width, but even that has some problems that would take time to investigate to work out whether it's "better"
Here are the important differences:
Thanks for documenting those. I think this probably makes unicode-width the right choice at least for now.
It looks like the changes in v0.1.13 are breaking a number of crates, not only ratatui. The CI we have in Fedora Linux started showing red for some of our Rust packages after we updated unicode-width to version 0.1.13:
There might be more, the CI hasn't yet checked all 3000 Rust packages we have, and quite a large number of them depend on unicode-width (~300).
This probably won't change your mind about "this is not a breaking change", but it should at least give some food for thought that the change here might have affected more projects than initially assumed.
good to know, thanks. In general yes, I maintain that changing behavior in such a way is not a breaking change, and I absolutely expect such tests to break; that's what tests are for. A fair number of packages use this crate for estimating displayed terminal width and if they test this behavior they can expect certain tests to not work as the crate is improved.
tui is definitely EOL - it's why ratatui exists. That's a pretty big blast radius as it effects every app built with the library (341 reverse deps on crates.io, 480 packages / 11,416 repos on github).
Regardless of whether this is a "breaking change", it's a change that breaks things. I'd urge you to please consider relaxing your focus on technical correctness here and focus on the actual impact.
I found an example where this change definitely results in a bug - the width of the \n
character is now considered to be 1 instead of 0. Giving control characters a width of 1 instead of 0 might make sense for most control characters, but it definitely doesn't for characters like \n
, which are handled by all terminals, and definitely not printed as width-one "garbage" characters or whatever.
I found an example where this change definitely results in a bug - the width of the
\n
character is now considered to be 1 instead of 0. Giving control characters a width of 1 instead of 0 might make sense for most control characters, but it definitely doesn't for characters like\n
, which are handled by all terminals, and definitely not printed as width-one "garbage" characters or whatever.
To be clear, this is based on the spec - see #45 and read the parts of the unicode spec that are relevant (Those bits are not as dense as some). Unicode 0.1.12 used a canonical display definition of what a control character's width was, unicode 0.1.13 uses the unicode spec display fallback definition and states that this is a compatible non-breaking change with 0.1.12. Regardless of whether this is correct or not, it's a change that significantly breaks code that relies on the previous behavior.
More succinctly stated:
good to know, thanks. In general yes, I maintain that changing behavior in such a way is not a breaking change, and I absolutely expect such tests to break; that's what tests are for. A fair number of packages use this crate for estimating displayed terminal width and if they test this behavior they can expect certain tests to not work as the crate is improved.
Crates should not generally be expected to test the behavior of a dependency like this. The broken tests you're seeing are likely just a small portion of broken behavior. In our case, our tests were wrong(-ish), and have since been fixed. But this means now we don't have any tests that would have picked up the behavior change and so the change is felt by the users of our library. The changes will affect the behavior of every single package that transitively relies on this package. With 300k daily downloads are you really sure that breaking this behavior is a net positive without notifying any of the packages that this changed?
I'd urge you to please consider relaxing your focus on technical correctness here and focus on the actual impact.
This misunderstands my position. I think the "impact"; "some people's tests are breaking", is quite acceptable, and even to some extent desired. If crates are testing for this kind of behavior, I want these tests to break, I want people to notice. A broken test does not automatically mean it is the upstream crate's problem, in this case the existence of the test may imply a certain model about this crate that is incorrect, and even if it isn't incorrect it implies the crates care about this in a way that is incompatible with what the crate does.
Ironically, when @Jules-Bertholet started making these changes I was somewhat against them because I wanted to stick to the original model that this crate only did East Asian Width. However, it was pretty clear to me over years issues filed on this repo by people buiding terminal stuff, that most users of this crate wanted this to handle more than just East Asian Width. It's clear from ratatui's usage that that's the way y'all use this crate too!
So honestly it's baffling to me that such a big deal is being made about what boils down to tests breaking. I am interested in knowing about it, and understanding the problems involved, but I so far have not been convinced of the severity of the issue, and feel that in general these changes serve the terminal crate community better.
As I've said before we are definitely lacking documentation on this kind of thing and I've wanted to improve it for a while, a lot of this is good signal for that. But I'm not convinced this is a large or undesirable impact.
Crates should not generally be expected to test the behavior of a dependency like this.
I mean, no, I think testing this kind of behavior is typically incorrect, but if you choose to do so, then the tests being broken seems like a fine signal.
(typically incorrect, with the caveat that I have definitely in the past found value for tests for behavior like this with the foreknowledge that such behavior may change, where I want to make sure the test broke for a "good reason" not a "bad reason")
The changes will affect the behavior of every single package that transitively relies on this package. With 300k daily downloads are you really sure that breaking this behavior is a net positive without notifying any of the packages that this changed?
yes. and I'm going to state this clearly: at this point you have been quite patronizing, preferring to strongly advocate for specific solutions rather than talk about problems, and repeatedly come off as using shame as a motivator. I have very little desire to deal with this and will not engage if this continues, potentially blocking you.
@decathorpe
I found an example where this change definitely results in a bug - the width of the
\n
character is now considered to be 1 instead of 0. Giving control characters a width of 1 instead of 0 might make sense for most control characters, but it definitely doesn't for characters like\n
, which are handled by all terminals, and definitely not printed as width-one "garbage" characters or whatever.
So, this is kinda nuanced. The standard says to treat ones that do not have a different representation as width 1. \n
does have a different representation, so it is not a slam dunk from the standard that this must be width 1.
However, \n
has a different representation that does not fit within the "what width is this?" ontology in the first place! Any application handling these would have a higher level protocol breaking the string here[^1] and never feeding this character directly to this algorithm. This is going to be true for other line breaks too.
I think we should definitely make a deliberate choice as to behavior here for all newlines, and be consistent (and document it). My guess is that they should go back to being 0 but I would like to see some discussion of that first. I'll file an issue. (edit: filed https://github.com/unicode-rs/unicode-width/issues/60)
[^1]: this crate cannot provide such an algorithm, since there are reasons to break a string that do not come from the contents of its text stream, like wrapping. Plus you have the Windows vs Unix newline issue.
yes. and I'm going to state this clearly: at this point you have been quite patronizing, preferring to strongly advocate for specific solutions rather than talk about problems, and repeatedly come off as using shame as a motivator. I have very little desire to deal with this and will not engage if this continues, potentially blocking you.
If I have offended in any way, please accept my apologies. That was not my intent. In the interest of finding a resolution on this problem, I'm going to bow out and leave it to other interested parties to continue. Thank you for investing your time and energy on this.
In ratatui, running cargo update from v0.1.12 to v0.1.13 results in the latter failing (something has changed in the way that unicode-width handles zero width strings
Error from our tests
--- STDOUT: ratatui buffer::buffer::tests::set_string_zero_width --- running 1 test test buffer::buffer::tests::set_string_zero_width ... FAILED failures: failures: buffer::buffer::tests::set_string_zero_width test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 1557 filtered out; finished in 0.00s --- STDERR: ratatui buffer::buffer::tests::set_string_zero_width --- thread 'buffer::buffer::tests::set_string_zero_width' panicked at src/buffer/buffer.rs:618:9: assertion `left == right` failed left: Buffer { area: Rect { x: 0, y: 0, width: 1, height: 1 }, content: [ "", ], styles: [ x: 0, y: 0, fg: Reset, bg: Reset, modifier: NONE, ] } right: Buffer { area: Rect { x: 0, y: 0, width: 1, height: 1 }, content: [ "a", ], styles: [ x: 0, y: 0, fg: Reset, bg: Reset, modifier: NONE, ] }
I'd suggest yanking 0.1.13 and publishing the latest release as 0.2.0 instead.