fix: unicode truncation bug

joshka commented 2 months ago

Rewrote the line / span rendering code to take into account how multi-byte / wide emoji characters are truncated when rendering into areas that cannot accommodate them in the available space
Added comprehensive coverage over the edge cases
Adds a benchmark to ensure perf

Fixes: https://github.com/ratatui-org/ratatui/issues/1032

joshka commented 2 months ago

I'd suggest that for this PR it's probably worth:

fixing up any obvious simplifications / problems I've missed
disabling the broken tests (unless there's a really obvious test fix for the right alignment one)
merging this and making a 0.26.3 release pretty soon.
If there's an easy and obvious way that unicode-truncate helps simplify this, then we should consider that

Note this stems from https://github.com/ratatui-org/ratatui/pull/1082 credit goes to @EdJoPaTo for making it easier to see a way to make a fix for this.

Side note: there's also a truncate crate, which does various unicode split_at things, but it uses unicode-segmentation instead of unicode-width for calculating positions - there's some subtle incompatibilities with the approaches. The ideas / methods might be useful to consider for unicode-truncate however.

EdJoPaTo commented 2 months ago

Side note: there's also a truncate crate, which does various unicode split_at things, but it uses unicode-segmentation instead of unicode-width for calculating positions - there's some subtle incompatibilities with the approaches. The ideas / methods might be useful to consider for unicode-truncate however.

I also suggested unicode-segmentation for unicode-truncation as it currently rips apart stuff that belongs together like Emoji combinations (Flags for example): https://github.com/Aetf/unicode-truncate/pull/11

joshka commented 2 months ago

Side note: there's also a truncate crate, which does various unicode split_at things, but it uses unicode-segmentation instead of unicode-width for calculating positions - there's some subtle incompatibilities with the approaches. The ideas / methods might be useful to consider for unicode-truncate however.

I also suggested unicode-segmentation for unicode-truncation as it currently rips apart stuff that belongs together like Emoji combinations (Flags for example): Aetf/unicode-truncate#11

There's an EXTREMELY lengthy debate on to use / not use segmentation in https://github.com/ratatui-org/ratatui/issues/75 (ratatui's longest open issue). I don't know which side of that debate has more merit, and it's likely that using unicode-segmentation within the truncation crate would break things in other ways for many terminals / use cases.

joshka commented 2 months ago

There's an EXTREMELY lengthy debate on to use / not use segmentation in #75 (ratatui's longest open issue). I don't know which side of that debate has more merit, and it's likely that using unicode-segmentation within the truncation crate would break things in other ways for many terminals / use cases.

Also, I forgot that we do use segmentation when doing grapheme splitting, so maybe that is the right abstraction. I'm unsure how exactly to test an edge case on this one without delving a vast amount more into unicode than I want to right now.

codecov[bot] commented 2 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 94.1%. Comparing base (aa4260f) to head (d415fe9). Report is 1 commits behind head on main.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #1089 +/- ## ===================================== Coverage 94.1% 94.1% ===================================== Files 61 61 Lines 14619 14664 +45 ===================================== + Hits 13764 13809 +45 Misses 855 855 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

joshka commented 2 months ago

I've updated the tests so that the document the current behavior even if it's wrong (as noted in the first post). This is better than panicking, so I'd suggest merge, release, and keep a bug open to fix the more minor issue.

kdheepak commented 2 months ago

The code is very readable and merging this even with bugs is good to me! I like that there are tests that capture the behavior.

I don't quite understand why this works (haven't had a lot of time to go through it in detail). Maybe the PR description can have a high level description of why this works?

Unrelated to this PR but it would nice to collect links to resources on the developer guide sector of website about Unicode that for uninitiated potential contributors to read.

joshka commented 2 months ago

I don't quite understand why this works (haven't had a lot of time to go through it in detail). Maybe the PR description can have a high level description of why this works?

Added an algorithm comment to help in the code

EdJoPaTo commented 2 months ago

Added heavily inspired by unicode-truncate code to find the index in order to get the starting index. This allows for taking a reference rather than cloning into an ever-more-allocating String.

I think this would be a great addition for unicode-truncate especially with all the related test cases specific for this.

Some test cases expect " c" with a space in the beginning where the multi width Unicode was. This requires padding → Owned data → has a bigger performance impact. But it's something for step 2, ideally even solved within unicode-truncate too.

(Mentioning @Aetf here as they might be interested in this discussion)

Did not remove the split_at method which obviously annoys the CI.

Benchmark after the change. (Results on a Raspberry Pi 3 are similar while taking ~10 times the time)

line_render/0           time:   [82.312 ns 82.321 ns 82.329 ns]
                        change: [-13.630% -13.444% -13.279%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 148 outliers among 1000 measurements (14.80%)
  100 (10.00%) low severe
  3 (0.30%) low mild
  16 (1.60%) high mild
  29 (2.90%) high severe
line_render/3           time:   [179.75 ns 179.88 ns 180.00 ns]
                        change: [-26.027% -25.958% -25.891%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 1000 measurements (0.40%)
  3 (0.30%) high mild
  1 (0.10%) high severe
line_render/4           time:   [202.37 ns 202.48 ns 202.60 ns]
                        change: [-24.577% -24.503% -24.432%] (p = 0.00 < 0.05)
                        Performance has improved.
line_render/6           time:   [262.62 ns 262.75 ns 262.88 ns]
                        change: [-20.892% -20.719% -20.522%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 1000 measurements (0.50%)
  3 (0.30%) high mild
  2 (0.20%) high severe
line_render/7           time:   [281.92 ns 282.05 ns 282.20 ns]
                        change: [-20.378% -20.302% -20.212%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 25 outliers among 1000 measurements (2.50%)
  19 (1.90%) high mild
  6 (0.60%) high severe
line_render/10          time:   [359.13 ns 359.72 ns 360.49 ns]
                        change: [-13.931% -13.808% -13.652%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 32 outliers among 1000 measurements (3.20%)
  20 (2.00%) low mild
  5 (0.50%) high mild
  7 (0.70%) high severe
line_render/42          time:   [518.82 ns 518.92 ns 519.03 ns]
                        change: [-1.9234% -1.8597% -1.7990%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 1000 measurements (0.60%)
  1 (0.10%) low mild
  1 (0.10%) high mild
  4 (0.40%) high severe

EdJoPaTo commented 2 months ago

Some test cases expect " c" with a space in the beginning where the multi width Unicode was. This requires padding → Owned data → has a bigger performance impact. But it's something for step 2, ideally even solved within unicode-truncate too.

Oh… it actually doesn't in our use case as we can just increase the x += n and continue render further right (Buffer::set_stringn does something similar).

EdJoPaTo commented 2 months ago

I think there might be some general assumed area.x == 0 in here due to code that uses the span_offset. This works for the tests if the line is rendered at x=0 but i think it might break when that's not the case (for the partial rendered span part of the line at least).

Yeah, this fails currently: 0a55e7618d2422377fe5b3704a81bbdb06a26c00

joshka commented 2 months ago

All tests passing (and corrected so that the edge cases seem like they're the right result) E.g. right aligned text that does not have enough space to render the first emoji will not overwrite any first character(s) in that position.

I think this is probably good to go now.

Also removed the split_at / skip functions from span. We should consider adding those later as a separate PR in a tested, documented, public manner.

joshka commented 2 months ago

There's another possible simplification (for future consideration on this) that teases apart the span selection and span rendering. Something like (pseudo code):

fn render_ref(...) {
  for span in self.visible_spans(...) {
    span.render_ref(area, buf)
    let area = area.indent(span.width())
    if area.is_empty() { break }
  }
}

fn visible_spans(...) -> Iterator<&Span> {
  skip non-visible spans
  chain!(truncated first span, rest of spans)
}

I experimented briefly with this, but I think it's only possible if visible_spans returns owned Spans instead of refs due to the truncation - it's worth another try (in a future PR).

joshka commented 2 months ago

I'll wait on @EdJoPaTo's approval to merge this

EdJoPaTo commented 2 months ago

Out of curiosity I enabled all pedantic and nursery lints for the affected files. They were annoyed about possible truncations and I found a bug with long lines because of that. Reason: When .width() as u16 doesn't fit inside u16 this gets wrong. Doesn't panic so at least that's good. Adding a test case and fixing it.

(Should probably continuing my journey of refactoring ratatui to enable more and more lints)

EdJoPaTo commented 2 months ago

The logic is somewhat horribly complex as a lot of stuff isn't intuitive… The widths are usize and not u16 as the truncation of the end is implicitly done. That resulted in misunderstandings in what should happen. I don't think we can fully explain the code by comments. We need many tests to ensure it actually does what it is expected to do. Lines/Spans longer than u16 for example is a whole new mess which I haven't thought about before.

Also, I refactored the code to iterator logic. I'm fine with reverting it. Benchmarks suggest it's similar in performance. But it might be even harder to follow what is going on?

joshka commented 2 months ago

Some final tweaks:

It's important to look the absolute magnitude of a perf gain, and not just the relative amount. On my M2 mac, removing the fast path costs ~30ns. For an app running at an absolutely absurd 1000FPS, the app would now run at 999.97 FPS. This isn't worth keeping the code to optimize.

Remove perf fast path for left aligned lines as the absolute perf diff is negligible (30ns on an M2 Mac)
move the visible spans calculation to a method that returns an iterator over the spans that are visible
replace truncation clippy lints with converions that clamp to u16::MAX
replace arithmetic_side_effects clippy lint with saturating_sub
remove debug_asserts
rewrite comments in the precise imperative mood ("when this, then that", rather than "if this then it's necessary / we only need to")

joshka commented 2 months ago

I'd like to get this released (perhaps Monday if @orhun is up to it).

Direct link to the current file as of the last commit: https://github.com/ratatui-org/ratatui/blob/43118c5494f5218332319a41eb742a51138d7e3c/src/text/line.rs#L555

It somewhat worries me that new test cases pop up and break stuff even when we thought we are done and only need documentation stuff. But it's way better than the current state of random panics.

I'm not too worried - most text that's rendered in the context of a tui app isn't > u16::MAX, and when it is the resultant bug is easy to fix by pre-trimming the text instead of doing this in the rendering. (There's an actual bug in turbo that is similar to this problem, but with Rect::area).

The logic is somewhat horribly complex as a lot of stuff isn't intuitive… The widths are usize and not u16 as the truncation of the end is implicitly done. That resulted in misunderstandings in what should happen. I don't think we can fully explain the code by comments. We need many tests to ensure it actually does what it is expected to do. Lines/Spans longer than u16 for example is a whole new mess which I haven't thought about before.

The complexity of the implementation is now pretty much aligned with the inherent complexity of the problem, and it's documented well enough.

The approach of debug_asserts is somewhat I like here as it ensures that assumptions are actually true and documents these assertions in the process. And they only panic on debug, so production code will only be displaying stuff wrong, not panicking. Also, we have many test cases and users of ratatui will provide more when they run their stuff as dev builds.

I replaced these with infallible code. I don't like debug_asserts at all, especially in a library. When you're wrong and they fail, you're slapping your user's users with a pain point which is hard to report, and which takes a release of both projects in order to fix.

EdJoPaTo commented 2 months ago

I'm not too worried - most text that's rendered in the context of a tui app isn't > u16::MAX, and when it is the resultant bug is easy to fix by pre-trimming the text instead of doing this in the rendering.

I am not worried about the actual bug that came up. I am more worried that 3 people looked over it for days and missed yet another bug.

I replaced these with infallible code.

What I liked about it were the assumptions actually tested. We wrote code with assumptions which are now hidden behind infallible code. But with all the tests, fine by me.

Remove perf fast path for left aligned lines as the absolute perf diff is negligible (30ns on an M2 Mac)

I am not at all worried about way overpowered hardware. I know that mqttui is used on Raspberry Pi, personally I regularly use mqttui on Raspberry Pi 1.

Line is one of the core constructs used by a lot of things. Relatively small improvements will impact a lot of code depending on it. So it's way more significant to improve the performance here. And also the Alignment::Left is the default for most things so it has the most impact.

Even when it's not about lower power devices or hardware that is years old (My main device is ~9 years old now). It saves CPU cycles and therefore battery life.

EdJoPaTo commented 2 months ago

To pick even more on the performance fast path of Alignment::Left on slower devices…

Let's imagine a table rendered on a 100x100 terminal. Assume 5 columns and 90 rows. The table cell used Text, so many Lines. More cells vs. more lines per cells are probably irrelevant.

90 * 5 means 450 visible Cells to be rendered. The worst case time in the benchmark below is the Buffer wider than the Line width with 9.7 µs. Let's assume 10 µs for ease of calculation. 10 µs * 450 = 4.5 ms. 60 FPS are 16.66 ms. So we took 1/4 of that for the rendering of lines alone.

And this is a Raspberry Pi 2 benchmark, I use mqttui on Raspberry Pi 1 regularly.

Even an improvement of 12% is really good here already. The benchmark goes to a regression of 36%. So this is definitely significant. For the default of left alignment which is the most often used one.

Benchmark on Raspberry Pi 2:

line_render/Left/3      time:   [2.7422 µs 2.7427 µs 2.7433 µs]
                        change: [+34.945% +36.396% +37.329%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 68 outliers among 1000 measurements (6.80%)
  48 (4.80%) high mild
  20 (2.00%) high severe
line_render/Left/4      time:   [2.9284 µs 2.9290 µs 2.9297 µs]
                        change: [+33.260% +33.549% +33.885%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 23 outliers among 1000 measurements (2.30%)
  13 (1.30%) high mild
  10 (1.00%) high severe
line_render/Left/6      time:   [4.7153 µs 4.7158 µs 4.7164 µs]
                        change: [+23.409% +24.391% +24.993%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 63 outliers among 1000 measurements (6.30%)
  30 (3.00%) high mild
  33 (3.30%) high severe
line_render/Left/7      time:   [4.9508 µs 4.9513 µs 4.9518 µs]
                        change: [+23.962% +24.193% +24.427%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 44 outliers among 1000 measurements (4.40%)
  24 (2.40%) high mild
  20 (2.00%) high severe
line_render/Left/10     time:   [7.0004 µs 7.0016 µs 7.0030 µs]
                        change: [+16.254% +16.648% +16.980%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 46 outliers among 1000 measurements (4.60%)
  21 (2.10%) high mild
  25 (2.50%) high severe
line_render/Left/42     time:   [9.7193 µs 9.7207 µs 9.7224 µs]
                        change: [+12.279% +12.559% +12.887%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 131 outliers among 1000 measurements (13.10%)
  57 (5.70%) low mild
  35 (3.50%) high mild
  39 (3.90%) high severe

Benchmark of Raspberry Pi 4: regression of 14.3 to 38.7%. Worst case takes ~2.03 µs, so ~5 times faster than Pi 2.

I will not benchmark this on Pi 1 as running the benchmarks on Pi 4 take ~10 min, on Pi 2 ~40 min. Pi 1 is single core in comparison to Pi 2 with 4 cores.

joshka commented 2 months ago

10 µs * 450 = 4.5 ms. 60 FPS are 16.66 ms

Let's look at the absolute magnitude of the change though: (10 * .125) = 1.25us per call

1.25*450 = 0.563ms, so each frame now takes (16.667+0.563) = 17.23ms = 58 FPS. That's measurable sure, but rarely directly noticeable. Assuming a factor of 4 on the Pi 1, 18.92ms = 53 FPS.

One thing I did note when removing the fast path was that the slow paths got slightly faster (1-4%), so the tradeoff is one that doesn't necessarily have a definitively good answer.

For now, I'd like to keep it simplified, and keep this PR about fixing the bug and not trying to eek out all the performance of this. You've convinced me that there are some parts of this that matter to you, so let's continue the discussion about perf in an issue / forum topic if it's more a discussion.

Pi 1 is single core in comparison to Pi 2 with 4 cores.

This raises an interesting point. I wonder (not for this PR), what parts of our rendering pipeline might be able to be done more in parallel.

ratatui-org / ratatui

fix: unicode truncation bug #1089

Codecov Report