muesli / reflow

A collection of (ANSI-sequence aware) text reflow operations & algorithms
MIT License
642 stars 40 forks source link

Replace go-runewidth with uniseg #73

Closed mikelorant closed 4 months ago

mikelorant commented 8 months ago

There are a number of issues with the go-runewidth package that makes it problematic when used with emojis. Working with runes is a flawed approach as explained by the author of uniseg:

That's because a single code point (rune) is not always a complete character. That's the whole basis of the uniseg package. The README explains this in detail. One example given there is the country flag emoji. These emojis must always consist of two runes. What's the width of only one of these runes then? It's basically undefined because it's an incomplete grapheme cluster. The mattn/go-runewidth package gets this fundamentally wrong and I have tried for a long time to help them do it right but there was never much interest in following up on it. https://github.com/rivo/uniseg/issues/48#issuecomment-1904871545

All references to RuneWidth and StringWidth need to be replaced to improve the width calculations when dealing with emojis.

It is trivial to replace StringWidth from go-runewidth to uniseg as they provide the same input and output. There is also a performance increase of 4x migrating to using the uniseg version so there are some very clearly benefits here.

The complexity lies with what to do about the RuneWidth function. The right choice will be stop thinking in runes and switch over to grapheme clusters. This will require logic changes to any function using runes.

From an acceptance criteria perspective, the goals should be:

  1. Remove go-runewidth as a dependency.
  2. No regressions in performance.

We can expect different results for some of the calculations but these should be considered improvements and bug fixes. It is likely this will cause breaking changes to applications that depend on this package. This may warrant a major version bump to reflect that. We should also make it clear in the documentation that the calculations involving some emojis will have changed.

mikelorant commented 8 months ago

Pull request #71 addresses this issue.

As mentioned in the pull request, there are performance regressions we need to look into. This may be a case of increased complexity means more CPU cycles and we may never be able to achieve the original performance, however this needs to be thoroughly investigated before we accept these performance reductions.

mikelorant commented 8 months ago

@muesli Any guidance you can provide on how we should tackle the breaking nature of these changes?

@rivo Seeking any help you can provide in helping us track down the performance issues.

@meowgorithm This will have a ripple effect on many of the other charmbracelet libraries and applications. I expect regressions to occur in bubbles and lipgloss. Once we work through this one, we can then move onto those repositories and apply similar changes.

mikelorant commented 8 months ago

I believe all the work for this task is complete and we just need some review of the pull request. Hoping @muesli will be able to provide his feedback to help move this work closer to being merged.

mikelorant commented 8 months ago

@muesli Any further thoughts regarding this? All direct references to go-runewidth have now been removed from Lipgloss, however it still has an indirect reference due to reflow which pulls in this package. This is now the blocker for having accurate emoji support in Bubbletea apps.

Heads up for @meowgorithm that we are making serious progress in dealing with emoji and unicode alignment issues. Nearly there 🤞

muesli commented 8 months ago

Nearly there indeed! I'm currently testing your changes with a bunch of "historically" problematic (edge) cases and in various terminals. Thanks for the all the effort you put into this.

mikelorant commented 8 months ago

Is there anything I can do to help? I realise there is still a lot more work to do to get the Charm suite of tools working better with Unicode. I feel this is only the beginning of the process and know I likely have to move onto many more repositories to do changes for.

Feel free to allocate any of the "wood chopping" work. Even if it is just adding a ton of test cases, I am willing to step in and pick up some of the boring but necessary work.

mikelorant commented 8 months ago

Regarding your edge cases with terminals, have you seen these 2 blog entries?

https://www.jeffquast.com/post/ucs-detect-test-results/ https://www.jeffquast.com/post/terminal_wcwidth_solution/

This all leads back to this repository: https://github.com/jquast/ucs-detect/

The results of terminals are listed here: https://ucs-detect.readthedocs.io/results.html

You can test to the results directly against the terminal with this: https://github.com/jquast/wcwidth/blob/master/bin/wcwidth-browser.py

Unfortunately, it has become obvious that while many libraries are doing a great job of evaluating printable widths, the terminals themselves have been causing issues as well. Hopefully the results of ucs-detect will help terminal developers improve their results.

rivo commented 8 months ago

Looks like @jquast reinvented grapheme clustering in his "Specification". Unfortunately, he's also missing some things. For example, U+FE0E (VS15) changes the width of an emoji from 2 to 1:

image

From what I can see, his algorithm would return 2 for both of these characters.

I don't blame him, though. Even Chrome (at least on macOS) gets this wrong, Firefox renders it correctly. iTerm2 (which I used for this screenshot) renders the emojis correctly but also assigns a width of 2 to both of them.

As you noted, and as evidenced by his study results, there are lots of variations among terminals. This is the result of everyone inventing their own algorithm or simply copying such an invention (many of them the original wcwidth implementation). Unfortunately, many of them simply ignore Unicode standards and try to approximate them differently (e.g. by just using the EastAsianWidth property).

I had a discussion about the example above elsewhere with the authors of the VS Code editor:

That might be reasonable, but it is explicitly contrary to the EastAsianWidths specification. You also suggest that Pictographics fullowed by text-presentation should have width 1. That seems reasonable, though I don't implement that.

They find my suggestions "reasonable" but then still refuse to implement them. Unless the Unicode team publishes a formal specification for character widths, we will always have this messy and, shall I say, ugly landscape.

jquast commented 8 months ago

Hello @rivo, thanks for attention at this matter.

You are right, that I should return a final width of 1 for U+231B, U+FE0E, as suggested by https://unicode.org/Public/15.1.0/ucd/emoji/emoji-variation-sequences.txt

I will add this to the specification and add a VS-15 test to the ucs-detect tool. Thank you.

mikelorant commented 4 months ago

Closing this issue as this problem has been resolved for the Charm suite of tools by the creation of the new ansi package. This package has already replaced most references to reflow in Bubble Tea,Lip Gloss and Bubbles.