rivo / uniseg

Unicode Text Segmentation, Word Wrapping, and String Width Calculation in Go
MIT License

Improve emoji consistency with older terminals #57

Open mikelorant opened 4 months ago

mikelorant commented 4 months ago

For accurate rendering of emojis, it is important that both the terminal and the library are consistent in determining the width of emojis. Unicode 14 clarified that the width of emoji presentation using variation selector 16 was double width. This change has created a problem.

Many terminals (such as macOS Terminal, Alacritty, Hyper, VS Code) do not support the latest Unicode standard. This means they may not display newer grapheme clusters correctly. However, an even bigger problem is that they render older emojis differently, especially those using variation selector 16.

One of the guiding principles of uniseg is @rivo's aim for perfection. uniseg feels like a reference implementation and helps identify problems with other implementations. But users don't care about perfection, they care about compatibility.

This puts uniseg in a difficult place. While uniseg is correct, from a compatibility perspective, it is providing different results to the majority of terminals. This creates a poor experience as the only option is to tell the developer of the terminal to upgrade their handling of Unicode. With many terminals dependent on xterm.js this is nearly impossible for them to fix.

Can we find a way to support older terminals without trying to support multiple Unicode versions?

A global option to override how variation selector 16 is handled is obviously one approach. The precedent has already been set with EastAsianAmbiguousWidth. It doesn't change much but would solve the biggest rendering difference. iTerm2 provided an advanced option specifically for this case. WezTerm has an option for choosing which Unicode standard is used.
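To make the first idea concrete, here is a minimal sketch of a global switch in the spirit of EastAsianAmbiguousWidth. Everything in it is an assumption for illustration: `LegacyVS16`, `compatWidth` and `toyMeasure` are hypothetical names that do not exist in uniseg, and `toyMeasure` is a stand-in for a real width function that merely models "VS16 widens the preceding character by one cell".

```go
package main

import (
	"fmt"
	"strings"
)

// LegacyVS16 is a hypothetical package-level switch, modelled on the
// EastAsianAmbiguousWidth precedent. It is not part of uniseg.
var LegacyVS16 = false

// vs16 is VARIATION SELECTOR-16 (U+FE0F).
const vs16 = "\uFE0F"

// compatWidth measures s with the supplied width function. When LegacyVS16
// is set, VS16 is stripped first, so the base character is measured in its
// default presentation, which is roughly what older terminals do.
func compatWidth(s string, measure func(string) int) int {
	if LegacyVS16 {
		s = strings.ReplaceAll(s, vs16, "")
	}
	return measure(s)
}

// toyMeasure is a stand-in for a real width function (assumption: every rune
// is one cell and VS16 adds one cell to the character before it).
func toyMeasure(s string) int {
	w := 0
	for range s {
		w++
	}
	return w
}

func main() {
	heart := "\u2764\uFE0F" // HEAVY BLACK HEART followed by VS16

	fmt.Println("standard width:", compatWidth(heart, toyMeasure))

	LegacyVS16 = true
	fmt.Println("legacy width:  ", compatWidth(heart, toyMeasure))
}
```

In a real application the `measure` argument would be whatever width function the application already calls. The switch only changes what that function gets to see.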

Another approach (for which I have created a proof of concept) is the ability to override the result for specific code points. Implementing this as a global (with all the downsides that brings) allows all dependencies that rely on uniseg to return the same consistent results. While my implementation is crude, it should demonstrate the idea. The benefit is that this would shift compatibility overrides to the applications that use uniseg, so the tech debt stays out of uniseg.
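Roughly, the override idea can be sketched as a lookup that runs before the normal width calculation. This is not the proof of concept itself. `widthOverrides`, `SetWidthOverride` and `clusterWidth` are hypothetical names, and keying the table by whole grapheme cluster (rather than by single code point) is an assumption made here so that VS16 sequences can be targeted.

```go
package main

import "fmt"

// widthOverrides is a hypothetical global table mapping a grapheme cluster
// to the width the application wants reported for it. Like any global map,
// it is not safe for concurrent writes and affects every caller in the process.
var widthOverrides = map[string]int{}

// SetWidthOverride registers a width for one specific cluster.
func SetWidthOverride(cluster string, width int) {
	widthOverrides[cluster] = width
}

// clusterWidth returns the registered override if there is one and
// otherwise defers to the fallback width function.
func clusterWidth(cluster string, fallback func(string) int) int {
	if w, ok := widthOverrides[cluster]; ok {
		return w
	}
	return fallback(cluster)
}

func main() {
	// Stand-in for a standards-compliant width function (assumption).
	standard := func(string) int { return 2 }

	heart := "\u2764\uFE0F" // HEAVY BLACK HEART followed by VS16

	fmt.Println("before override:", clusterWidth(heart, standard))

	// The application, not the library, decides that its users' terminals
	// draw this sequence in a single cell.
	SetWidthOverride(heart, 1)
	fmt.Println("after override: ", clusterWidth(heart, standard))
}
```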

I'd certainly understand if this was marked as WONTFIX. Clearly this is not a problem with uniseg, but it is something that creates problems for Go applications that rely on uniseg and have users on older terminals. Maintaining my fork with the hack is fairly easy, but I felt that opening an issue to discuss this might be appropriate.

ohir commented 3 months ago

@mikelorant

> A global option to override how variation selector 16 is handled is obviously one approach.

As an application author I would suggest to whiners that they raise an issue with their terminal (or OS) provider to get them to update their product. Users should be steered away from software, esp. terminals, that does not conform to the established standards. Full stop. This includes projects that lag on updating from the newly published UCD tables (which should be done by the CI script anyway).

2¢, TC :)

mikelorant commented 3 months ago

@ohir You are absolutely correct. However, when it comes to macOS Terminal, that is one big gorilla to deal with 😢

Be aware that the majority of terminals don't actually handle the Unicode 14 clarifications. There are more wrong than right terminals. That's a lot of issues to create and in the meantime, we have a library our applications use deviating from how many terminals render. Being right isn't always the best result.

ohir commented 3 months ago

> There are more wrong than right terminals. That's a lot of issues to create and in the meantime, we have a library our applications use deviating from how many Terminals render.

I am aware of that sad state of affairs. But workarounds (and chores) should belong to the user of the standards-implementing lib: in their app or in a wrapper lib.

> we have a library our applications use deviating from how many Terminals render.

No. We have a library that properly implements a standard others have got wrong. Many wrong implementations call for a correcting wrapper, not for stuffing corrections into the lib that righteously (afaik) claims compliance.

mikelorant commented 3 months ago

You make many good points. I can't disagree with anything you are saying.

The idea of a correcting wrapper is definitely something worth discussing further. How could something like this be implemented? I think this is basically what I am after.

ohir commented 3 months ago

> how could something like this be implemented?

It depends on whether you're after emojis only or after a full standards-compliant display (i.e. having Indic scripts measured correctly on non-standards-compliant input). For the former, I'd make an independent (sub)parser, compliant with the emoji BNF, to flag “suspicious” segments, plus a final bitmap-based filter to decide how a particular sequence will render given the OS/terminal/UCD version info. Plus, likely, a bit of corner-case handling, like the black cat abomination and (Apple's) directional additions.
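As a very rough illustration of the first stage only, the sketch below flags clusters that deserve a second look. It is a crude stand-in, not a BNF-compliant parser: `isSuspicious` is a hypothetical name, and the set of code points it checks (VS16, zero width joiner, enclosing keycap, skin tone modifiers) is an assumption about which sequences tend to render inconsistently.

```go
package main

import "fmt"

// isSuspicious reports whether a grapheme cluster contains a code point
// that commonly changes emoji presentation or joins emoji together.
// A real implementation would parse the emoji sequence grammar instead.
func isSuspicious(cluster string) bool {
	for _, r := range cluster {
		switch {
		case r == 0xFE0F: // VARIATION SELECTOR-16
			return true
		case r == 0x200D: // ZERO WIDTH JOINER
			return true
		case r == 0x20E3: // COMBINING ENCLOSING KEYCAP
			return true
		case r >= 0x1F3FB && r <= 0x1F3FF: // emoji skin tone modifiers
			return true
		}
	}
	return false
}

func main() {
	clusters := []string{
		"a",                            // plain letter
		"\u2764\uFE0F",                 // heart + VS16
		"\U0001F468\u200D\U0001F469",   // two emoji joined with ZWJ
		"\U0001F44D\U0001F3FD",         // thumbs up + skin tone modifier
	}
	for _, c := range clusters {
		fmt.Printf("%q suspicious: %v\n", c, isSuspicious(c))
	}
}
```

Only the clusters flagged here would need to go on to the environment-specific filter. Everything else can be measured by the standards-compliant library directly.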

Though, if speed does not matter, the simple map check you proposed can be a simpler and still valid solution. The really hard bit is filling that map with data: m×n×v entries, for m terminals, n of their versions in use, and v of the 3 or 4 most recent UCD data versions.
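The shape of such a table might look like the sketch below. All names (`env`, `overrides`, `lookup`) and all data in it are made up for illustration. No real terminal's behaviour is recorded here, and gathering that data is exactly the hard part.

```go
package main

import "fmt"

// env identifies one rendering environment: a terminal, one of its versions,
// and the UCD version its width tables are assumed to follow.
type env struct {
	Terminal string
	Version  string
	UCD      string
}

// overrides maps environment -> grapheme cluster -> width in cells.
// The single entry is placeholder data, not a measured result.
var overrides = map[env]map[string]int{
	{Terminal: "exampleterm", Version: "1.0", UCD: "13.0"}: {
		"\u2764\uFE0F": 1, // heart + VS16 drawn in one cell (hypothetical)
	},
}

// lookup returns the recorded width for a cluster in a given environment.
// The boolean is false when nothing is recorded, in which case the caller
// should fall back to the standards-compliant width.
func lookup(e env, cluster string) (int, bool) {
	w, ok := overrides[e][cluster]
	return w, ok
}

func main() {
	old := env{Terminal: "exampleterm", Version: "1.0", UCD: "13.0"}
	unknown := env{Terminal: "otherterm", Version: "2.0", UCD: "15.0"}

	for _, e := range []env{old, unknown} {
		w, ok := lookup(e, "\u2764\uFE0F")
		fmt.Printf("%+v: width=%d recorded=%v\n", e, w, ok)
	}
}
```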

That said, a full corrections wrapper, UCD-based, would of course be better. But let's not pollute uniseg issues discussing a wrapper; you can reach me at gmail as ohir.ripe

TC