Please consider providing a variant that treats grapheme clusters "incorrectly" as many terminals do

joshtriplett commented 1 month ago

Terminals vary in their handling of grapheme clusters, such as ZWJ-based emoji clusters. See https://mitchellh.com/writing/grapheme-clusters-in-terminals for details, and https://mitchellh.com/writing/grapheme-clusters-in-terminals#terminal-comparison in particular for a table comparing terminal behavior.

Handling (for instance) ZWJ-based emoji clusters as a single emoji is the correct behavior, but many terminals do not do this. The "mode 2027" proposal mentioned in the link above would be a way for terminal applications to know and control what the terminal they're running on does with grapheme clusters. However, that proposal isn't widely supported yet.

Please consider providing a variant of the string width function, similar to width_cjk, which people can invoke if they know the terminal they're running on has the incorrect-but-common behavior. Short-term, terminal applications might offer configurability for this; long-term, terminals will hopefully support "mode 2027" and applications can detect/control this support to decide which width to use.

In terms of naming, perhaps something like width_no_cluster_handling? This doesn't particularly need to be a short name; applications that need to handle both cases will presumably create a wrapper that calls one or the other function at runtime, and applications should almost certainly not be hardcoding an assumption of terminals having the incorrect-but-common behavior.

Manishearth commented 1 month ago

I'd be rather surprised if there is a single consistent "incorrect" way of handling things common across terminals. I've seen wildly inconsistent behavior here, and typically some subset of clustering is handled, though rarely emoji.

I'm not really in favor of attempting to implement something like this.

joshtriplett commented 1 month ago

While there are some variations in behavior, the two most common behaviors for emoji are "display each emoji separately, don't display the ZWJ" and "display the single combined emoji". Even just covering those two variations would be helpful.
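
To make the two emoji behaviors concrete, here is a minimal sketch. It uses only the standard library and assumes a toy width table: `ZwjMode`, `toy_char_width`, and `toy_width` are hypothetical names for illustration, not part of unicode-width, and the emoji range is a rough approximation rather than real Unicode data.

```rust
/// How the target terminal treats ZWJ emoji sequences (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ZwjMode {
    /// The terminal draws a single combined emoji.
    Combined,
    /// The terminal draws each emoji separately and does not display the ZWJ.
    Separate,
}

const ZWJ: char = '\u{200D}';

/// Toy stand-in for a real width table (an assumption, not Unicode-accurate):
/// ZWJ is zero-width, a rough emoji range is two cells, everything else is one.
fn toy_char_width(c: char) -> usize {
    match c {
        ZWJ => 0,
        '\u{1F300}'..='\u{1FAFF}' => 2,
        _ => 1,
    }
}

/// Sums per-scalar widths; in `Combined` mode a scalar that directly
/// follows "base + ZWJ" is folded into the preceding cell.
fn toy_width(s: &str, mode: ZwjMode) -> usize {
    let mut total = 0;
    let mut seen_base = false; // a non-ZWJ scalar has been seen
    let mut joined = false; // the previous scalar was a ZWJ after a base
    for c in s.chars() {
        if c == ZWJ {
            joined = seen_base;
            continue;
        }
        if !(mode == ZwjMode::Combined && joined) {
            total += toy_char_width(c);
        }
        seen_base = true;
        joined = false;
    }
    total
}

fn main() {
    // MAN + ZWJ + WOMAN + ZWJ + GIRL
    let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}";
    println!("combined: {}", toy_width(family, ZwjMode::Combined)); // 2
    println!("separate: {}", toy_width(family, ZwjMode::Separate)); // 6
}
```

An application could pick the mode at runtime from configuration. The sketch deliberately ignores skin tone modifiers, regional indicators, and every non-emoji script.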

Manishearth commented 1 month ago

@joshtriplett Not all sequences are equal. I think there's significant variation along these axes:

- Whether or not ZWJ sequences are handled
  - Whether or not some ZWJ sequences are handled: some terminals handle older but not newer sequences.
- Whether or not skin tone sequences are handled (these aren't ZWJ-based; how they are handled is often font-dependent)
- Whether or not regional indicator sequences (most flags) are handled (these are somewhat correlated with ZWJ being handled, but are not themselves ZWJ-based)
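
As a small illustration of why these are separate axes, the following std-only sketch prints the scalar values of one sample from each category and checks for U+200D. The helper names `contains_zwj` and `code_points` are made up for this example.

```rust
const ZWJ: char = '\u{200D}';

/// Returns true if the string contains a zero-width joiner (U+200D).
fn contains_zwj(s: &str) -> bool {
    s.chars().any(|c| c == ZWJ)
}

/// Formats the scalar values of `s` as "U+XXXX U+XXXX ...".
fn code_points(s: &str) -> String {
    s.chars()
        .map(|c| format!("U+{:04X}", c as u32))
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    let samples = [
        // MAN + ZWJ + WOMAN + ZWJ + GIRL
        ("ZWJ sequence", "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}"),
        // THUMBS UP SIGN + medium skin tone modifier
        ("skin tone sequence", "\u{1F44D}\u{1F3FD}"),
        // REGIONAL INDICATOR J + REGIONAL INDICATOR P
        ("regional indicator pair", "\u{1F1EF}\u{1F1F5}"),
    ];
    for (label, s) in samples.iter() {
        println!(
            "{}: {} (contains ZWJ: {})",
            label,
            code_points(s),
            contains_zwj(s)
        );
    }
}
```

Only the first sample contains a ZWJ, so a rule keyed on U+200D alone says nothing about how a terminal renders the other two.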

And that's just for emoji. This doesn't even get into how bad this is for other writing systems.

I'd be more open to this if there were some sort of standardized "bad" behavior[^1] for terminals which don't wish to handle this, but there isn't as far as I can tell. wcwidth is a common denominator but it gets a bit messier on top of that; as I understand it, it's not just wcwidth that most terminals use. I think if people want wcwidth it would be good for there to be a separate crate providing wcwidth. I'd even be fine linking to such a crate in the docs with appropriate caveats.

This crate already gets a lot of "why does this not perfectly match the behavior of my terminal", and I don't want to try and start approximating broken terminals and have to deal with the fact that the broken terminals are all broken in slightly different ways.

no_cluster_handling is a misnomer for any attempt at this. If you want no_cluster_handling, that's .chars().count(). The thing you are asking for is an attempt to draw a line between the types of potential grapheme clusters: there's no preexisting terminology for this, and there's no consistency around it, which makes me very reluctant to wade into the business of giving it a name.
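
For reference, a short std-only sketch of what literally no cluster handling counts: every Unicode scalar value, including combining marks and the ZWJ itself. `scalar_count` is a hypothetical wrapper name around `.chars().count()`.

```rust
/// Counts Unicode scalar values with no cluster handling at all.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // "e" + U+0301 COMBINING ACUTE ACCENT: one user-perceived character.
    let e_acute = "e\u{0301}";
    // MAN + ZWJ + WOMAN + ZWJ + GIRL: one emoji where ZWJ sequences are supported.
    let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}";

    println!("{}", scalar_count(e_acute)); // 2
    println!("{}", scalar_count(family)); // 5
}
```

The combining accent and both joiners are each counted as one unit here, so this count is a different thing from a width that only declines to merge emoji sequences.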

[^1]: Not Unicode, just some kind of written-down behavior that most such terminals seem to conform to.

Manishearth commented 1 month ago

I think it may be okay for this crate to provide a wcwidth() implementation but AIUI wcwidth() is itself dependent on LC_CTYPE.

The rather simple General_Category + East_Asian_Width implementation described in the docs of Python's wcwidth seems somewhat decent, but again I don't know if this matches actual implementations. I think there are some parts of those docs that diverge from the behavior I have seen.

joshtriplett commented 1 month ago

> I think it may be okay for this crate to provide a wcwidth() implementation but AIUI wcwidth() is itself dependent on LC_CTYPE.

Beyond just whether LC_CTYPE is a UTF-8 locale? Because at this point I think it's safe to assume cases not using such a locale aren't worth supporting.

Manishearth commented 1 month ago

@joshtriplett I'm not really sure, is the thing. There's very little actually written about this in any documentation I can easily find.

There definitely is a concept of locale-aware segmentation (most good segmentation is locale-aware, even segmentation from the pre-emoji world). I don't know if implementations tend to reach for that.

(and again, it does not seem to me that terminals solely rely on wcwidth)

unicode-rs / unicode-width

Please consider providing a variant that treats grapheme clusters "incorrectly" as many terminals do #71