w3c / input-events

Input Events
https://w3c.github.io/input-events/

Code points and graphemes in backward deletion #71

Open r12a opened 7 years ago

r12a commented 7 years ago

5.1.2 Attributes https://w3c.github.io/input-events/#interface-InputEvent-Attributes

note after #26

In some scripts, backward deletion within a text node with a collapsed selection will delete a code point rather than a grapheme.

Did you mean to have 'code point' and 'grapheme' this way around? In the following note, the text says

In some scripts, forward deletion within a text node with a collapsed selection will delete a grapheme rather than a code point.

Which I suspect may be what was intended in the former note too (?)

[from Addison] Other examples might include things like emoji sequences (with variation selectors) or ZWJ sequences or certain control sequences.

johanneswilm commented 6 years ago

Did you mean to have 'code point' and 'grapheme' this way around? In the following note, the text says

If I remember correctly, the issue mentioned during an Editing Taskforce meeting was that deleting backward and forward did have different effects for certain scripts. But I may remember that incorrectly. Let me investigate.

kojiishi commented 6 years ago

Ah, this one is complicated, and I don't know exact answers yet.

I think we can say "backward/forward deletion may delete more than or less than one extended grapheme cluster as defined by UAX#29, including part of a grapheme cluster", but I'm not confident enough to say more.

You should also avoid using "grapheme" without a reference. There is no such terminology. UAX#29 defines a grapheme cluster as a "user-perceived character", but the exact definition varies because user perceptions vary. It defines two variants, legacy and extended, as common cases, but Unicode knows that different grapheme cluster boundary definitions are needed in different situations, such as cursor movement, drop caps, etc. There's an old request in the Unicode issue tracker to define each variation, but it hasn't gotten enough traction yet. So "grapheme cluster" is probably a better term to use than "grapheme", but even so, it still doesn't say what it means unless the terminology is defined somewhere.

What CSS does is to define "a character in CSS is an extended grapheme cluster as defined by UAX#29" and use this in all the places. As far as I know, there's no such definition in DOM, so each spec should have its own definition.
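To make the distinction concrete, here is a minimal TypeScript sketch that counts code points and extended grapheme clusters for the same string. It assumes a runtime that provides `Intl.Segmenter`, which segments along UAX#29-style boundaries (possibly with locale tailoring); the helper names are illustrative and do not come from any spec.

```typescript
// Count code points: the string iterator yields one entry per code point,
// so surrogate pairs are counted once.
function countCodePoints(text: string): number {
  return Array.from(text).length;
}

// Count extended grapheme clusters as reported by Intl.Segmenter.
// Assumption: the runtime ships Intl.Segmenter with grapheme granularity.
function countGraphemeClusters(text: string, locale: string = "en"): number {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return Array.from(segmenter.segment(text)).length;
}

// DIGIT TWO + COMBINING ENCLOSING KEYCAP
const keycapSample = "2\u20E3";
// MAN + ZWJ + WOMAN + ZWJ + GIRL (a ZWJ emoji sequence)
const familySample = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";

console.log(countCodePoints(keycapSample), countGraphemeClusters(keycapSample)); // 2 1
console.log(countCodePoints(familySample), countGraphemeClusters(familySample)); // 5 1
```

Both samples are a single cluster made of several code points, which is why "one character" is ambiguous unless a spec pins down what it means.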

/cc @yosinch

yosinch commented 6 years ago

It seems that just saying "platform dependent" and referring to UAX#29 is the current solution.

From UAX#29 "Grapheme Cluster Boundaries" section

This document defines a default specification for grapheme clusters. It may be customized for particular languages, operations, or other situations. For example, arrow key movement could be tailored by language, or could use knowledge specific to particular fonts to move in a more granular manner, in circumstances where it would be useful to edit individual components. This could apply, for example, to the complex editorial requirements for the Northern Thai script Tai Tham (Lanna). Similarly, editing a grapheme cluster element by element may be preferable in some circumstances. For example, on a given system the backspace key might delete by code point, while the delete key may delete an entire cluster.

My questions are:

BTW, In Blink,

It seems Backspace behavior depends on the platform and on code point characteristics, e.g. for 2⃣ (DIGIT TWO followed by U+20E3 COMBINING ENCLOSING KEYCAP): Chrome/Edge = delete both, Firefox = delete U+20E3 only.
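The two outcomes reported for the keycap sequence can be sketched in TypeScript as below. This only illustrates the observable difference and is not how any engine implements Backspace; the function names are hypothetical, and the cluster variant assumes `Intl.Segmenter` is available.

```typescript
// Backspace that removes only the last code point
// (matches the Firefox result reported above for the keycap sequence).
function backspaceByCodePoint(text: string): string {
  return Array.from(text).slice(0, -1).join("");
}

// Backspace that removes the whole last extended grapheme cluster
// (matches the Chrome/Edge result reported above for the keycap sequence).
function backspaceByCluster(text: string): string {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  const segments = Array.from(segmenter.segment(text));
  const last = segments[segments.length - 1];
  // `index` is the UTF-16 offset where the last cluster starts.
  return last ? text.slice(0, last.index) : text;
}

const keycapTwo = "2\u20E3"; // DIGIT TWO + COMBINING ENCLOSING KEYCAP

console.log(JSON.stringify(backspaceByCodePoint(keycapTwo))); // "2"
console.log(JSON.stringify(backspaceByCluster(keycapTwo)));   // ""
```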

johanneswilm commented 6 years ago

Do we want to spec Backspace behavior in editing TF?

Possibly, but not in this spec. I think this note was merely meant as a hint to JS authors that they may need to use the getTargetRanges() method and cannot just count on the deletion being exactly one character. I think we should add a note saying that both notes are non-normative.
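For JS authors, that advice might look roughly like the following TypeScript sketch: instead of assuming one character, ask the event what it is about to delete. The two interfaces are minimal stand-ins (an assumption made so the sketch also runs outside a browser); in a page, a real `InputEvent` and its `StaticRange` objects satisfy them. `deletedCodeUnits` is a hypothetical helper name.

```typescript
// Minimal structural types; a real StaticRange / InputEvent fits these shapes.
interface RangeLike {
  startContainer: unknown;
  endContainer: unknown;
  startOffset: number;
  endOffset: number;
}

interface BeforeInputLike {
  inputType: string;
  getTargetRanges(): RangeLike[];
}

// Returns how many UTF-16 code units the user agent intends to delete,
// or null if this is not a deletion confined to a single container.
// Note: offsets are code units only when the container is a text node.
function deletedCodeUnits(event: BeforeInputLike): number | null {
  if (!event.inputType.startsWith("delete")) return null;
  const [range] = event.getTargetRanges();
  if (!range || range.startContainer !== range.endContainer) return null;
  return range.endOffset - range.startOffset;
}

// Assumed browser wiring, where `editor` is a hypothetical contenteditable element:
// editor.addEventListener("beforeinput", (e) => console.log(deletedCodeUnits(e)));
```

Depending on script and platform, the returned length for a single Backspace may be 1, 2, or more, which is the point the notes are trying to make.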

johanneswilm commented 6 years ago

How about this? Will that make things better? @kojiishi @yosinch ?

johanneswilm commented 6 years ago

Given that I made the notes say they are non-normative, I left out some things, such as the remark about emoji sequences, etc. Do you think it would be better to also mention that? I think the initial aim of including the note was to get makers of JS editors who are not familiar with scripts other than Latin to consider using getTargetRanges() rather than just relying on it being exactly one character they need to remove.

r12a commented 6 years ago

Actually, this came up in a Unicode/WG2 meeting last week. It seems that in general what you had is correct: backspacing over text removes codepoints, whereas forward delete removes graphemes. I think there may be some differences depending on platform/application.

(The backspacing behaviour avoids deleting and having to retype a sometimes longish sequence of characters when you mistype something, and allows you to delete something inside a cluster. The forward delete was explained as avoiding a situation where only a (possibly hard to see) combining character remains.)
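The rationale for forward delete can be illustrated with a small TypeScript sketch. It compares deleting one code point after the caret with deleting the whole extended grapheme cluster after the caret. The function names are hypothetical, `caret` is assumed to be a UTF-16 offset, and the cluster variant assumes `Intl.Segmenter` is available; real platforms may tailor the boundaries differently.

```typescript
// Forward delete that removes a single code point after the caret.
function forwardDeleteByCodePoint(text: string, caret: number): string {
  const codePoint = text.codePointAt(caret);
  if (codePoint === undefined) return text; // caret at end: nothing to delete
  const width = codePoint > 0xffff ? 2 : 1; // astral code points use 2 code units
  return text.slice(0, caret) + text.slice(caret + width);
}

// Forward delete that removes everything up to the end of the
// extended grapheme cluster containing the caret position.
function forwardDeleteByCluster(text: string, caret: number): string {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  const segment = segmenter.segment(text).containing(caret);
  if (!segment) return text; // caret at or past the end of the text
  const clusterEnd = segment.index + segment.segment.length;
  return text.slice(0, caret) + text.slice(clusterEnd);
}

const sample = "e\u0301x"; // "e" + COMBINING ACUTE ACCENT, then "x"

// Code point deletion leaves a lone combining accent in front of "x".
console.log(JSON.stringify(forwardDeleteByCodePoint(sample, 0)) === JSON.stringify("\u0301x")); // true
// Cluster deletion removes the base letter and its accent together.
console.log(forwardDeleteByCluster(sample, 0)); // x
```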

johanneswilm commented 6 years ago

@r12a Thanks! I reverted that part so it should be as it is in the current version.

johanneswilm commented 5 years ago

Based on the meeting at TPAC, we are waiting for a suggestion from @r12a on how to adjust the explanatory note text.

kojiishi commented 5 years ago

Maybe "different number of code points" is better than saying it's "single code point"?

Mac/iOS has an API to get the number of code points for backspace for the given string.

rniwa commented 5 years ago

@litherum @whsieh

r12a commented 5 years ago

Maybe "different number of code points" is better than saying it's "single code point"?

Or "one code point or a smaller number of code points than an entire grapheme cluster" ??

r12a commented 4 years ago

The i18n WG is satisfied with the resolution of this issue.