w3c / input-events

Input Events
https://w3c.github.io/input-events/

"character" is not defined #73

Closed r12a closed 1 year ago

r12a commented 7 years ago

[from Addison Phillips]

https://w3c.github.io/input-events/#interface-InputEvent-Attributes

In section 5.1.2 there are multiple places where the term "character" is used without definition. It would be better to clearly define this to mean a Unicode code point.

johanneswilm commented 6 years ago

@r12a @aphillips sorry for late reply, somehow I had missed this.

I am ok with defining the term character. But I cannot find any appropriate definition of the term in the W3C repositories which doesn't use the word "character" as explanation for what that is. And clearly we cannot link to that, because such a definition would be circular. The definition on Wikipedia makes the term code point even broader: "Many code points represent single characters but they can also have other meanings, such as for formatting." [1]

[1] https://en.wikipedia.org/wiki/Code_point

xfq commented 6 years ago

FWIW, in Infra:

A code point is a Unicode code point and is represented as a four-to-six digit hexadecimal number, typically prefixed with "U+". [...] Code points are sometimes referred to as characters and in certain contexts are prefixed with "0x" rather than "U+".

johanneswilm commented 5 years ago

Based on the meeting at TPAC, we are waiting for a suggestion from @r12a on how to adjust the explanatory note text.

aphillips commented 3 years ago

Updating this issue as part of I18N's regular clean-up cycle. There is now a definition in the spec:

https://w3c.github.io/input-events/#definitions

This defines "character" as:

A character is an extended grapheme cluster. [UAX29]

I'm not sure that this is what is intended, given that some input events (backwards deletion, certain cursoring operations) may be on a code point basis. This needs a read-through to determine. In addition, it looks like we owe some text based on a meeting at TPAC. I'll update our tracking issue to needs attention and add it to our action list.

johanneswilm commented 3 years ago

@aphillips See also previous discussion here: https://github.com/w3c/input-events/issues/71#issuecomment-399820566 .

aphillips commented 3 years ago

@johanneswilm I didn't really re-read Input Events this morning when making comments--relying on memory can thus be tricky. Cursoring/selection changes are something I know we've talked about somewhere, but perhaps not in input events :-)

For backward deletion without an IME, yes: generally speaking backwards deletion works on a code point basis. Try a sequence like U+0061 U+0300 (à). Even simple editors like Notepad will delete the accent separately from the base letter when using backspace (even though you cannot select them separately). This is, of course, only true for denormalized input. U+00E0 (à) deletes as a single code point.

Languages that rely on or require combining marks, such as the Indic ones, depend on this behavior so that users can correct typos. Of course, some of these also use IMEs.
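To make the U+0061 U+0300 example above concrete, here is a minimal Python sketch. It assumes a backspace that removes exactly one code point, which is the behavior described above for non-IME input; the helper name `backspace_code_point` is made up for illustration and is not part of Input Events or any editor API.

```python
import unicodedata

# Two encodings of the same user-perceived character "à".
decomposed = "a\u0300"   # U+0061 LATIN SMALL LETTER A + U+0300 COMBINING GRAVE ACCENT
precomposed = "\u00e0"   # U+00E0 LATIN SMALL LETTER A WITH GRAVE


def backspace_code_point(text: str) -> str:
    """Hypothetical helper: drop the last code point.

    Python strings are indexed by code point, so slicing off one
    element removes exactly one code point.
    """
    return text[:-1]


# The decomposed form loses only the accent; the base letter survives.
after_decomposed = backspace_code_point(decomposed)

# The precomposed form is a single code point, so nothing is left.
after_precomposed = backspace_code_point(precomposed)

# Normalization converts between the two forms.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
```

Whether a given editor normalizes text before applying the deletion is implementation-specific; the sketch does not model that.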

johanneswilm commented 3 years ago

@aphillips You are right, but after rereading that discussion, I believe we were aware of this difference at the time we included the definition. We only use the definition of "character" for the "insertTranspose" input type, in which case it really is switching two characters and it's not ever on code point basis.

But I might be wrong. At any rate, I think the last we officially heard was that we would receive a PR from @r12a so if we can get that now, that would be preferable.

aphillips commented 3 years ago

@johanneswilm I'm working on getting that PR (or at least evaluating whether more work is needed) from I18N (probably @r12a or me), but I think it'll be at least a few days while we remind ourselves of where we left this. Transposition of characters should be done on a grapheme cluster basis for sure. Stay tuned.
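As a rough illustration of why transposition wants grapheme clusters rather than code points, here is a small Python sketch. It is an approximation under stated assumptions: `simple_clusters` only attaches combining marks with a non-zero canonical combining class to the preceding code point, which is not the full extended grapheme cluster algorithm from UAX #29 (it misses, for example, spacing marks and emoji sequences). All function names are hypothetical and do not come from the spec.

```python
import unicodedata


def simple_clusters(text: str) -> list:
    """Approximate clustering: a base code point plus any following
    code points whose canonical combining class is non-zero.
    NOT a full UAX #29 implementation."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch   # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


def transpose_last_two_clusters(text: str) -> str:
    """Swap the last two (approximate) grapheme clusters."""
    clusters = simple_clusters(text)
    if len(clusters) < 2:
        return text
    clusters[-2], clusters[-1] = clusters[-1], clusters[-2]
    return "".join(clusters)


def transpose_last_two_code_points(text: str) -> str:
    """Swap the last two code points, ignoring cluster boundaries."""
    if len(text) < 2:
        return text
    return text[:-2] + text[-1] + text[-2]


sample = "ba\u0300"  # "b" followed by decomposed "à"

by_cluster = transpose_last_two_clusters(sample)        # accent stays on "a"
by_code_point = transpose_last_two_code_points(sample)  # accent lands on "b"
```

With the code point version, the combining accent ends up after "b" and so renders on the wrong letter, which is the kind of result a grapheme-based transpose avoids.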

aphillips commented 2 years ago

Reviewing this today (2022-03-07), it appears we didn't put in a PR. I have reviewed the current WD: @johanneswilm's description is correct. The term character is used only once in the document, in the insertTranspose input type.

The I18N WG is admittedly pedantic about character encoding jargon. In this case, the meaning of "character" is intended to be a "user-perceived character", aka a grapheme or grapheme cluster. I would suggest:

  1. Remove the definition of character from the Terminology section, since the term is used only once in the entire document. This will avoid future revisions accidentally using the term in a different way.

  2. Replace the term 'character' in insertTranspose with the term grapheme, linking to the [I18N-GLOSSARY]. (We created the I18N glossary since the last comments on this thread, and it can be referenced via specref.)

Would you prefer a PR for this?