r12a closed 1 year ago
@r12a @aphillips sorry for late reply, somehow I had missed this.
I am ok with defining the term character. But I cannot find any appropriate definition of the term in the W3C repositories that doesn't use the word "character" to explain what it is. And clearly we cannot link to that, because such a definition would be circular. The definition on Wikipedia makes the term code point even broader: "Many code points represent single characters but they can also have other meanings, such as for formatting." [1]
FWIW, in Infra:
A code point is a Unicode code point and is represented as a four-to-six digit hexadecimal number, typically prefixed with "U+". [...] Code points are sometimes referred to as characters and in certain contexts are prefixed with "0x" rather than "U+".
Based on the meeting at TPAC, we are waiting for a suggestion from @r12a on how to adjust the explanatory note text.
Updating this issue as part of I18N's regular clean-up cycle. There is now a definition in the spec:
https://w3c.github.io/input-events/#definitions
This defines "character" as:
A character is an extended grapheme cluster. [UAX29]
I'm not sure that this is what is intended, given that some input events (backwards deletion, certain cursoring operations) may be on a code point basis. This needs a read-through to determine. In addition, it looks like we owe some text based on a meeting at TPAC. I'll update our tracking issue to needs attention and add it to our action list.
@aphillips See also previous discussion here: https://github.com/w3c/input-events/issues/71#issuecomment-399820566 .
@johanneswilm I didn't really re-read Input Events this morning when making comments--relying on memory can thus be tricky. Cursoring/selection changes are something I know we've talked about somewhere, but perhaps not in input events :-)
For backward deletion without an IME, yes: generally speaking backwards deletion works on a code point basis. Try a sequence like U+0061 U+0300 (à). Even simple editors like Notepad will delete the accent separately from the base letter when using backspace (even though you cannot select them separately). This is, of course, only true for denormalized input. U+00E0 (à) deletes as a single code point.
Languages such as the Indic ones, which rely on or require combining marks, depend on this behavior so that users can correct typos. Of course, some of these also use IMEs.
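To make the backspace example concrete, here is a minimal Python sketch. It assumes an editor that deletes exactly one code point per backspace, as described above; `backspace_code_point` is a hypothetical helper for illustration, not anything defined by Input Events or by a real editor.

```python
import unicodedata

# Python strings are sequences of code points, so slicing works per code point.
decomposed = "a\u0300"    # U+0061 U+0300: base letter plus combining grave accent
precomposed = "\u00e0"    # U+00E0: the same user-perceived character as one code point


def backspace_code_point(text: str) -> str:
    """Hypothetical helper: delete the last code point, like a simple editor's backspace."""
    return text[:-1]


after_decomposed = backspace_code_point(decomposed)     # only the accent is removed
after_precomposed = backspace_code_point(precomposed)   # the whole letter is removed

print(len(decomposed), len(precomposed))
print(repr(after_decomposed), repr(after_precomposed))

# NFC normalization composes the two-code-point sequence into U+00E0.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

Both strings render as à, but one backspace leaves "a" in the first case and an empty string in the second, which is the difference the comment above points out.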
@aphillips You are right, but after rereading that discussion, I believe we were aware of this difference at the time we included the definition. We only use the definition of "character" for the "insertTranspose" input type, in which case it really is switching two characters and it's never on a code point basis.
But I might be wrong. At any rate, I think the last we officially heard was that we would receive a PR from @r12a so if we can get that now, that would be preferable.
@johanneswilm I'm working on getting that PR (or at least evaluating if more work is needed) from I18N (probably @r12a or me) but I think it'll probably be at least a few days while we remind ourselves of where we left this. Transposition of characters should be done on a grapheme cluster basis for sure. Stay tuned.
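As an illustration of why transposition should work on grapheme clusters, here is a rough Python sketch. It is not the full UAX #29 algorithm: `simple_clusters` is a hypothetical approximation that only attaches code points with a non-zero canonical combining class to the preceding base, so it misses many real cases (for example spacing marks, Hangul jamo, and emoji sequences). `transpose_last_two` is likewise a made-up helper, assuming "transpose" means swapping the last two units.

```python
import unicodedata


def simple_clusters(text: str) -> list[str]:
    """Approximate grapheme clusters: a base code point plus following combining marks.

    Simplification for illustration only; not a UAX #29 implementation.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the combining mark to the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters


def transpose_last_two(units: list[str]) -> list[str]:
    """Swap the last two units; return an unchanged copy if there are fewer than two."""
    if len(units) < 2:
        return list(units)
    return units[:-2] + [units[-1], units[-2]]


text = "xa\u0300b"  # x, then a + combining grave accent, then b

by_cluster = "".join(transpose_last_two(simple_clusters(text)))
by_code_point = "".join(transpose_last_two(list(text)))

print(ascii(by_cluster))      # the accent stays with its base letter a
print(ascii(by_code_point))   # the accent ends up after b instead
```

Swapping by code point detaches the accent from its base letter and moves it onto the b, which is not what a user would expect from a transpose operation.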
Reviewing this today (2022-03-07) it appears we didn't put in a PR. I have reviewed the current WD: @johanneswilm's description is correct. The term character is only used once in the document, in the insertTranspose function.
The I18N WG is admittedly pedantic about character encoding jargon. In this case, the meaning of "character" is intended to be a "user-perceived character", aka a grapheme or grapheme cluster. I would suggest:
Remove the definition of character from the Terminology section, since it is used only once in the entire document. This will avoid future revisions accidentally using the term in a different way.
Replace the term 'character' in insertTranspose with the term grapheme, linking to the [I18N-GLOSSARY]. (We created the I18N glossary since the last comments on this thread, and it can be referenced through specref.)
Would you prefer a PR for this?
[from Addison Phillips]
https://w3c.github.io/input-events/#interface-InputEvent-Attributes
In section 5.1.2 there are multiple places where the term "character" is used without definition. It would be better to clearly define this to mean a Unicode code point.