How does characterboundsupdate interact with multi-codeunit characters?

marijnh commented 7 months ago

The spec doesn't seem to explicitly say that the number of rectangles passed to updateCharacterBounds should equal event.rangeEnd - event.rangeStart, but the example implementation does it that way, and it kind of seems implied by the fact that the browser needs to be able to find the appropriate rectangle for a given character by offset and the rectangles don't get explicitly associated with a specific position, except for their array position.

Since a given 'character' can take up multiple string positions, how should astral characters be handled here? Repeat their position multiple times in the array? If so, that seems non-obvious enough to mention explicitly. (But it also seems like a somewhat awkward solution, and defining this in such a way that the number of rectangles should match the number of actual Unicode characters, not code units, between the given offsets, would also be reasonable, assuming the API can guarantee that the queried offsets never fall in the middle of a surrogate pair).
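To make the mismatch concrete, here is a minimal sketch in plain JavaScript. The helper names (`codeUnitCount`, `codePointCount`, `splitsSurrogatePair`) are illustrative only and are not part of the EditContext API. It shows an astral character occupying two string positions while being a single code point, and how an offset can land inside a surrogate pair.

```javascript
// Illustrative helpers; none of these names come from the EditContext spec.

// Number of UTF-16 code units, i.e. what rangeEnd - rangeStart measures.
function codeUnitCount(text) {
  return text.length;
}

// Number of Unicode code points; string iteration walks by code point.
function codePointCount(text) {
  return [...text].length;
}

// True if `offset` falls between the two halves of a surrogate pair.
function splitsSurrogatePair(text, offset) {
  if (offset <= 0 || offset >= text.length) return false;
  const before = text.charCodeAt(offset - 1);
  const after = text.charCodeAt(offset);
  const isHigh = before >= 0xd800 && before <= 0xdbff;
  const isLow = after >= 0xdc00 && after <= 0xdfff;
  return isHigh && isLow;
}

const sample = "a\u{1F600}b"; // "a", U+1F600 (astral), "b"
console.log(codeUnitCount(sample)); // 4
console.log(codePointCount(sample)); // 3
console.log(splitsSurrogatePair(sample, 2)); // true
```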

marijnh commented 2 months ago

Hey, this seemed like a reasonable thing to ask clarification on—but the response has been absolute silence for 5 months. Is anyone steering this ship?

dandclark commented 2 months ago

Sorry for the delay here, this is a good question. IMO the most straightforward thing is to define it such that a bound is repeated in the array passed to updateCharacterBounds for each string position that makes up a given unicode character, even if that's a bit clunky. But I've added this to the Agenda to discuss on next week's WG call.
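A rough sketch of what "repeat the bound for each string position" could look like, handling surrogate pairs only. The function name `boundsPerCodeUnit` and the `measureCodePoint` callback are hypothetical stand-ins for an editor's own layout code, and plain objects are used in place of `DOMRect` so the snippet runs outside a browser.

```javascript
// Hypothetical helper: `measureCodePoint(offset, char)` stands in for the
// editor's own layout code and returns a rect-like object.
function boundsPerCodeUnit(text, rangeStart, rangeEnd, measureCodePoint) {
  const bounds = [];
  let offset = rangeStart;
  while (offset < rangeEnd) {
    const char = String.fromCodePoint(text.codePointAt(offset));
    const rect = measureCodePoint(offset, char);
    // Push the same rect once per code unit (1 for BMP, 2 for astral),
    // without running past rangeEnd.
    for (let i = 0; i < char.length && offset < rangeEnd; i++, offset++) {
      bounds.push(rect);
    }
  }
  return bounds;
}

// Usage with a fake measurer: 10px-wide boxes laid out by offset.
const rects = boundsPerCodeUnit("a\u{1F600}b", 0, 4, (offset) => ({
  x: offset * 10, y: 0, width: 10, height: 16,
}));
console.log(rects.length); // 4
console.log(rects[1] === rects[2]); // true
```

In a `characterboundsupdate` handler the result would presumably be passed along as `editContext.updateCharacterBounds(event.rangeStart, bounds)`, with real `DOMRect` objects instead of plain ones.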

marijnh commented 2 months ago

Thanks for the response. There may even be a case to be made for making the granularity of this grapheme clusters, though those are still awkward to determine in JS. An interface that provides the client code with the ranges of the specific grapheme(s) it is querying seems preferable, but I'm guessing you wouldn't want to break backwards compatibility at this point.

dandclark commented 2 months ago

The minutes from today's call:

08:16 dandclark: the problem is that updateCharBounds requires editor to provide bounds per character in the string, but what happens for a grapheme cluster that spans multiple characters in the string?
08:17 dandclark: e.g. 👨🏻‍⚕️ would require >2 (4?) characters in the string, what does it mean to ask for bounds of [0, 1] in the string
08:17 dandclark: should we make the ranges based on grapheme cluster instead?
08:18 dandclark: proposal: can we keep it how it is today? web devs don't need to generally worry about grapheme clusters directly today; authors generally work in terms of JS string indices
08:19 smaug: would be good to get more feedback from web devs
08:20 dandclark: what I read is, we could ask for the ranges in grapheme clusters but the author would need to then worry about grapheme clusters anyways
08:20 q+
08:21 dandclark: is grapheme cluster even consistent across browsers/platforms? would web dev need to worry about this
08:23 dandclark: right now, things tend to Just Work because the string indices line up with the backing store indices. because the backing store is a string..
08:24 dandclark: ...but let's get more dev feedback.
08:24 possible we're missing a nuance
08:27 whsieh: can we bake the contract of "UA never asks for range that starts/stops in the middle of a grapheme cluster" into the spec?
08:27 dandclark: seems like a bug if a browser were to do that. not sure whether that would be a normative note
08:28 dandclark: (maybe a non-normative note)
08:28 dandclark: we'll need to be careful about terminology here
08:29 johanneswilm: browser knows internally where grapheme clusters start/end
08:30 dandclark: oh, wait: author might have a way of segmenting code points that disagrees with the browser/platform
08:30 dandclark: e.g. fully canvas-driven text rendering in JS
08:31 dandclark: maybe the browser shouldn't (generally) have an opinion about this

In summary it's still undecided which way we should go here, and we're going to ask for more developer feedback on which way is preferable.

TheSpyder commented 2 months ago

I haven't been keeping up on all the details of edit context, but I am an editor developer.

While having a range implementation based on grapheme clusters sounds great, that would make it different from every other DOM range which seems like a recipe for confusion and bugs. The little work I've done with clusters is mostly in UI, not editing, but we were recently able to switch that to Intl.Segmenter so I can say my concept of a written "character" has evolved to mean a grapheme cluster.

Looking at the method documentation for updateCharacterBounds() which describes the characterBounds parameter as "An Array containing DOMRect objects representing the character bounds", with no other context I would implement that using Intl.Segmenter and provide one DOMRect per grapheme. Perhaps the MDN example for characterboundsupdate should change to that?

This would imply that the browser-provided range request has offsets between clusters, which I think is a reasonable assumption to make. As more developers become familiar with grapheme clusters I would hope that they, like me, will start to read any mention of "character bounds" as implying "grapheme bounds".
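A sketch of that reading, one rect per grapheme, might look like the following. `boundsPerGrapheme` and the `measureGrapheme` callback are hypothetical names, plain objects stand in for `DOMRect`, and the code assumes (as stated above) that the requested offsets fall between clusters.

```javascript
// Hypothetical helper: `measureGrapheme(offset, segment)` stands in for the
// editor's layout code. Assumes rangeStart/rangeEnd sit on cluster boundaries.
function boundsPerGrapheme(text, rangeStart, rangeEnd, measureGrapheme) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  const slice = text.slice(rangeStart, rangeEnd);
  const bounds = [];
  for (const { segment, index } of segmenter.segment(slice)) {
    // `index` is relative to the slice, so shift it back into text offsets.
    bounds.push(measureGrapheme(rangeStart + index, segment));
  }
  return bounds;
}

// "e" + combining acute accent, "x", then an astral emoji: 5 code units.
const text = "e\u0301x\u{1F600}";
const rects = boundsPerGrapheme(text, 0, text.length, (offset, segment) => ({
  x: offset * 10, y: 0, width: 10 * segment.length, height: 16,
}));
console.log(text.length, rects.length);
```

Note that under this interpretation the array can be shorter than `rangeEnd - rangeStart`, so the browser would have to segment the text the same way to map an offset to its rect.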

johanneswilm commented 1 month ago

From TPAC 2024 minutes:

Dan: [explains issue and discussion at previous meeting] We will ask for each code unit. In case of a grapheme cluster, the JS will need to give back four times the same values if it’s the same grapheme cluster

Anne: “character” was an unfortunate choice

Dan: problem is: User may use their own font and complicated characters, and they may be rendered apart or together.

Anne: Across Unicode revisions it changes what is a grapheme cluster. I think code point would be nicer, but code units is more consistent with what we have otherwise. We don’t have a good way of measuring code points, so we should go with code units. As long as you put in some links to infra standard (that defines code units, etc.).

Dan: Resolution: clarify that unit is code unit. And link to infra spec.
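For illustration, one way an editor could satisfy this resolution while still laying out text by grapheme cluster is sketched below. `boundsPerCodeUnitByGrapheme` and `measureGrapheme` are hypothetical names, plain objects stand in for `DOMRect`, and using `Intl.Segmenter` is an assumption about how the editor segments its text, not something the resolution requires.

```javascript
// Hypothetical helper: returns exactly rangeEnd - rangeStart rects (for a range
// inside the text), repeating each cluster's rect once per code unit it covers.
function boundsPerCodeUnitByGrapheme(text, rangeStart, rangeEnd, measureGrapheme) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  const bounds = [];
  for (const { segment, index } of segmenter.segment(text)) {
    const segmentEnd = index + segment.length;
    if (segmentEnd <= rangeStart) continue; // cluster ends before the range
    if (index >= rangeEnd) break; // cluster starts after the range
    const rect = measureGrapheme(index, segment);
    const from = Math.max(index, rangeStart);
    const to = Math.min(segmentEnd, rangeEnd);
    for (let unit = from; unit < to; unit++) bounds.push(rect);
  }
  return bounds;
}

// Man + skin-tone modifier + ZWJ + medical symbol + variation selector.
const healthWorker = "\u{1F468}\u{1F3FB}\u200D\u2695\uFE0F";
const rects = boundsPerCodeUnitByGrapheme(
  healthWorker, 0, healthWorker.length,
  (offset, segment) => ({ x: offset * 10, y: 0, width: 20, height: 16 }),
);
console.log(healthWorker.length, rects.length); // 7 7
```

The repeat count depends on how many code units the cluster occupies: the sequence as written here is 7 code units long, so its rect would be repeated 7 times if the segmenter treats it as a single cluster.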

w3c / edit-context

How does characterboundsupdate interact with multi-codeunit characters? #96