w3c / svgwg

SVG Working Group specifications

Character counting in text 'x', 'y', 'dx', 'dy', and 'rotate' attributes. #537

Open Tavmjong opened 6 years ago

Tavmjong commented 6 years ago

In its description of these attributes, SVG 1.1 dictates that 'x', 'y', etc. values apply to characters as defined in XML. See:

https://www.w3.org/TR/SVG11/text.html#TextElement

https://www.w3.org/TR/SVG11/text.html#TSpanElement

SVG 1.1 also dictates that the number returned by getNumberOfChars() should be a count according to DOM 3. Non-rendered characters are to be included in the count.

https://www.w3.org/TR/SVG11/text.html#InterfaceSVGTextContentElement

DOM 3 defines strings in terms of 16-bit units; Unicode code points outside the Basic Multilingual Plane consist of two such units. See:

https://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#DOMString
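The difference between the two counts is easy to see from JavaScript, since JS strings use the same UTF-16 model as DOMString (a minimal sketch; the clef character is chosen only as an example of a supplementary-plane code point):

```javascript
// U+1D11E MUSICAL SYMBOL G CLEF is outside the Basic Multilingual
// Plane, so UTF-16 encodes it as a surrogate pair of two code units.
const clef = "\u{1D11E}";

// String.prototype.length counts UTF-16 code units, as DOM 3 does:
console.log(clef.length);        // 2

// Iterating a string yields whole Unicode code points:
console.log([...clef].length);   // 1

// So "a" + clef + "b" is 4 UTF-16 units but only 3 code points:
const s = "a" + clef + "b";
console.log(s.length, [...s].length); // 4 3
```

This is exactly the discrepancy the positioning tests below probe: whether the nth value in 'x' pairs with the nth code unit or the nth code point.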

The normative text in the SVGTextContentElement section thus appears to contradict the section discussing attribute value mapping for the <tspan> element: the former uses UTF-16 code units while the latter uses XML characters.

SVG 2 attempts to make some clarifications:

An 'addressable character' is defined which applies both to mapping attribute values to characters and to character counting by getNumberOfChars():

https://www.w3.org/TR/SVG2/text.html#Definitions

Addressable characters are counted in UTF-16 code units, after white-space collapsing. They do not include characters in elements with a 'display' value of 'none'.

https://www.w3.org/TR/SVG2/text.html#Definitions

Both these conditions seem to be a change from SVG 1.1 with respect to attribute value mapping.

https://www.w3.org/TR/SVG2/text.html#TSpanNotes

https://www.w3.org/TR/SVG2/text.html#InterfaceSVGTextContentElement

Tests

Test of UTF-16 code unit counting:

http://tavmjong.free.fr/SVG/positioning-001.svg

Firefox 61 passes this test; Edge 15, Chrome 50/68, Android 5, and iOS 8.3 fail. The latter do the mapping by Unicode code points. Firefox and Chrome (the only ones tested) do report the correct number of characters from getNumberOfChars() (see the JavaScript console output).

I think most developers would expect mapping to be by Unicode code points and propose that we ask Firefox to switch and change the spec.

Test of effect of 'display:none':

http://tavmjong.free.fr/SVG/positioning-002.svg

Firefox 61 fails this test (as does Inkscape, which doesn't handle 'display:none'). Edge 15, Chrome 50/68, Android 5, and iOS 8.3 pass. Firefox includes characters in an element with 'display:none' in the count.

I believe Firefox does the right thing here. The mapping of values in an attribute shouldn't depend on a CSS value that can be changed on a whim. However, as Firefox is the odd one out, it might be more prudent to follow the behavior of the other browsers and the spec as currently written.

css-meeting-bot commented 6 years ago

The SVG Working Group just discussed Character counting in text 'x', 'y', 'dx', 'dy', and 'rotate' attributes, and agreed to the following:

The full IRC log of that discussion <krit> topic: Character counting in text 'x', 'y', 'dx', 'dy', and 'rotate' attributes
<krit> GitHub: https://github.com/w3c/svgwg/issues/537
<krit> Tav: did more investigation. In SVG 1.1, all browsers but Firefox do the counting by Unicode code points and not by UTF-16
<krit> AmeliaBR: so an emoji just gets one value of the attribute?
<krit> Tav: look down in the issue and there are a couple of tests. The 1st example demonstrates that.
<krit> Tav: see the 5 chars at the bottom and they get positioned by unicode points.
<krit> http://tavmjong.free.fr/SVG/positioning-001.svg
<krit> Tav: so the empty boxes mean you don't have the necessary font installed
<krit> AmeliaBR: in chrome the red and black versions don't line up. What does it mean?
<krit> Tav: if they line up then the chars get positioned following UTF16
<krit> AmeliaBR: seems to be the useful thing to do, especially since most implementations do this
<krit> krit: doesn't work in Ai yet because of UTF16 issues on import
<krit> AmeliaBR: you can try to use entities so that you can comment on the issue what Ai is doing
<krit> krit: will do
<krit> AmeliaBR: what about the DOM methods
<krit> Tav: if you open the console for the tests and look at the output... the DOM methods DO use UTF16.
<krit> AmeliaBR: which seems less useful
<krit> AmeliaBR: there are other DOM methods which read back the actual position and show how the actual layout happens. Those would still use UTF16 but would match up what actually is going to get used.
<krit> AmeliaBR: we need to clearly test what browsers are doing but if browsers use actual unicode characters then...
<krit> krit: To clarify: Browsers use Unicode code points for actual layout but UTF-16 for DOM methods?
<krit> Tav: sounds correct.
<krit> AmeliaBR: If you say "give me character 2" it should return which character it is part of taking glyphs and everything into account already.
<krit> AmeliaBR: including UTF16 encoding
<krit> Tav: I think SVG 1.1 actually specs as browsers but Firefox implement it.
<krit> Tav: I think Cameron added a clarification.
<krit> krit: What testing is missing? Tav seemed to have some part of it.
<krit> AmeliaBR: I'd like to see the other DOM methods that read back position cross-browsers.
<krit> AmeliaBR: especially if there are compatibility issues for files that were exported to SVG and now would get positioned incorrectly on reading back. Mostly the non-web use cases.
<krit> krit: Ai would not use any of the DOM methods but I can provide feedback to the visual output.
<krit> AmeliaBR: the description should match up with SVG 1 and most implementations.
<krit> krit: we could reconsider later and resolve now.
<AmeliaBR> Proposed: Assignment of multi-value text layout attributes (x, y, dx, dy, rotate) should be according to Unicode codepoint characters, not UTF-16 blocks.
<krit> RESOLUTION: Assignment of multi-value text layout attributes (x, y, dx, dy, rotate) should be according to Unicode codepoint characters, not UTF-16 blocks.
<krit> AmeliaBR: there was another part of the issue how to collapse whitespaces. Any proposal on that one?
<krit> Tav: from a user perspective: if you change a CSS property the characters move around.
<krit> AmeliaBR: that is consistent with how display: none works in CSS layout, in comparison to visibility: hidden.
<krit> Tav: the hidden one would leave a gap in the text
<krit> AmeliaBR: so you think there should be a way where automatic layout adjusts but per character markup still applies regardless of the overall layout
<krit> Tav: yes
<krit> AmeliaBR: especially on manual kerning you wouldn't want to match the characters to other characters.
<krit> Tav: right. This is unpredictable in some cases.
<krit> Tav: markup values should be interpreted differently from CSS layout ideally. From a practical use case it might not be relevant.
<krit> Tav: the fact that everyone but Firefox does it as specced means it might not be worth changing anyway.
<krit> krit: in the future there are alternatives to kerning with CSS but positioning characters individually is still popular like for iWorks on the cloud.
<krit> AmeliaBR: The workaround would be to put the char positioning on the individual span elements directly rather than the top text element. Would help on display none.
<krit> AmeliaBR: I agree with your conclusion that we should follow the majority of implementations.
<krit> Tav: that is how it is speced in SVG2.
<krit> AmeliaBR: ...and follows previous resolutions.
<krit> proposed RESOLUTION: Do not change a previous resolution for character values with regards to display: none.
<krit> AmeliaBR: could you check if there might be issues on Firefox?
<krit> RESOLUTION: Do not change the spec for character values with regards to display: none.
<AmeliaBR> Here's a Firefox issue re display: none https://bugzilla.mozilla.org/show_bug.cgi?id=1141224
<AmeliaBR> Will need a new issue once the spec for unicode vs UTF-16 is ready.
AmeliaBR commented 6 years ago

Hi Tav. I'm reviewing the issues related to your open PR, and I noticed this relevant comment you made back in 2016:

The definition of "character" is from SVG 1.1. I believe it is meant to correspond to a Unicode code point. In terms of input, a 'u' with a combining '`' would be two points while using the precomposed 'ù' is one point. This has mostly to do with how the 'x', 'y', ... attributes are matched to the input.

Did you do any tests about whether browsers normalize these types of strings before assigning layout attribute values?

AmeliaBR commented 6 years ago

Ok, to answer my own question, here's a test: https://codepen.io/AmeliaBR/pen/72dceba63f82c5433ee6ca6f8be5304a/

The first two accented characters are single codepoint characters, the second two use combining characters, and the final one is just me stacking a whole bunch of combining characters together.

Results:

I'm going to make a firm argument that the Chrome/Safari behavior (shifting the combining accent relative to its base character) is wrong. But I'm not sure whether there is any interest in adopting the Edge behavior, which is probably more intuitive for authors, at least for the cases where the combining accent looks identical to a single codepoint version.
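The distinction under test can be sketched in a few lines of JavaScript (example strings only; this is not the codepen test itself):

```javascript
// Precomposed vs decomposed forms of the same user-perceived character.
const precomposed = "\u00E9";   // é as a single code point
const decomposed  = "e\u0301";  // e + U+0301 COMBINING ACUTE ACCENT

console.log(precomposed.length); // 1 (one code unit, one code point)
console.log(decomposed.length);  // 2 (two code units, two code points)

// The two render identically, and NFC normalization maps one to the other:
console.log(precomposed === decomposed);                  // false
console.log(precomposed === decomposed.normalize("NFC")); // true
```

So under code-point counting, the same visible glyph consumes either one or two values from the 'x' list depending on how the author's tooling happened to encode it.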

fsoder commented 6 years ago

From Blink's PoV I'd agree that the current behavior is incorrect, and I think we'd be happy to adopt the Edge behavior (assuming that is "per grapheme cluster".) I agree with the statement that that behavior is likely the most intuitive for authors.

AmeliaBR commented 6 years ago

In order to spec the Edge behavior, we'd need a clear way to define how characters (codepoints) get grouped into layout units. And it would need to be something that can be unambiguously defined solely by the character content, not based on a font. (Because we wouldn't want the assignment of layout attributes to characters to vary according to which font is used.)

Does the "typographic character" term as defined in CSS 3 & referenced in SVG 2 meet that requirement? Or does it also include font-specific groupings such as ligatures? @Tavmjong @fantasai

Maybe it would be better to directly reference Unicode character classes, e.g. to specify that combining/modifier codepoints get skipped over when assigning layout characters. The result might not be as smart about language-sensitive groupings, but it would be more clearly testable for consistent results between user agents.

AmeliaBR commented 6 years ago

Not an expert on Unicode, but I think what we'd want is the definition of a "base character":

Base Character. Any graphic character except for those with the General Category of Combining Mark (M). (See definition D51 in Section 3.6, Combination. [PDF]) In a combining character sequence, the base character is the initial character, which the combining marks are applied to.

So then the SVG rules would assign attribute values to the Unicode base characters in the text. Actual layout would still need special rules for ligatures and other clusters which are laid out as a whole based on font-specific rules.

Upside: This is a good balance of intuitive and unambiguous. Downside: There's no easy JS way (as far as I know) to identify how many "base characters" there are in a string.
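For what it's worth, ES2018 Unicode property escapes do now allow a rough approximation in JS. A sketch (the function name is hypothetical, and edge cases such as a string starting with a combining mark are ignored):

```javascript
// Rough sketch: count "base characters" by iterating code points and
// skipping anything with Unicode General Category Mark (M).
// This only approximates the Unicode definition quoted above.
function countBaseCharacters(text) {
  let count = 0;
  for (const ch of text) {            // for..of iterates by code point
    if (!/\p{M}/u.test(ch)) count++;  // skip combining/enclosing marks
  }
  return count;
}

console.log(countBaseCharacters("\u00E9"));  // 1 (precomposed é)
console.log(countBaseCharacters("e\u0301")); // 1 (e + combining acute)
console.log(countBaseCharacters("abc"));     // 3
```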

fsoder commented 6 years ago

I think using (extended) grapheme clusters (UAX#29) would be better in that case. They can be determined from code points. The "determine from JS" bit isn't solved there yet (I think...), but there are proposals [1] (and probably polyfills) to make that functionality available.

[1] https://tc39.github.io/proposal-intl-segmenter/
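That proposal has since shipped as Intl.Segmenter in the major engines. A minimal sketch of counting extended grapheme clusters with it (function name is illustrative):

```javascript
// Intl.Segmenter with granularity "grapheme" segments a string into
// extended grapheme clusters per UAX#29.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

function countGraphemes(text) {
  return [...segmenter.segment(text)].length;
}

console.log(countGraphemes("e\u0301")); // 1 (base + combining mark)
console.log(countGraphemes("abc"));     // 3
// Emoji ZWJ sequences (e.g. the family emoji, 7 code points)
// also form a single cluster:
console.log(countGraphemes("\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}")); // 1
```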

svgeesus commented 6 years ago

typographic character (from CSS Text 3 and SVG2) is identical to UAX29 extended grapheme cluster:

Unicode Standard Annex #29: Text Segmentation defines a unit called the grapheme cluster which approximates the typographic character. A UA must use the extended grapheme cluster (not legacy grapheme cluster), as defined in [UAX29], as the basis for its typographic character unit.

css-meeting-bot commented 6 years ago

The SVG Working Group just discussed Character counting on dx/dy properties, and agreed to the following:

The full IRC log of that discussion <krit> topic: Character counting on dx/dy properties
<krit> GitHub: https://github.com/w3c/svgwg/issues/537
<krit> chris: Correct behavior is what Edge does I think
<krit> chris: (describes Edge behavior as written in the issue)
<krit> chris: This is using CSS3 Text typographic characters.
<krit> chris: are all implementations agreeing to do what edge does.
<krit> Tavmjong: that requires that you have a library or some way of knowing what clusters go together
<krit> Tavmjong: So ppl with different libraries should have the same behavior.
<krit> Tavmjong: for predictability, using unicode characters might be more predictable.
<krit> chris: kind of
<krit> AmeliaBR: it is more author friendly to use typographic character but adds more complication to implementation
<krit> Tavmjong: I don't know of a library that would be able to do this right now.
<krit> AmeliaBR: many rendering implementations that need 3rd-party libraries are in the same position.
<krit> Tavmjong: we use Pango.
<krit> chris: I think Pango supports it. At least Freetype does that.
<chris> https://mail.gnome.org/archives/gtk-app-devel-list/2008-May/msg00083.html
<krit> Tavmjong: my guess is what Pango does is relying on the information in the font
<krit> chris: hm, not so sure if that is the case.
<chris> suggest asking Behdad
<krit> Tavmjong: I am not convinced of the Edge behavior right now. I don't think we can implement it right now.
<krit> Tavmjong: My testing showed that everyone was using unicode points.
<krit> AmeliaBR: Based on my browser testing, Edge is the only one using glyph clusters, FF uses unicode character but lays them out by glyphs, Blink and WebKit separate accents from their base characters.
<krit> RESOLUTION: Complex script should be rotated and moved together
<krit> Tavmjong: the question is about counting now.
<krit> krit: do you think you can check and get back to the WG with your implementation results?
<krit> Tavmjong: I think I can get some data.
<chris> I think Harfbuzz does the UAX29 segmentation https://lists.freedesktop.org/archives/harfbuzz/2015-September/005083.html
<krit> AmeliaBR: maybe ping a few other ppl on the issue for feasibility. I think FF uses Pango on some platforms too.
<krit> Tavmjong: that would be a changed behavior to SVG 1.1
<krit> chris: yes it would
<krit> AmeliaBR: we have inconsistency anyway.
<krit> Tavmjong: I'll look into it by next week
<krit> AmeliaBR: we already have resolutions on the simpler cases
dirkschulze commented 6 years ago

From Adobe's perspective we would prefer a definition that works cross specification. CSS3 typographic characters seems to make most sense.

Tavmjong commented 6 years ago

@r12a Could you comment on this issue? We are debating between two ways of counting characters:

  1. Using Unicode code points (as per SVG 1.1).
  2. Using Extended Grapheme Clusters (EGC) per UAX#29.

I'm concerned that (2) requires SVG renderers to be able to determine EGCs for all scripts in order to reliably apply the attributes.
Tavmjong commented 6 years ago

CSS 3 Typographic characters: https://www.w3.org/TR/css-text-3/#characters Note that Example 1, second point, gives a case where the Typographic Character is different depending on the operation (spacing vs. line-breaking). Which definition would apply to mapping attribute values?

css-meeting-bot commented 6 years ago

The SVG Working Group just discussed Character counting in text 'x', 'y', 'dx', 'dy', and 'rotate' attributes.

The full IRC log of that discussion <AmeliaBR> Topic: Character counting in text 'x', 'y', 'dx', 'dy', and 'rotate' attributes
<AmeliaBR> github: https://github.com/w3c/svgwg/issues/537
<AmeliaBR> Tav: After investigation, I'm even less comfortable switching to "extended grapheme cluster" or "typographic character" for counting, because it's just not clearly defined.
<AmeliaBR> ... Definitions can vary according to the particular use case, e.g. line breaking vs layout.
<AmeliaBR> ... Nice in principle, but needs a really expert approach.
<AmeliaBR> Chris: So how would authors handle e.g., combining accent characters? Would they need to insert, e.g., an extra 0 value in dx?
<AmeliaBR> Tav: Yes, as defined in SVG 1.
<AmeliaBR> Chris: It's a bit of a pain for authors, who often don't have transparency about whether the platform is using precomposed accents or not.
<AmeliaBR> ... Have you got a response back from Behdad?
<AmeliaBR> Tav: Not yet. I also pinged @r12a (Richard Ishida) on the issue.
<AmeliaBR> Amelia: Goal should be to balance best for linguistics with something that can be reliably implemented.
css-meeting-bot commented 6 years ago

The SVG Working Group just discussed Update on character counting for layout attributes on text/tspan elements, and agreed to the following:

The full IRC log of that discussion <krit> topic: Update on character counting for layout attributes on text/tspan elements
<krit> krit: Tav, saw you discussed on the mailing list?
<krit> Tavmjong: got a response. He said you can get the breaking points from Pango. Still not sure if that is the easiest way to do it. He wants to avoid the CSS wording because that depends on context.
<krit> Tavmjong: The spec says that breaking may depend on context. So you'd break at different points. Unicode might work but I'd not say this is the way to go.
<krit> krit: would you approve if we use CSS and ask to clarify what context awareness means?
<krit> Tavmjong: if we have a set of numbers that need to apply to the same groups of chars. And in CSS3 it depends on the context.
<krit> krit: it'd be great to understand what the context is
<krit> AmeliaBR: we had this issue with white space collapsing.
<krit> Tavmjong: In the email he says that he likes the Edge behavior better with the cluster selection. Pango returns an array of well-defined clusters in unicode.
<AmeliaBR> Behdad's reply: https://lists.w3.org/Archives/Public/www-svg/2018Sep/0018.html
<krit> Tavmjong: if we are to switch from unicode code points, it would be the thing to switch too.
<krit> krit: could it happen that :first-letter selector can have a different meaning on layout and rendering?
<krit> AmeliaBR: :first-letter has a different set of settings.
<krit> AmeliaBR: for layout it is different and predictability is more important.
<krit> Tavmjong: my suggestion would be to leave unicode code points and add a note with a request of comments from implementers.
<krit> AmeliaBR: the limitation of leaving it would be the inconsistencies, and we can not file bugs on browsers until we decide how to go forward.
<krit> krit: The CSS has more text experts... is that something we should bring it up there or is it completely independent of CSS and its definition of typographic characters?
<krit> Tavmjong: I think it is independent. We can not use typographic chars from CSS since you might break at different positions dependent on the context. That would be unpredictable and not consistent.
<krit> Tavmjong: so either use code points or Edge's behavior of clusters (which is not known in detail).
<krit> chris: surrogates are in UTF-16, and two code units together define one character, and older implementations do not understand this
<krit> AmeliaBR: this is how we even got into it
<krit> Tavmjong: only FF supports this but no one else.
<krit> AmeliaBR: we are going to file issues against specs. We need to decide on it to fix other issues.
<AmeliaBR> github: https://github.com/w3c/svgwg/issues/537
<chris> https://en.wikipedia.org/wiki/UTF-16#U+10000_to_U+10FFFF
<krit> krit: Maybe going with Tavs proposal and ask for browser input would unblock us for now.
<krit> krit: How can we bring this to their attention?
<krit> AmeliaBR: Tav, could you go through the text that may need changes and show how it would affect output if we are going to change?
<krit> Tavmjong: I can create a PR with the changes.
<krit> RESOLVE: Keep unicode code point for now until we get feedback from implementers. Keep previous resolution.
<krit> RESOLUTION: Keep unicode code point for now until we get feedback from implementers. Keep previous resolution.
<krit> AmeliaBR: Can it handle multi-byte characters and can it handle the 2nd issue?
<krit> chris: both *do* affect western text
<krit> Tavmjong: emoji is a good example
<krit> Tavmjong: some emojis use colors?
<krit> chris: exactly, you may need to combine characters.
<krit> Tavmjong: maybe a good way to test
<krit> Tavmjong: chris, could you send me an example with emojis? Then I'd create a test out of it.
<krit> chris: yes
r12a commented 6 years ago

Sorry to be late to the party. (Btw, if you add an i18n-tracking label to an issue, it should pop up in our WG daily notifications, so others may have seen while i was travelling. That may help next time.)

This is an interesting discussion. I don't think i have a clear answer for you, but i may be able to help a little. You may have found it useful to refer to some new material, added recently to one of our articles, that describes code points vs grapheme clusters vs typographic character units, however i think you probably understand most of that stuff now. Note in particular that i believe you have correctly identified that the CSS typographic character unit is very contextually dependent. Here are a few other thoughts from me, off the top of my head...

First, i think it was always a BAD MISTAKE to ever define strings in terms of UTF-16 code units. To apply offsets per Amelia's test you'd have to be aware of which characters were supplementary chars and which weren't in order to create a step of characters. The same applies for counting things, since you never want to separate the two UTF-16 code units that make up a single character.

If you go with grapheme clusters, users may still get some odd effects unexpectedly. Take the following example in Bangla: kshī (ক্ষি) is made up of two grapheme clusters. If you were creating Amelia's stepped character display, you'd end up with

[screenshot: the two grapheme clusters of ক্ষি stepped separately]

rather than all grouped together like

[screenshot: the sequence grouped together as a single conjunct]

The reason this isn't taken care of by Unicode grapheme cluster rules is that it's tricky. What constitutes a user-perceived character in this case depends on which script is being used, and to an extent on what the font does too, since it's only a single user-perceived character if the sequence forms a conjunct (ie. the glyphs are combined into a unit).

Apart from that, I'd certainly like to be able to highlight code points sometimes rather than grapheme clusters - eg. when colouring diacritics or other combining characters in educational material, or even sometimes when explaining grapheme clusters to be able to colour each component part differently!

Of course, one encounters similar problems with code points. The stepped character display would look even worse if it showed up as

[screenshot: each code point of the conjunct stepped separately]

On the other hand, if you wanted to explain to someone what characters make up that conjunct (perhaps with horizontal movement rather than vertical) this could be quite useful.

It seems to me that perhaps a stepped character display like Amelia's test would probably always need to be hand crafted, so that the right things stick together(?)

However, counting characters is perhaps something else. As i said before, i wouldn't want to use UTF-16 code units for counting, any more than i'd use bytes. I also think that grapheme cluster counts don't give enough precision for some use cases, and it's possible that the rules for what constitute a grapheme cluster may be extended too in the future. I think that code points are probably the best way to go.

As far as emoji go, here we are entering a world where the question of what constitutes a unit becomes even further complicated. This is because an emoji picture can be made up of many component parts. Perhaps a useful example can be found in the slides i just put together for Paris Web – see the juggling girl and family emojis at https://www.w3.org/International/talks/1810-paris/index.html#truncation.

[screenshot: the juggling girl and family emoji broken into their component parts]

I don't know how helpful all that is, but hopefully a little.

dirkschulze commented 5 years ago

I am including @litherum in this discussion. He participates in the process around the https://drafts.css-houdini.org/font-metrics-api Houdini specification. Maybe he has some additional feedback.

dirkschulze commented 5 years ago

Including the spec authors of Font Metrics API, @eaenet and @kojiishi, to this discussion as well.

litherum commented 5 years ago

We should standardize on either caret positions or grapheme clusters. Code units or code points are almost certainly wrong. There should be no difference between é and e + combining acute accent.

kojiishi commented 5 years ago

I agree with @litherum -- while there are some cases where people want to edit part of a grapheme cluster, as @r12a pointed out, we're trying to make most formatting and APIs applicable only to grapheme cluster boundaries.

litherum commented 5 years ago

The spec should probably include an example that shows how text-on-a-path works with Arabic. Honestly, it’s probably impossible to do it well.

AmeliaBR commented 5 years ago

@litherum That's not related to this issue, but for your reference:

The spec should probably include an example that shows how text-on-a-path works with Arabic.

There are figures with Arabic text-on-a-path illustrating the method options: https://svgwg.org/svg2-draft/text.html#TextPathElementMethodAttribute

Honestly, it’s probably impossible to do it well.

Not impossible, but few rendering agents have tried very hard.

No browser supports the "stretch" option for method (which guarantees smooth connections by distorting glyphs in addition to rotating them).

Stretching isn't strictly necessary, of course. For mildly-curving paths, the glyph overlap from the font should preserve the sense of connection. But, that doesn't mean you will get good results in current browsers.

I haven't tested recently, but there are some more demos from my SVG Text book here if you want to explore current rendering. As of the book's publication, Firefox was the only browser with typographically acceptable results. Chrome seems to still mess up the RTL ordering and use isolated forms instead of contextual ones. Don't have Safari to test with at the moment.


If you have any suggestions for spec improvements while looking into that, please do raise separate issues.

dirkschulze commented 5 years ago

@litherum are the mentioned caret positions something that was discussed in this issue under a different name, maybe? Could you please clarify?

@Tavmjong Could you be more specific what your concerns with UAX#29 with regards to scripts are?

litherum commented 5 years ago

Caret positions are just another text segmentation type, like “character,” “word,” or “line.”

dirkschulze commented 5 years ago

@litherum gotcha.

CSS UI defines the caret position in terms of characters. The same spec defines characters as:

Within the context of this definition, character is to be understood as extended grapheme cluster, as defined in [UAX29], and visible character means a character with a non-zero advance measure. https://www.w3.org/TR/css-ui-4/#caret-shape

Given that the caret is determined using EGC, typographic characters are defined using EGC, and the Font Metrics API is defined using EGC as well, it would make sense to refer to EGC for "character counting" in SVG text layout too, IMHO. In particular, caret position and layout position should be consistent, or we would see weird results.

litherum commented 5 years ago

On macOS and iOS, caret positions are not the same thing as EGC.

Using EGC for character counting is probably the best solution.

fantasai commented 5 years ago

I'm going to agree with @r12a here. Unicode code points are the right way to go for this particular application. The reason is that the coordinate-character pairing must remain stable across Unicode revisions, otherwise the SVG graphic will break over time; EGCs might be corrected in the future, as they are a user-facing construct for which correctness is more important than stability. Using code points will also be less likely to break existing SVG graphics, since it's backwards-compatible for any content that isn't using higher-plane code points.

You can allow the UA to ignore coordinates specified on any combining characters within an EGC or typographic grapheme cluster, whichever is larger. This limits what nonsense the UA needs to handle, but does not break the pairing counts.
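A rough sketch of that pairing rule (hypothetical helper, not spec text): values are paired with code points, so the pairing is stable, but any value landing on a combining mark is dropped rather than shifting later values.

```javascript
// Hypothetical sketch: pair x-values with code points, ignoring values
// assigned to combining marks (General Category M). The index still
// advances past ignored marks, so later pairings are unaffected.
function pairXValues(text, xValues) {
  const pairs = [];
  let i = 0;
  for (const ch of text) {            // iterate by code point
    const x = xValues[i++];
    if (x === undefined) break;       // no more values to assign
    if (/\p{M}/u.test(ch)) continue;  // UA ignores values on marks
    pairs.push({ ch, x });
  }
  return pairs;
}

console.log(pairXValues("e\u0301f", [10, 20, 30]));
// [ { ch: 'e', x: 10 }, { ch: 'f', x: 30 } ]
```

Note that 'f' still receives the third value (30), not the second: dropping the mark's value does not re-pair the rest of the list, which is the stability property argued for above.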

kojiishi commented 5 years ago

If we allow the UA to ignore operations within an EGC, and the EGC definition is corrected in the future, won't that change the rendering?

I'm fine with taking any encoding, as long as operations are limited to the EGC as a unit, so I'm not opposed. But if UAX#29 changes drastically in the future, it's not easy to maintain full backward compat regardless of what we choose here, for this feature or for any other features such as first-letter or line breaking. I think we should consider such changes as an improvement rather than a breakage.

litherum commented 5 years ago

You’re totally right, @kojiishi.

fantasai commented 5 years ago

@kojiishi It will change the rendering only if the author has assigned a coordinate for a codepoint within the EGC and a UA decides to change how it handles coordinates assigned to such characters. (It would not be required to make such a change, since the definition of typographic character unit is malleable, and whether a coordinate assigned within the typographic character unit is honored would be up to the UA.)

However, such a change would only affect that particular EGC; it would not re-pair all the coordinates after it with different bits of the text, which is what would happen if you count by EGC or anything else that changes over time. This is what's important: to ensure the pairing between the list of coordinates and the text remains constant over time.

litherum commented 5 years ago
  1. Humans, in general, don't understand code points. Even programmers don't generally understand code points. Unicode describes grapheme clusters as "user-perceived characters" which is what someone would actually want when they tell the computer to place a thing.
  2. Performing Unicode normalization should not result in different behavior. An implementation detail of the author's particular keyboard shouldn't move text around. There needs to be no visual distinction between é and e + combining acute accent. They are equivalent.
  3. Code points are associated with Unicode's encoding, but user-perceived characters are, conceptually, not. A significant amount of the web is authored in non-unicode encodings, and they may not even know how their text maps to Unicode. They will know how user-perceived characters work, because they are a user and they perceive characters.
  4. Fonts describe shaping rules about how to place code points (represented as glyphs that correspond to code points). A font describes, for example, how to place the combining vowel marks in Arabic, and how to assemble each of the family members in an emoji like 👨‍👩‍👧‍👦. Each font places them differently, and having the SVG content place these combining marks would make these kinds of texts unreadable if a new font is used (or would break up a family?).
  5. If grapheme cluster boundaries change drastically, we have bigger problems than just the x= attribute in SVG content. Indeed, when Unicode wanted to make a substantive change to grapheme cluster boundaries, they didn't change the old behavior, they just added a new type of grapheme cluster - "legacy grapheme clusters" vs "extended grapheme clusters."
  6. Do we have use counters for more than one entry in these position attributes? If the argument is about compatibility, we need to know how much content there is that could possibly be broken.
  7. I do understand the argument about forward compatibility, but I can't differentiate the idea of "EGCs might be corrected in the future" from any other progressive behavior change we make in browsers. In general, causing a behavior change via improvements to the browser is a good thing.
fantasai commented 5 years ago
  1. Humans, and computers, also have trouble defining "user-perceived characters". That's why we have the term “typographic character unit” in CSS3 Text: it has to mean different things in different contexts, so pegging it to EGCs isn't feasible.
  2. If normalization is a concern, normalize the string first. The Web platform as a whole has given up on normalization for string matching, but it would probably be fine to do here for counting purposes.
  3. Yes, EGCs probably give a better UX here, but stability across time is more important than stability across encodings.
  4. Right. Which is why the UA would be allowed (or possibly required) to ignore values that are paired with a combining mark.
  5. Right. Because they had some concerns about backwards compatibility. We shouldn't exacerbate the situation and give them more constraints they need to worry about next time they need to fix the grapheme cluster spec.
  6. Agree that use counts would be nice to have.
  7. The progressive improvements we make don't break pairing. They don't change, e.g. whether a CSS declaration is assigned to a particular div or its next sibling. Making this depend on grapheme clusters would create that kind of instability. I don't think it's a good idea.
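To make the size of the disagreement concrete, here is a quick sketch (runnable in Node 16+ or any modern browser) of how the three candidate units count the same strings; the two strings are just illustrative examples from this thread:

```javascript
// Compare UTF-16 code units, Unicode code points, and extended
// grapheme clusters for two strings from this discussion:
// "e" + combining acute accent, and the family emoji.
const samples = [
  "e\u0301",                                                // e + U+0301
  "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}", // 👨‍👩‍👧‍👦
];
const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
for (const s of samples) {
  console.log({
    utf16Units: s.length,                  // 2, then 11
    codePoints: [...s].length,             // 2, then 7
    graphemes: [...seg.segment(s)].length, // 1, then 1
  });
}
```

Any of the three could be used to pair x= values with text; they only agree on simple strings such as plain ASCII.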
fantasai commented 5 years ago

Just to clarify, the proposal I'm backing here is:

  • Use code points for counting, because they are stable and pairing must remain stable. This gives us both forwards and backwards compatibility (except for surrogate pairs, which would not be backwards-compatible).
  • UAs (may/must?) ignore coordinates associated with combining characters (i.e. not the base character) in any typographic character unit.

Additionally, I would suggest that SVG add a no-op symbol to the coordinate list syntax, which would assign the character a position relative to the previous character (as dictated by the font), so that the author doesn't have to give a coordinate for each codepoint.

AmeliaBR commented 5 years ago

A reminder that SVG already has rules for font-specific combined characters, whether those are traditional ligatures or combined emojis. They are supposed to be positioned as a single glyph-cluster, and coordinates assigned to additional characters in the combination sequence are either ignored (if they were absolute values) or accumulated in the next independently-positioned character (if they were relative adjustments).

This prevents the rest of the text layout from breaking completely if you switch between fonts (or layout engines) that do or don't support the ligature. The assignment of coordinates in the list to characters in the text must never be affected by the font used.
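A minimal sketch of that accumulation rule for relative adjustments (dx), assuming a hypothetical `isClusterStart` predicate supplied by the text engine (this is an illustration of the rule, not the normative algorithm):

```javascript
// Relative adjustments (dx) that fall on non-initial characters of a
// ligature/cluster are not dropped: they accumulate into the next
// independently positioned character, per the rule described above.
// `isClusterStart` is a hypothetical predicate from the text engine.
function assignDx(chars, dxList, isClusterStart) {
  const out = new Array(chars.length).fill(0);
  let carry = 0;
  for (let i = 0; i < chars.length; i++) {
    const dx = dxList[i] ?? 0;
    if (isClusterStart(i)) {
      out[i] = dx + carry; // apply any deferred adjustments here
      carry = 0;
    } else {
      carry += dx; // defer to the next cluster start
    }
  }
  return out; // a trailing carry with no following start is dropped
}
```

Absolute values (x=) on non-initial characters would simply be ignored rather than carried, per the same comment.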

Re usage & breaking changes:

Much of the usage of this feature comes from visual editors, which allow designers to manually tweak kerning, or set letter positioning absolutely, & the software converts that to coordinate sequences. So the software will know whether it is using a pre-composed accent or a set of combining characters, and will add the correct number of values accordingly.

That said—as one of the rare people who does write these types of attributes by hand, it would be nice if I could look at text & know how many values are required.

Additionally, I would suggest that SVG add a no-op symbol to the coordinate list syntax, which would assign the character a position relative to the previous character (as dictated by the font), so that the author doesn't have to give a coordinate for each codepoint.

This would be a nice addition to the absolute position attributes (for the relative position attributes, you can already use 0 for this result). Not sure what that would look like.

litherum commented 5 years ago
  • Use code points for counting, because they are stable and pairing must remain stable. This gives us both forwards and backwards compatibility (except for surrogate pairs, which would not be backwards-compatible).
  • UAs (may/must?) ignore coordinates associated with combining characters (i.e. not the base character) in any typographic character unit.

This is defining a new type of Unicode segmentation, which should be done in Unicode, not CSS. Web engines should not be in the business of defining their own segmentation types, because Web engines are not the only software with forward-compatibility concerns.

My intuition is that a segmenter that's just like code points except for combining characters would be too naive. Bringing this to Unicode would get the necessary domain experts to provide feedback on this type of segmenter. The most obvious counter-example is the family emoji, where the family members are not combining marks, but I will do some additional research to determine the relationship between this new segmenter type and EGCs.

SVG already has rules for font-specific combined characters

Neither code points, nor EGCs, nor this new segmentation type are font-dependent.

r12a commented 5 years ago

I think i'm beginning to understand that the question here is not really about simply counting characters, but rather about segmenting text for display (right?).

I pulled together an exploratory test at https://w3c.github.io/i18n-tests/quick-tests/svg-counting/svg-counting-001 where i tried to think of the various permutations of characters that you might expect to glob together. (If you scroll down on the test page, there is a key.) Here are the results, when testing on my Mac/Windows10 machines:

[Screenshots of the test rendered in Firefox, Chrome, Safari, and Edge]

Chrome and Safari step codepoint by codepoint.

Firefox appears to segment on grapheme cluster boundaries, which means that the first conjunct is split (ugly). However, it does keep the repha with the later conjunct. It also keeps the Burmese characters together, although there are two grapheme clusters involved.

Edge looks best to me. It keeps together everything i would have expected.

All browsers keep the Arabic lam-alif ligature together.

r12a commented 5 years ago

Edge also keeps the stepping clean, whereas Firefox leaves blank spaces after some combining characters.

r12a commented 5 years ago

I also added a test for cursive arabic at https://w3c.github.io/i18n-tests/quick-tests/svg-counting/svg-counting-002. I didn't apply any direction to the string.

All browsers split each letter, but Firefox & Edge retained the joining forms of the glyphs. Chrome and Safari used isolated glyph forms. FF/Edge cascaded RTL; Chrome/Safari, LTR.

[Screenshots: the Arabic test rendered in Firefox/Edge and in Chrome/Safari]

litherum commented 5 years ago

If Edge has the best behavior, do we know what type of segmenter they are using? @atanassov

dirkschulze commented 5 years ago

As far as I understood, @r12a, @Tavmjong and @fantasai support Unicode code points because they are more robust and stable.

The preferred output, on the other hand, would be what Edge does, which seems closer to EGCs, though we are not sure about that. @litherum and @kojiishi support the latter approach, even if the output might change/get corrected over time.

I didn't see many authoring tools using multiple values for x, y, dx or dy. So maybe we can assume that current content gets hand-edited for the most part?

For hand-edited content, do we expect that stability is more important than semantic correctness, given that authors went to the effort to actually align the "chars" correctly? Note that Chrome/WebKit seem to do char mapping by Unicode code points for interfaces like getNumberOfChars() as well (even though that is not what is currently specced). For hand-edited content it might be better if those align?

It would be great if we can get to a compromise on this issue and be able to close it.

litherum commented 5 years ago

I’d love to know if multiple values in x, y, dx, or dy is rare enough that we can just .... stop supporting it. 🤭

dirkschulze commented 5 years ago

@litherum IIRC, Apple's online office suite did use SVG and position glyphs heavily. Not sure if this is still the case or how that team worked around interoperability issues.

Even in Adobe Illustrator I could imagine that we would use it in the future to reduce the size of SVG files (reduced SVG file size is in high demand).

r12a commented 5 years ago

As far as I understood, @r12a, @Tavmjong and @fantasai support unicode code points because they are more robust and stable.

For general counting purposes, code points are often the best choice, since they're more reliable.

For automatically segmenting text in order to display it, i think i prefer Edge's behaviour, which seems to be using grapheme clusters augmented with tailoring rules to capture full indic conjunct-based orthographic syllables in Devanagari. (I don't know what secret sauce they are using, but i added an extra test, and note that it also avoids treating Tamil consonant clusters as a unit (which is good, see https://github.com/w3c/iip/issues/18 for details, if you need them)).

Tavmjong commented 5 years ago

@litherum Inkscape makes heavy use of multiple values.

dirkschulze commented 5 years ago

If we cannot get to an agreement, we could explicitly leave the exact character-counting algorithm unspecified. In that case we would (ideally) hint at which output is preferred.

However, I really hope we get to a resolution that eventually gets implemented interoperably.

r12a commented 5 years ago

Thinking more about this, i came across another potential issue. Complex scripts use things like RLI (specifies base direction for bidi text), ZWNJ (stops cursive joining in scripts like Arabic & breaks conjuncts in scripts like Devanagari), FVS (applies a specific variant shape to a Mongolian letter). These are all invisible characters in Unicode, and i believe that none of them are combined with other characters when grapheme cluster segmentation takes place.

If we are creating offsets by automatically counting grapheme clusters or code points, the result would presumably be a gap where one of these characters appears. Perhaps one way to deal with that is to establish an exclusion list for this type of character, however i don't know whether or not that would come with its own problems.
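One possible shape for such an exclusion list, sketched in JavaScript (the set of code points below is an illustrative guess, not an agreed list):

```javascript
// Skip invisible format characters when pairing coordinates with
// text, so they don't consume an x=/dx= slot. The set here is an
// illustrative guess at such an exclusion list, not a proposal.
const INVISIBLE = new Set([
  0x200C,                 // ZWNJ
  0x200D,                 // ZWJ
  0x2066, 0x2067, 0x2068, // LRI, RLI, FSI
  0x2069,                 // PDI
  0x180B, 0x180C, 0x180D, // Mongolian FVS1-3
]);
function addressableCodePoints(str) {
  const out = [];
  for (const ch of str) { // for..of iterates by code point
    if (!INVISIBLE.has(ch.codePointAt(0))) out.push(ch);
  }
  return out;
}
```

This only addresses the counting side; the layout effects of the skipped characters would still have to apply to the adjacent characters.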

Whatever is done wrt spatial placement of glyphs, the effects of those characters need to be applied to the appropriate adjacent character.

Here's another test that looks at what browsers do with the invisible characters above.

Firefox leaves a gap for ZWNJ and FVS, but not for LRE/PDF, and does apply the expected effects, except for bidi reordering (the AB is in the wrong place). Chrome leaves gaps for all, and doesn't apply expected effects for ZWNJ and FVS, but does for bidi. Edge leaves a gap for ZWNJ only, and applies expected effects for ZWNJ and FVS, but not bidi.

r12a commented 5 years ago

Looking at the same test apart from the invisible characters, this test shows some interesting bidi behaviours.

  1. The right-to-left cascading seen in Edge and Firefox looks wrong to me. Even though these are RTL characters, i would have expected the offsets to all be from left to right. The changes in direction for "ab" seem odd and unwarranted.
  2. The order of characters and the placement of the 'ab' relative to the rest of the Hebrew text is different between Firefox/Edge and Chrome.
r12a commented 5 years ago

I don't know what to suggest. This segmentation issue is more of a problem for this feature than for some other contexts, due to the fact that visual placement and separation is involved. Some options that come to mind are:

  1. Add a note to say that automatically positioning parts of a text string in this way is likely to be problematic for complex scripts, and authors may not be able to use this feature effectively much of the time for those scripts. That doesn't seem a very satisfactory solution for users.
  2. Change the syntax, so that the string in the text element becomes a list, where the content author groups things together as they want, rather than trying to figure things out automatically. That's probably not a welcome change to the spec.
  3. Do some more in-depth research around the segmentation process across various scripts to understand whether there is a standard set of rules that can be applied here so that things just work (Edge seems to get close already, but we need to do further testing, and find out what their secret sauce is.) This could take a while.
BigBadaboom commented 5 years ago

Is it maybe time to assign this to the too-hard basket, and go with @r12a's options 1 and 2? That is, deprecate the multi-value attributes, and instead recommend that authors use <tspan> for the situations when the code point algorithm fails?

However that leaves a problem regarding the rotate attribute. I know the text layout algorithm is already quite complicated, but one possible solution would be to allow transform on <tspan> elements.

https://codepen.io/PaulLeBeau/pen/qQQPag

fantasai commented 5 years ago

I don't think this is too hard. You pair off coordinates and characters using codepoints, since those are stable for counting, and when there are multiple codepoints that belong to a single typographic character unit, you render them together like Edge does (ideally), handling ignored coordinates the same way ligatures are handled per @AmeliaBR’s comment above.
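That pairing rule can be sketched using Intl.Segmenter as a stand-in for whatever segmentation the spec would ultimately name (an assumption for illustration, not what the spec says):

```javascript
// Pair x= values with code points (stable over time), but let only
// the value paired with the first code point of each grapheme
// cluster take effect; values paired with later code points in the
// cluster are ignored.
const graphemes = new Intl.Segmenter("en", { granularity: "grapheme" });
function effectiveX(text, xList) {
  const placed = [];
  let cp = 0; // running code-point index into the text
  for (const { segment } of graphemes.segment(text)) {
    placed.push({ segment, x: xList[cp] }); // may be undefined: no x given
    cp += [...segment].length;              // skip the rest of the cluster
  }
  return placed;
}
```

Note that a later correction to cluster boundaries would change which values are ignored, but never which value is paired with which code point.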

fantasai commented 5 years ago

See Tav's proposal in https://github.com/w3c/svgwg/issues/260#issuecomment-414432723 ... this is what we should do (leaving codepoint vs UTF-16 byte pair to be sorted out by the SVGWG in consideration of compat).