Itemization - Githubissues

PeterConstable commented 3 years ago

During the last FTCG call, John Hudson suggested, as a pilot, that a project be started to write a spec for itemization. Unicode was suggested as an organizational venue for that, and I was asked to check if Unicode would be open to that. There is openness to that, though with a question as to whether there are enough UTC participants interested in volunteering time for this or other text display-related topics.

So, I've been giving more thought to the topic of itemization. I'd like to get a sense of the scope of the problem: what is the current gap that needs solving? I've suggested some problem statements below; I'd like comments on these.

During the discussion in the last call, itemization was mentioned as being relevant for shaping but also for font fallback. I've seen some describe itemization as related only to font fallback, e.g., here, but that is not correct. For example, Uniscribe's ScriptItemize() function is explicitly intended for shaping purposes and must be called before making calls to ScriptShape(). Similarly, in DWrite, IDWriteTextAnalyzer::AnalyzeScript() does itemization and must be called before making calls to IDWriteTextAnalyzer::GetGlyphs() (DWrite's equivalent to Uniscribe's ScriptShape()).

Also, I got the sense from John that his concern was related to shaping. So, I want to focus on shaping for the moment.

In relation to shaping, it occurs to me that itemization is probably an OpenType-specific thing, and that it isn't relevant for shaping with AAT or Graphite. For instance, while I haven't worked with CoreText, I've glanced through docs (e.g., CTLine) and haven't come across any functions that return items—though that doesn't rule out something done internally. Still, it seems like neither should need itemization for shaping. Would someone more familiar with CoreText or Graphite be able to confirm?

In relation to OpenType shaping, one of the itemization issues has to do with how data is organized by script in the GSUB and GPOS tables: itemization will segment the text into script runs, and shaping engines can use that script assignment (or always do?) to determine what script tag in the font is used to access features and lookups. In this regard, I see the following potential problems affecting interoperability between fonts, applications, and content from different sources:

Problem: If a font developer doesn't know how a shaping engine will determine what script tag it will use for a given string to retrieve feature/lookup data from the font, it's not clear to the font developer (or font tool developer) how the font data should be organized. This hinders interoperability between fonts, applications, and content from different sources.

Problem: If different layout engines have different logic for how the script tag is selected, then the same font and content could display differently in different applications. This breaks interoperability between fonts, applications, and content from different sources.
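To make the script-run and script-tag relationship above concrete, here is a minimal sketch of one possible itemizer. Everything in it is an assumption made for illustration: the tiny `SCRIPT_RANGES` table stands in for the real Unicode Script property, the rule "Common characters join the preceding run" is only one of several policies an engine might use, and the function names are hypothetical. The tags `latn`, `arab` and `DFLT` are registered OpenType script tags.

```python
# Toy stand-in for the Unicode Script property (assumption: a tiny subset only).
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latin"),   # A-Z
    (0x0061, 0x007A, "Latin"),   # a-z
    (0x0620, 0x063F, "Arabic"),  # Arabic letters
    (0x0641, 0x064A, "Arabic"),  # Arabic letters (skips U+0640, which is Common)
]

# Script name -> OpenType script tag used to look up GSUB/GPOS data.
OT_TAGS = {"Latin": "latn", "Arabic": "arab", "Common": "DFLT"}


def script_of(ch):
    """Return the toy script of one character, defaulting to Common."""
    cp = ord(ch)
    for lo, hi, name in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return name
    return "Common"


def itemize(text):
    """Split text into (start, end, script) runs.

    Policy (an assumption, not a standard): Common characters extend the
    current run; a leading Common run adopts the first real script after it.
    """
    runs = []
    for i, ch in enumerate(text):
        s = script_of(ch)
        if runs and (s == "Common" or runs[-1][2] == "Common" or s == runs[-1][2]):
            if runs[-1][2] == "Common":
                runs[-1][2] = s  # leading Common run adopts this script
            runs[-1][1] = i + 1
        else:
            runs.append([i, i + 1, s])
    return [tuple(r) for r in runs]


def script_tag(script):
    """Map a run's script to the tag a shaper might look up in the font."""
    return OT_TAGS[script]


sample = "abc, \u0628\u0629."
for start, end, script in itemize(sample):
    print(start, end, script, script_tag(script))
```

Under this policy the comma and space land in the Latin run and the final period lands in the Arabic run. An engine that assigned them differently would look up lookups under a different script tag, which is the interoperability concern raised in the two problem statements.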

Itemization also affects how runs of text get routed to different shaping engines. Since different shaping engines will have different logic (else they wouldn't be distinct engines), the choice of engine could affect how certain character sequences get processed. This would particularly apply to script-neutral characters, though it could perhaps also affect script-specific characters in some cases.

Assuming the logic used in different shaping engines is known (that's a separate issue, but assuming for now), I see the following potential problems affecting interoperability between fonts, applications, and content from different sources:

Problem: If a font developer doesn't know which shaping engine would be used for a given string, then it may not be clear how features and lookups should be written to obtain desired results.

Problem: If different layout engines route certain strings to shaping engines in different ways, then the same font and content could display differently in different applications.

Now, these latter two assume there are cases in which a given character sequence might get routed to different shaping engines and would be shaped differently by the different engines. It's a conceptual possibility, though I don't know offhand if there are actual cases in which this might happen. (Certainly script-neutral characters can get routed to different engines if they are merged with strong-script characters, but are there cases in which they would get shaped differently by the different engines?) If anyone can provide examples, that would be useful.

Would others agree that the problems suggested above are real problems for which an itemization spec might be useful? Are there other problems related to shaping that might be called out?

PeterConstable commented 3 years ago

it occurs to me that itemization ... probably ... isn't relevant for shaping with AAT or Graphite.

Graphite documentation mentions the ScriptTag variable that can be used in the Graphite Description Language, but it doesn't mention what values need to be used (or if it matters), or how that variable gets used.

tiroj commented 3 years ago

Very good summary, Peter.

The biggest concern for me is not knowing how different layout processes will segment glyph runs, and hence whether some substitution or positioning lookups will be applied or not. Discrepancies are mostly with script=Common characters such as punctuation in various kinds of relationship to strong-script characters, with some more complex cases around the interaction of bidi layout and run segmentation.

khaledhosny commented 3 years ago

The most frequent issues I have had with script itemization are related to Common code points. For example, in one font I have an Arabic-specific form of the period (.), substituted using the locl feature. It works fine most of the time, except after digits. Some implementations do script itemization on individual bidi runs, and as such the period, being in a run of its own, does not get assigned the Arabic script, so the substitutions are not applied.

Similarly, applying script-specific substitutions to digits is not reliable.

Handling of paired punctuation is different across implementations.
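The period-after-digits case described above can be sketched as follows. This is an illustration under stated assumptions, not a description of any particular engine: `script_of` is a toy lookup, the function names are hypothetical, and the bidi runs are written by hand as the runs a bidi implementation might plausibly produce for an RTL paragraph (Arabic text, then digits at a higher embedding level, then the trailing period back at the paragraph level).

```python
def script_of(ch):
    """Toy lookup (assumption): only Arabic letters are non-Common here."""
    cp = ord(ch)
    if 0x0620 <= cp <= 0x063F or 0x0641 <= cp <= 0x064A:
        return "Arabic"
    return "Common"  # digits, space and '.' are Common in this sketch


def resolve(text, start, end):
    """Assign a script to each character of text[start:end].

    Common characters inherit from the nearest preceding real script inside
    the span; leading Commons inherit from the following one. If the span
    holds no real script at all, everything stays Common.
    """
    scripts = [script_of(c) for c in text[start:end]]
    last = None
    for i, s in enumerate(scripts):          # forward pass: inherit from before
        if s == "Common":
            if last is not None:
                scripts[i] = last
        else:
            last = s
    nxt = None
    for i in range(len(scripts) - 1, -1, -1):  # backward pass: leading Commons
        if scripts[i] == "Common":
            if nxt is not None:
                scripts[i] = nxt
        else:
            nxt = scripts[i]
    return scripts


def itemize_whole(text):
    """Script itemization over the whole paragraph."""
    return resolve(text, 0, len(text))


def itemize_per_run(text, bidi_runs):
    """Script itemization done separately inside each bidi run."""
    out = []
    for start, end in bidi_runs:
        out.extend(resolve(text, start, end))
    return out


text = "\u0628\u0629 12."            # Arabic word, space, digits, period
bidi_runs = [(0, 3), (3, 5), (5, 6)]  # hand-written runs (assumption)

print(itemize_whole(text)[-1])               # script given to the period
print(itemize_per_run(text, bidi_runs)[-1])  # script given to the period
```

When the whole paragraph is itemized, the period inherits Arabic from the preceding text. When each bidi run is itemized on its own, the period's run contains no strong-script character, so it stays Common, and a shaper that selects lookups by script would then not reach an Arabic-specific locl substitution.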

NeilSureshPatel commented 3 years ago

Thanks for the summary Peter.

My experience is similar to @tiroj and @khaledhosny. The issues I have seen are related to script-common code points, in particular punctuation. Though this may now be irrelevant, since Adobe is moving away from their old shaping engine, I have seen the presence of Arabic punctuation alter how word boundaries are recognized. In this case, final forms that precede punctuation in a connected script like N'ko or Adlam are substituted by medial forms. As discussed in the meeting, this may be related to N'ko and Adlam script tags not being universally recognized.

In an Arabic font, we have Latin-specific forms for most punctuation that are substituted by a locl feature. In different implementations these are not reliably applied. We often implement a script-agnostic stylistic set as a poor workaround for users who are savvy enough to select it and are using a page layout program that supports it.

NorbertLindenberg commented 3 years ago

Is there any particular reason why this issue got separated from #37?

NorbertLindenberg commented 3 years ago

Re “itemization”: I’d prefer the word “segmentation”, because it more clearly refers to a whole (here a sequence of characters) being divided into pieces, and because the Unicode standard already uses it to define several ways to break strings (see UAX 29). Either word needs a modifier to clarify by which criteria we break the text.

Re CoreText: CoreText supports both OpenType and AAT fonts. When rendering with an OpenType font, it has to apply script and cluster segmentation just like any other OpenType implementation. When rendering with an AAT font, it doesn’t, as far as I can tell as a user.

tiroj commented 3 years ago

Re “itemization”: I’d prefer the word “segmentation”,

I have used both together: itemisation is the analysis; segmentation is the outcome.

mhosken commented 3 years ago

Just a minor clarification on the Graphite script question. Yes, Graphite does have the ability to search for a sequence of passes based on script, and therefore to hold more than one set, but we have never implemented such a search. The engine only ever works with the first set of passes specified.

Nobody has ever complained about this because nobody has made a font with more than one set of passes. Looking forward, we also don't see this being a major issue given that we have a mechanism to work out which passes to run based on the contents of the string being shaped. So passes for different scripts can be intermingled in the same sequence with little added cost, and that can help resolve the common sequence question.

The common sequence question is: where does one place a script boundary in a sequence of characters from one script, followed by some common characters followed by characters from another script?
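As a small illustration of why the common sequence question matters, the sketch below merges per-character script labels into runs under two different policies. The function name, the `commons_go` parameter and the two policy names are hypothetical, and real engines may use more elaborate rules (for example, special handling of paired punctuation).

```python
def split_runs(scripts, commons_go="preceding"):
    """Merge per-character script labels into (start, end, script) runs.

    commons_go="preceding": Common characters join the script before them.
    commons_go="following": Common characters join the script after them.
    Commons with no neighbour on the chosen side stay Common.
    """
    if commons_go not in ("preceding", "following"):
        raise ValueError("commons_go must be 'preceding' or 'following'")
    resolved = list(scripts)
    n = len(resolved)
    order = range(n) if commons_go == "preceding" else range(n - 1, -1, -1)
    last = None
    for i in order:
        if resolved[i] == "Common":
            if last is not None:
                resolved[i] = last  # inherit from the chosen neighbour
        else:
            last = resolved[i]
    runs = []
    for i, s in enumerate(resolved):
        if runs and runs[-1][2] == s:
            runs[-1] = (runs[-1][0], i + 1, s)  # extend the current run
        else:
            runs.append((i, i + 1, s))
    return runs


# One script, then two Common characters, then another script.
labels = ["Latin", "Latin", "Common", "Common", "Arabic", "Arabic"]
print(split_runs(labels, "preceding"))
print(split_runs(labels, "following"))
```

The same input yields a boundary at index 4 under one policy and at index 2 under the other, so the two Common characters would be shaped with different script-specific lookups depending on which policy an implementation chose.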

w3c / font-text-cg

Itemization #44