Universal Shaping Engine data files

simoncozens commented 3 years ago

The data tables which drive the USE are derived from (but not supplied by) Unicode, and are used by (but not exactly specific to) OpenType. (When I say they are not exactly specific to OpenType, there is no OT-related information in those USE categories, and in theory they could be useful inputs to another shaping technology.)

So they're defined by neither one nor the other; they exist at the interface between text encoding and font formats - which is precisely the space that this CG is intended to cover. Would it make sense for this CG, or a group sponsored by it, to maintain and expand this data?

We need to maintain a more informative and more flexibly structured database for Unicode Indic chars’ identities, to properly support both the UCD properties InSC/InPC and the USE spec. Keeping piling up USE overrides while not being able to record nuances in the UCD is scary.
— 梁海 Liang Hai (@lianghai) October 7, 2020

lianghai commented 3 years ago

USE as a scope/target is too small, as it doesn’t currently handle the more widely used (and thus better understood) Indic scripts. A more appropriate scope would be at least enclosing what the https://github.com/n8willis/opentype-shaping-documents project is about (cc @n8willis).

In my mind, it’s also problematic to put too much emphasis on font implementation technologies: USE and the like, although not very specific to OTL, are still overly restrictive. The database I have in mind needs to be lower level, closer to the Unicode Standard, but informative enough to record all the data needed for exporting the data files that USE, etc., require.

simoncozens commented 3 years ago

A more appropriate scope would be at least enclosing what the https://github.com/n8willis/opentype-shaping-documents project is about (cc @n8willis).

I agree, and that's why I started #11. Full shaping documentation is the scope for that project. I still think USE - or at least "USE-like data" - is worth thinking about separately. For one thing, having a small and well-defined scope for a project is a good thing!

There's obviously a set of information which is required for shaping, which UCD doesn't provide because it's too shaping-specific and OFF doesn't provide because it's dependent on Unicode. It makes sense for an intermediary group with liaisons on both sides to manage that data.

lianghai commented 3 years ago

I agree, and that's why I started #11. Full shaping documentation is the scope for that project.

Umm, there’s probably some misunderstanding then. In my understanding, the https://github.com/n8willis/opentype-shaping-documents project is for documenting and specifying shaper behavior (how fonts are shaped), while my tweet was talking about the encoding level, where each character’s identity is defined and maintained, so the expectation of how characters should be shaped is clear. (The bridge between the two, ie, how to produce fonts for the characters to be shaped by shapers, is the http://github.com/typotheque/text-shaping project’s scope.)

Encoding is supposed to be the fixed target here, because the Unicode Standard is all about universality, interchangeability, etc. Shaper specifications like the USE are just implementations of expected shaping behavior (which for many scripts is currently only implied by the Unicode Standard), and are distinct from how the characters’ identities are defined. My tweet was exactly about how

The data tables which drive the USE are derived from (but not supplied by) Unicode, and are used by (but not exactly specific to) OpenType.

Would it make sense for this CG, or a group sponsored by it, to maintain and expand this data?

Now having had a second look, I’m not quite sure what exact data you were talking about… I assumed you were talking about what is derived from the UCD InSC/InPC property values + USE’s overrides under USE’s InSC/InPC-based character classification rules. If you meant that, yes, that’s not specific to OT/OTL, but it is specific to USE, and the USE is a rather specific take on how Unicode characters should be shaped. USE’s character classification rules all exist to support its shaping processes.
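
To make the "UCD InSC/InPC values + overrides + classification rules" pipeline concrete, here is a rough sketch in Python. It is only an illustration under stated assumptions: the code points are Private Use placeholders, the property values assigned to them are made up rather than taken from the UCD, and the rule table is a tiny simplified stand-in for USE's actual classification rules. The names `UCD_PROPS`, `OVERRIDES`, `RULES`, and `classify` are hypothetical.

```python
from typing import NamedTuple


class IndicProps(NamedTuple):
    insc: str  # Indic_Syllabic_Category value
    inpc: str  # Indic_Positional_Category value


# Placeholder data: Private Use code points with made-up property values.
UCD_PROPS = {
    0xE000: IndicProps("Consonant", "NA"),
    0xE001: IndicProps("Vowel_Dependent", "Top"),
    0xE002: IndicProps("Vowel_Dependent", "Left"),
}

# Hypothetical shaper-side overrides that replace the UCD values.
OVERRIDES = {
    0xE002: IndicProps("Vowel_Dependent", "Top"),
}

# Simplified stand-in for (InSC, InPC) -> shaping category rules.
RULES = {
    ("Consonant", "NA"): "B",
    ("Vowel_Dependent", "Top"): "VAbv",
    ("Vowel_Dependent", "Left"): "VPre",
}


def classify(codepoint: int) -> str:
    """Return a shaping category: overrides win over UCD, unknowns become "O"."""
    props = OVERRIDES.get(codepoint, UCD_PROPS.get(codepoint))
    if props is None:
        return "O"
    return RULES.get((props.insc, props.inpc), "O")


if __name__ == "__main__":
    for cp in (0xE000, 0xE001, 0xE002, 0xE003):
        print(f"U+{cp:04X} -> {classify(cp)}")
```

Note that the override silently replaces what the UCD says about the third placeholder, and the reason for it is recorded nowhere.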

Meanwhile, on the UCD InSC/InPC side, these properties are still largely an exercise and a bookkeeping method for noting down some relevant information about an Indic character’s properties. InSC/InPC were groundbreaking work, but they’re still limited by a slightly legacy mindset, and the data structure doesn’t allow them to capture all the information we want to record.

This is why I wrote that tweet: InSC/InPC are too rigid, while USE’s data files are too oriented toward a specific implementation. Therefore, whatever data files people outside of the USE want to maintain shouldn’t be designed for USE. The missing piece is on a much lower level, about encoding, and the data files should aim to record enough nuances that the UCD InSC/InPC can be reproduced, and the USE should just continue to consume the UCD InSC/InPC values, until an architecture change of the USE is planned.
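
As a very rough sketch of what "record enough nuances so InSC/InPC can be reproduced" might look like, the Python below stores a richer record per character and derives the two property values from it. Everything here is an assumption for illustration: the `CharRecord` schema, its field names, the placeholder code point, and the simplified rule that joins positions with `_And_` are hypothetical and do not reflect any existing database or the UCD's actual derivation.

```python
from dataclasses import dataclass

# Fixed order used when joining multiple positions (a simplifying assumption).
POSITION_ORDER = ("Top", "Bottom", "Left", "Right")


@dataclass(frozen=True)
class CharRecord:
    """Hypothetical low-level record for one encoded character."""
    codepoint: int
    role: str                          # e.g. "Consonant", "Vowel_Dependent"
    positions: frozenset = frozenset() # visual placements relative to the base
    notes: str = ""                    # free-form nuances the UCD cannot hold


def export_insc(record: CharRecord) -> str:
    """Derive an InSC-style value; here the role maps through unchanged."""
    return record.role


def export_inpc(record: CharRecord) -> str:
    """Derive an InPC-style value by joining positions in a fixed order."""
    parts = [p for p in POSITION_ORDER if p in record.positions]
    return "_And_".join(parts) if parts else "NA"


if __name__ == "__main__":
    rec = CharRecord(
        codepoint=0xE001,  # Private Use placeholder, not a real Indic character
        role="Vowel_Dependent",
        positions=frozenset({"Top", "Bottom"}),
        notes="made-up example of a two-part vowel sign",
    )
    print(export_insc(rec), export_inpc(rec))
```

The point of the design is the direction of the data flow: the richer record is the source, and the InSC/InPC values are an export from it, so a shaper can keep consuming those values unchanged.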

w3c / font-text-cg

Universal Shaping Engine data files #16