n8willis / opentype-shaping-documents

Documentation of OpenType shaping behavior

Unicode 14 update #139

Closed n8willis closed 2 years ago

n8willis commented 3 years ago

This updates the character tables for the Arabic, Kannada, Mongolian, and Telugu docs to reflect additions in Unicode 14, including new codepoints and the corresponding Indic Positional / Indic Syllabic / Arabic Shaping / general-UCD info.

I believe these are the only scripts affected by the updated release. Please speak up if I have overlooked something.

Note that for Arabic there is an entirely new block (Extended-B) and some additional Joining Groups.

I don't believe that there were major changes to the info on existing codepoints (the delta charts seem to reflect mostly representative glyph updates ...) but that is worth a separate pass anyway; new codepoints are (at least) self-contained and not likely to break existing implementations.
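For anyone who wants to do that separate pass, here is a minimal sketch of one way to split a data update into new codepoints versus changed entries for existing codepoints. It assumes UCD-style rows with one codepoint per row and semicolon-separated fields, as in ArabicShaping.txt. The function names `parse_ucd_lines` and `diff_tables` are hypothetical, and the sample rows are illustrative only; they are not the actual Unicode 14 delta.

```python
def parse_ucd_lines(text):
    """Parse semicolon-separated UCD-style rows into {codepoint: fields}.

    Assumes one codepoint per row (no "XXXX..YYYY" ranges).
    """
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue  # skip blank and comment-only lines
        fields = [f.strip() for f in line.split(";")]
        table[int(fields[0], 16)] = tuple(fields[1:])
    return table


def diff_tables(old, new):
    """Return (added, changed) codepoints between two parsed tables."""
    added = sorted(cp for cp in new if cp not in old)
    changed = sorted(cp for cp in new if cp in old and new[cp] != old[cp])
    return added, changed


# Illustrative sample rows only, NOT the real Unicode 13 -> 14 delta.
OLD_SAMPLE = """\
# codepoint; schematic name; joining type; joining group
0627; ALEF; R; ALEF
0628; BEH; D; BEH
"""

NEW_SAMPLE = """\
0627; ALEF; R; ALEF
0628; BEH; D; BEH
0629; TEH MARBUTA; R; TEH MARBUTA
"""

old = parse_ucd_lines(OLD_SAMPLE)
new = parse_ucd_lines(NEW_SAMPLE)
added, changed = diff_tables(old, new)

print("added:", [f"U+{cp:04X}" for cp in added])
print("changed:", [f"U+{cp:04X}" for cp in changed])
```

An empty `changed` list would support the guess that existing entries were left alone, while anything in it would be a candidate for closer review.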

Note also that this update should be considered "raw" info. Several minor changes may have behavioral effects that will be discovered and sorted out by implementers. Will watch for such information from HarfBuzz and Allsorts, among others!

Of particular note in this respect is the fact that Kannada and Telugu have now acquired codepoints for a CONSONANT_DEAD letter, Nakaara Pollu. There is an existing issue on that letter, #116, which has so far received no comments. If it affects syllable-id or shaping, that will probably mean revision to the actual shaping docs for those scripts.

wezm commented 2 years ago

I've done a Unicode 14 update to Allsorts. This mostly involved updating the various data used from the UCD as well as the following:

* Update the list of Arabic chars that are modifier combining marks.
* Update the shaping class according to your updates here.

Aside from this I've not made any behavioural changes to the shaping engine.

n8willis commented 2 years ago

> * Update the list of Arabic chars that are modifier combining marks.
> * Update the shaping class according to your updates here.

Great! Were there any surprises to be found in the Arabic MCM list? (I don't know; mostly I'm just curious if you called it out for some specific reason)

n8willis commented 2 years ago

This merge brings the data up to Unicode 14. The greatest number of changes are found in Arabic, though, so any users of that script-specific shaping info would be wise to look it over with extra scrutiny and, if something looks off, to open an issue.

wezm commented 2 years ago

> Were there any surprises to be found in the Arabic MCM list? (I don't know; mostly I'm just curious if you called it out for some specific reason)

I don't think so. I mentioned it because I noticed that there were new code points listed in https://www.unicode.org/reports/tr53/tr53-6.html#MCM which seemed to roughly coincide with the Unicode 14 release, so I bundled the change into my Unicode 14 updates.

n8willis commented 2 years ago

That makes sense. AFAICT, HarfBuzz hasn't yet added those MCM additions, but that might be related to the (known) mismatch between UTR#53 and HarfBuzz normalization. I should reread that.

khaledhosny commented 2 years ago

> HarfBuzz hasn't yet added those MCM additions,

I think it is just an oversight; I made a PR to add them: https://github.com/harfbuzz/harfbuzz/pull/3422

n8willis commented 2 years ago

Okay; great. I had also noticed that the UTR#53 update was published a while after the Unicode 14 PR, so that makes sense.

n8willis commented 2 years ago

Small change, so I pushed that directly in beffad8.