Closed n8willis closed 2 years ago
I've done a Unicode 14 update to Allsorts. This mostly involved updating the various data used from the UCD as well as the following:
* Update the list of Arabic chars that are modifier combining marks.
* Update the shaping class according to your updates here.
Aside from this I've not made any behavioural changes to the shaping engine.
Great! Were there any surprises to be found in the Arabic MCM list? (I don't know; mostly I'm just curious if you called it out for some specific reason)
This merge brings the data up to Unicode 14. The greatest number of changes are found in Arabic, though, so any users of that script-specific shaping info would be wise to look it over with extra scrutiny and, if something looks off, to open an issue.
> Were there any surprises to be found in the Arabic MCM list? (I don't know; mostly I'm just curious if you called it out for some specific reason)
I don't think so. I mentioned it because I noticed that there were new code points listed in https://www.unicode.org/reports/tr53/tr53-6.html#MCM which seemed to roughly coincide with the Unicode 14 release, so I bundled the change into my Unicode 14 updates.
That makes sense. AFAICT, HarfBuzz hasn't yet added those MCM additions, but that might be related to the (known) mismatch between UTR#53 and HarfBuzz normalization. I should reread that.
> HarfBuzz hasn't yet added those MCM additions,
I think it is just an oversight, I made a PR to add them https://github.com/harfbuzz/harfbuzz/pull/3422
Okay; great. I had also noticed that the UTR53 update was published a while after the Unicode 14 PR, so that makes sense.
Small change, so I pushed that directly in beffad8.
This updates the character tables for the Arabic, Kannada, Mongolian, and Telugu docs to reflect additions in Unicode 14, including new codepoints and the corresponding Indic Positional / Indic Syllabic / Arabic Shaping / general-UCD info.
I believe these are the only scripts affected by the updated release. Please speak up if I have overlooked something.
Note that for Arabic there is an entirely new block (Extended-B) and some additional Joining Groups.
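For implementers who want a quick way to spot text that touches the new block, here is a minimal sketch. It assumes the Arabic Extended-B range is U+0870..U+089F, as listed in the Unicode 14 `Blocks.txt`; the helper names are hypothetical and not taken from any shaping engine.

```python
# Assumption: Arabic Extended-B spans U+0870..U+089F (per Unicode 14 Blocks.txt).
# range() excludes its end value, so 0x08A0 is the first code point outside the block.
ARABIC_EXTENDED_B = range(0x0870, 0x08A0)


def in_arabic_extended_b(ch: str) -> bool:
    """Return True if the single character ch lies in Arabic Extended-B."""
    return ord(ch) in ARABIC_EXTENDED_B


def extended_b_chars(text: str) -> list[str]:
    """List, in order, the code points in text that fall in Arabic Extended-B."""
    return [f"U+{ord(c):04X}" for c in text if in_arabic_extended_b(c)]
```

A check like this only says that a code point is in the block's range; it does not say whether the code point is assigned or what its joining behavior is.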
I don't believe that there were major changes to the info on existing codepoints (the delta charts seem to reflect mostly representative glyph updates ...) but that is worth a separate pass anyway; new codepoints are (at least) self-contained and not likely to break existing implementations.
Note also that this update should be considered "raw" info. Several minor changes may have behavioral effects that will be discovered and sorted out by implementers. Will watch for such information from HarfBuzz and Allsorts, among others!
Of particular note in this respect is the fact that Kannada and Telugu have now acquired codepoints for a `CONSONANT_DEAD` letter, Nakaara Pollu. There is an existing issue on that letter, #116, which has so far received no comments. If it affects syllable-id or shaping, that will probably mean revision to the actual shaping docs for those scripts.
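As a rough aid for anyone testing this, the sketch below reports what a local Python runtime knows about the two Nakaara Pollu code points. It assumes they are U+0C5D (Telugu) and U+0CDD (Kannada); the function name and the fallback string are hypothetical. The result depends on the UCD version bundled with the interpreter, so an older runtime will report them as unassigned.

```python
import unicodedata

# Assumption: Nakaara Pollu is encoded at these code points in Unicode 14.
NAKAARA_POLLU = {"Telugu": 0x0C5D, "Kannada": 0x0CDD}


def report(code_points=NAKAARA_POLLU):
    """Map each script to (code point label, character name, general category)."""
    rows = {}
    for script, cp in code_points.items():
        ch = chr(cp)
        # unicodedata.name() returns the default when the runtime's UCD
        # predates the code point, instead of raising ValueError.
        name = unicodedata.name(ch, "<unassigned in this UCD>")
        rows[script] = (f"U+{cp:04X}", name, unicodedata.category(ch))
    return rows


if __name__ == "__main__":
    print("UCD version:", unicodedata.unidata_version)
    for script, row in report().items():
        print(script, *row)
```

This only inspects general UCD properties. The `CONSONANT_DEAD` classification comes from the Indic syllabic category data, which Python's `unicodedata` module does not expose.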