n8willis / opentype-shaping-documents

Documentation of OpenType shaping behavior

Look at new regex operators from TR18 #143

Closed: n8willis closed this issue 2 years ago

n8willis commented 2 years ago

Unicode TR18 was just updated to add 'not'/'complement' operators that formally distinguish between applying to a string and applying to a particular codepoint.

The point, I think, is that regular expressions need to be able to express "codepoint not U+ABCD" in a simple fashion without having that match "literally any string other than U+ABCD". So I do wonder if that would help simplify any of the regular expressions used for syllable or subsequence matching.
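To make the distinction concrete, here is a minimal sketch in Python. It does not use the TR18 operators themselves (Python's `re` module does not implement them, as far as I know); the string-level complement is only emulated with a negative lookahead, and the helper names are hypothetical.

```python
import re

# Codepoint-level complement: exactly one code point that is not U+ABCD.
codepoint_not = re.compile(r"[^\uABCD]")

# String-level complement, emulated (an assumption, not TR18 syntax):
# any whole string, including the empty one, that is not exactly U+ABCD.
string_not = re.compile(r"(?!\uABCD\Z).*", re.DOTALL)


def matches_codepoint_not(s: str) -> bool:
    """True if s is a single code point other than U+ABCD."""
    return codepoint_not.fullmatch(s) is not None


def matches_string_not(s: str) -> bool:
    """True if s is any string other than the one-codepoint string U+ABCD."""
    return string_not.fullmatch(s) is not None


for sample in ["a", "", "ab", "\uABCD", "\uABCD\uABCD"]:
    print(repr(sample), matches_codepoint_not(sample), matches_string_not(sample))
```

The two predicates agree on `"a"` and on U+ABCD itself, but diverge on the empty string and on anything longer than one code point, which is the gap the new operators are meant to make explicit.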

mikeday commented 2 years ago

Interesting, that's a lot of set-theoretic machinery to define this from scratch, given that most regular expression languages already allow [^a]. That said, [^a] is specifically a negated character class, which can be interpreted as shorthand for "the alternation of every possible character excluding 'a'", rather than a true complement operator that applies to arbitrary regexps.
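A rough sketch of that reading, in Python, over a deliberately tiny alphabet. The four-letter alphabet, the `ab*` example pattern, and the `complement_accepts` helper are all assumptions made for illustration; nothing here comes from TR18 or from the shaping documents.

```python
import re
from itertools import product

ALPHABET = "abcd"  # toy alphabet, assumed so the alternation stays finite

# Negated character class versus the explicit alternation it abbreviates.
negated_class = re.compile(r"[^a]")
alternation = re.compile("|".join(re.escape(c) for c in ALPHABET if c != "a"))


def complement_accepts(pattern: str, s: str) -> bool:
    """Emulated complement of an arbitrary regexp: accept s iff pattern rejects it."""
    return re.fullmatch(pattern, s) is None


# Every string over the toy alphabet of length 0, 1, or 2.
strings = ["".join(p) for n in range(3) for p in product(ALPHABET, repeat=n)]

# The class and the alternation accept exactly the same strings.
same_language = all(
    (negated_class.fullmatch(s) is None) == (alternation.fullmatch(s) is None)
    for s in strings
)

# A complement of a multi-character regexp has no character-class equivalent.
not_ab_star = [s for s in strings if complement_accepts(r"ab*", s)]

print(same_language)
print(not_ab_star)
```

Over this alphabet the class and the alternation are interchangeable, while the complement of `ab*` accepts the empty string and most two-character strings, which no single negated class can express.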

n8willis commented 2 years ago

Yeah, sometimes with these Unicode documents I feel like I'm lacking some context that would make them make more sense. E.g., they may be thinking about some particular RE language or system for which this is a real improvement.

n8willis commented 2 years ago

Having perused this a bit more, I don't see anything I think would affect shaping-level concerns. Possibly more useful for higher-level text handling.