Clarify unicode support

levbishop commented 1 year ago

Inspired by #428 I promised in the last TSC meeting to take a look at best practices for handling unicode equivalences in programming languages. As Jake suspected, like most things Unicode, it turns out to be a pretty involved topic.

Identifiers: UAX31

Most unicode-related discussion seems to be around which identifiers should be considered valid. There is a whole unicode annex on the topic UAX-31: UNICODE IDENTIFIER AND PATTERN SYNTAX which goes through a lot of the possible decisions/related pitfalls and gives various compliance statements that an implementation may assert to give a uniform understanding of these issues across languages/implementations.

The current OpenQasm spec identifier section is similar to a UAX31 compliance statement of the form:

UAX31-C1: The OpenQASM language conforms to version 37 of the Unicode® Standard Annex #⁠31
UAX31-C2: It observes the following requirements:
- UAX31-R1-2. Default Identifiers: To determine whether a string is an identifier it uses UAX31-D1 with the following profile:
- Start := [[:XID_Start:]_]
- Continue := [:XID_Continue:]
- Medial := []
- UAX31-R1b. Stable Identifiers: Once a string qualifies as an identifier, it does so in all future versions.

Relative to the current spec, while technically a breaking change, the difference is minor and unlikely to invalidate any code in the wild:

The difference for Start is just removal of 22 codepoints and addition of 4 codepoints.
For Continue the difference is removal of 18 codepoints and addition of 3,130 codepoints

My thought: I'm happy to assume there are good reasons for these differences that deep thinkers who authored UAX31 have considered, and just modify OpenQASM to use XID_Start, XID_Continue.

Security: UTS39 and UTR36

There are a class of so-called Trojan Source attacks where source code can pass human review (and sometimes also automated analysis) but in the execution environment behave differently than the displayed form suggests. Other unicode documents that are relevant to our discussion are UTR36: UNICODE SECURITY CONSIDERATIONS and UTS39: UNICODE SECURITY MECHANISMS, which has a section about identifiers. Sticking to characters with Identifier_Status=Allowed as defined there removes 27323 codepoints from the set of Continue characters, for various reasons.

A problem with defining things this way is that there are not forward or backward-compatibility guarantees around the Identifier_Status and characters can have their status change in future revisions, which seems undesirable for a programming language stability. We could freeze a version of UTS39, but then we wouldn't get the benefit of updated protection against newly-discovered security issues. There is an interesting proposal IPIC9 that makes the case for this (see also the two follow-up posts with more reasoning about punctuation and immutability), but this never seem to catch on in the wild and ultimately I don't like it for general purpose programming source code.

My thought: It's useful to read UTS39/UTR36 for other security considerations, but the Identifier_Status is too unstable for language spec use. It can factor into linter rules, syntax highlighting in programmer editors, etc. Identifier_Status maybe makes sense for things where the identifiers must be used directly, such as hashtags or international domain names, but for programming languages this can delegate to style guides and external tooling rather than the language spec and compiler. (And there are many other unicode security issues such as bidi, line breaking, confusable identifiers etc, that have to be handled in a more global way than simply at the identifier lexing stage, so such tooling will be necessary no matter what). This is the philosophy taken by most other languages: python, rust, etc

Normalization

All the stuff I read basically agreed that the only sensible options are NFC and NFKC. The argument for NFC is that NFC normalization just gets rid of meaningless and invisible differences, and most input methods don't give any reasonable way to generate sequences that are different but NFC-equivalent. NFC normalization of XID_Continue eliminates 1098 code points the ones that jump out being

Ω U+2126 OHM SIGN
K U+212A KELVIN SIGN
Å U+212B ANGSTROM SIGN

which seems obviously correct to me. Important to our discussion NFC does not for some reason normalize

µ U+00B5 MICRO SIGN

NFKC normalization is less obvious. Beyond the codepoints eliminated by NFC, It eliminates 2532 additional codepoints Some of the normalization seems obviously correct like:

ﬀ U+FB00 LATIN SMALL LIGATURE FF
ⅻ U+217B SMALL ROMAN NUMERAL TWELVE
µ U+00B5 MICRO SIGN

which seem much like the above choices around OHM SIGN etc. Some of the normalization seems wrong, like

𝜛 U+1D71B MATHEMATICAL ITALIC PI SYMBOL
similarly for \varphi, \varrho etc

where the distinction is the whole point. Some cases are ambiguous like the mathematical doublestruck symbols which are clearly visually distinguishable, or the superscript/subscript letters/numbers.

I guess the problem is that NFKC normalization was not built around the needs of programming languages (and what I understand of the unicode stability guarantees this is not something that can be "fixed" in the future). One argument against NFKC is that editor/command line search tools don't usually support NFKC-equivalent searching, so "find all the uses of the identifier by name of X" needs specific tool support. UTS55 below gives some examples of the kind of thing can go wrong there.

A rule of thumb (formalized in the draft UTS55) I've seen is that case-insensitive languages should use NFKC and case-sensitive languages should use NFC.

A final decision is whether to use equivalent normalized comparisons (where the input stream is converted to the normalization form before further processing) or filtered normalization (where having codepoints outside of the normalization form in the input is an error). Choosing the latter avoids the searching-for-identifiers problem from above, but disallows formatting the identifiers the way you would like to, which is more of a problem for some scripts (eg Farsi) than others. It seems pretty obvious that the only 3 combinations that make sense are:

UAX31-R4 Equivalent Normalized Identifiers using Normalization Form C (NFC).
UAX31-R4 Equivalent Normalized Identifiers using Normalization Form KC (NFKC).
UAX31-R6 Filtered Normalized Identifiers using Normalization Form KC (NFKC).

If we chose NFKC I wonder if it might be friendliest to actually go with a non-standard mixed equivalent/filtered normalization where the input is first NFC normalized and any remaining non-NFKC codepoints are treated as an error. Nevertheless...

My thought: I don't love it, but we should go with the flow of other languages and UTS55 and use UAX31-R4 Equivalent Normalized Identifiers using Normalization Form C (NFC). As a special case, since there are a few greek letters OpenQasm treats specially (pi, mu, etc) we should just clarify in the spec exactly which of their variants we accept as synonyms. Much as the proposed #428 does already. Any remaining ambiguities should be handled in style-guides and linters, etc.

General programming language considerations: UTS55

Since I started looking into this, a draft proposal was published for a new document UTS55: UNICODE SOURCE CODE HANDLING, which should be read alongside draft proposed updates to UAX31

This looks pretty comprehensive, but it will probably take a bit of effort to translate for our needs. Also it still leaves some decisions for the language designer. The section 4.1.3 on nested languages is likely relevant to defcal blocks.

My thought: This is fiddly stuff, and we can surely find issues to quibble with the recommendations here, but unicode is fundamentally messy and unsatisfying and we aren't interested in being unicode experts, so outsourcing the decision-making here is probably the right choice so we should follow UTS55 once it publishes. Another reason to follow UTS55 is that a lot of the recommendations there spread the responsibilities among the compiler/spec and other tooling IDEs/text editors/pretty printers/linters etc so if we deviate from eg the line break rules of UTS55 then unless editors implement OpenQASM-specific linebreaking rules, invisible character highlighting rules..., there would likely be deviations between the editor and compiler's ideas about the extent of comments, etc.

What do other languages do?

All of these in the latest versions are based on UAX31-R1 default identifiers. They all use UAX31-R4: Equivalent Normalized Identifiers, with python choosing NFKC and the rest choosing NFC. (Actually its not 100% clear if C and C++ mean UAX31-R4 Equivalent Normalized Identifiers or UAX31-R6 Filtered Normalized Identifiers, the compliance statement says Equivalent but the commentary suggests Filtered.)

Here's a python example showing this in practise:

μ𝛍µ𝝁𝜇𝞵𝝻 = 1
μμμμμμμ += 10 # the same variable by a different display name, openqasm should treat them as distinct
print(μμμμμμμ) # 11
π𝞏𝛑ᴨ𝝅𝜋𝟉𝝿ϖ𝛡𝝕ℼ𝜛𝞹 = 1 # but not including п
πππᴨππππππππππ += 20
print(πππᴨππππππππππ) # 21

jakelishman commented 1 year ago

This all sounds logical to me. I think the killer for NFKC is the handling around the variants of things like \phi/\varphi - they may technically be the same letter, but in my experience I've seen physicists make a deliberate distinction between them to refer to different things when running out of letters, so we may as well just use the normalisation form that doesn't impact that. We can just explicitly allow both MICRO_SIGN 's' and GREEK_LETTER_MU 's' as synonyms for us without needing them to be part of the Unicode normalisation we choose.

levbishop commented 1 year ago

If the only objection to NFKC is an easily-specified subset of normalizations but all the other NFKC choices are otherwise reasonable then UAX31-R4/UAX31-R6 do allow specifying a set of characters to exclude from normalization/filtering, which could include \varphi, \varrho, other variant forms, maybe some of the mathematical variants, maybe some of the letterlike symbols.

After excluding those, the remaining 1351 NFKC normalizations don't seem too controversial.

To be clear, I'm not actually advocating for this, I still think this kind of thing is better handled by PEP8-type style guides or local project-specific rules (eg a Russian project might allow only ASCII+Cyrillic with some restrictions on mixed-script identifiers)

openqasm / openqasm