levbishop opened 1 year ago
This all sounds logical to me. I think the killer for NFKC is the handling around the variants of things like \phi/\varphi - they may technically be the same letter, but in my experience I've seen physicists make a deliberate distinction between them to refer to different things when running out of letters, so we may as well just use the normalisation form that doesn't impact that. We can just explicitly allow both MICRO_SIGN 'µs' and GREEK_LETTER_MU 'μs' as synonyms for us without needing them to be part of the Unicode normalisation we choose.
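A minimal sketch of what that could look like in the lexer (the helper name here is mine, not from the spec):

```python
# A sketch only: "\u00b5" is MICRO SIGN, "\u03bc" is GREEK SMALL LETTER MU.
# NFC folds neither onto the other, so a lexer would list both spellings of
# the microsecond unit explicitly and collapse them onto the ASCII "us".
_MICRO_SPELLINGS = {"\u00b5s", "\u03bcs"}

def canonical_duration_unit(unit: str) -> str:
    return "us" if unit in _MICRO_SPELLINGS else unit

assert canonical_duration_unit("\u00b5s") == "us"
assert canonical_duration_unit("\u03bcs") == "us"
assert canonical_duration_unit("ns") == "ns"
```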
If the only objection to NFKC is an easily-specified subset of normalizations, but all the other NFKC choices are otherwise reasonable, then UAX31-R4/UAX31-R6 do allow specifying a set of characters to exclude from normalization/filtering, which could include \varphi, \varrho, other variant forms, maybe some of the mathematical variants, maybe some of the letterlike symbols. After excluding those, the remaining 1351 NFKC normalizations don't seem too controversial.
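As a rough sketch of what applying NFKC with such an exclusion set could look like (the particular excluded characters and the helper are illustrative only, and a real implementation would need to be more careful about combining sequences):

```python
import unicodedata

# Illustrative exclusion set: keep the \varphi / \varrho style distinctions.
EXCLUDED = {
    "\u03d5",  # GREEK PHI SYMBOL
    "\u03f1",  # GREEK RHO SYMBOL
}

def normalize_identifier(name: str) -> str:
    """NFKC-normalize, but leave characters in EXCLUDED untouched,
    in the spirit of UAX31-R4/R6 with an excluded-character set."""
    out, run = [], []
    for ch in name:
        if ch in EXCLUDED:
            out.append(unicodedata.normalize("NFKC", "".join(run)))
            out.append(ch)  # excluded character passes through as-is
            run = []
        else:
            run.append(ch)
    out.append(unicodedata.normalize("NFKC", "".join(run)))
    return "".join(out)

print(normalize_identifier("\u03d5"))   # ϕ survives (excluded)
print(normalize_identifier("\u2126"))   # OHM SIGN is still folded to Ω
```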
To be clear, I'm not actually advocating for this; I still think this kind of thing is better handled by PEP8-type style guides or local project-specific rules (eg a Russian project might allow only ASCII+Cyrillic with some restrictions on mixed-script identifiers).
Inspired by #428 I promised in the last TSC meeting to take a look at best practices for handling unicode equivalences in programming languages. As Jake suspected, like most things Unicode, it turns out to be a pretty involved topic.
Identifiers: UAX31
Most unicode-related discussion seems to be around which identifiers should be considered valid. There is a whole unicode annex on the topic UAX-31: UNICODE IDENTIFIER AND PATTERN SYNTAX which goes through a lot of the possible decisions/related pitfalls and gives various compliance statements that an implementation may assert to give a uniform understanding of these issues across languages/implementations.
The current OpenQasm spec identifier section is similar to a UAX31 compliance statement of the form:

Start := [[:XID_Start:]_]
Continue := [:XID_Continue:]
Medial := []
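For concreteness, a rough Python equivalent of that production (using the third-party regex module for the \p{XID_Start}/\p{XID_Continue} properties; the helper is mine):

```python
import regex  # third-party module: pip install regex

# Start    := [[:XID_Start:]_]
# Continue := [:XID_Continue:]
IDENTIFIER = regex.compile(r"\A[\p{XID_Start}_]\p{XID_Continue}*\Z")

def is_valid_identifier(name: str) -> bool:
    return IDENTIFIER.match(name) is not None

print(is_valid_identifier("theta_1"))   # True
print(is_valid_identifier("θ1"))        # True
print(is_valid_identifier("1θ"))        # False (cannot start with a digit)
```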
Relative to the current spec, while technically a breaking change, the difference is minor and unlikely to invalidate any code in the wild: for Start it is just removal of 22 codepoints and addition of 4 codepoints; for Continue it is removal of 18 codepoints and addition of 3,130 codepoints.

Security: UTS39 and UTR36
There is a class of so-called Trojan Source attacks where source code can pass human review (and sometimes also automated analysis) but behave differently in the execution environment than the displayed form suggests. Other unicode documents relevant to our discussion are UTR36: UNICODE SECURITY CONSIDERATIONS and UTS39: UNICODE SECURITY MECHANISMS, which has a section about identifiers. Sticking to characters with Identifier_Status=Allowed as defined there removes 27323 codepoints from the set of Continue characters, for various reasons. A problem with defining things this way is that there are no forward- or backward-compatibility guarantees around Identifier_Status, and characters can have their status change in future revisions, which seems undesirable for programming-language stability. We could freeze a version of UTS39, but then we wouldn't get the benefit of updated protection against newly-discovered security issues. There is an interesting proposal IPIC9 that makes the case for this (see also the two follow-up posts with more reasoning about punctuation and immutability), but this never seems to have caught on in the wild and ultimately I don't like it for general-purpose programming source code.
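The standard library doesn't expose Identifier_Status, but a rough sketch of checking it against a downloaded copy of the UTS39 data file IdentifierStatus.txt could look like this (the path and helper names are illustrative):

```python
def load_allowed(path="IdentifierStatus.txt"):
    """Parse the UTS39 IdentifierStatus.txt data file into a set of
    codepoints with Identifier_Status=Allowed."""
    allowed = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.split("#", 1)[0].strip()   # drop comments
            if not line:
                continue
            cp_range, status = (field.strip() for field in line.split(";")[:2])
            if status != "Allowed":
                continue
            if ".." in cp_range:
                lo, hi = cp_range.split("..")
            else:
                lo = hi = cp_range
            allowed.update(range(int(lo, 16), int(hi, 16) + 1))
    return allowed

ALLOWED = load_allowed()
print(len(ALLOWED))           # size of the Allowed set for this UCD version
print(ord("a") in ALLOWED)    # True
```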
Normalization

All the stuff I read basically agreed that the only sensible options are NFC and NFKC. The argument for NFC is that NFC normalization just gets rid of meaningless and invisible differences, and most input methods don't give any reasonable way to generate sequences that are different but NFC-equivalent. NFC normalization of XID_Continue eliminates 1098 code points, the ones that jump out being the singleton decompositions such as OHM SIGN → GREEK CAPITAL LETTER OMEGA, which seems obviously correct to me. Important to our discussion, NFC does not for some reason normalize MICRO SIGN to GREEK SMALL LETTER MU.
NFKC normalization is less obvious. Beyond the codepoints eliminated by NFC, it eliminates 2532 additional codepoints. Some of the normalization seems obviously correct, like MICRO SIGN → GREEK SMALL LETTER MU, which seems much like the above choices around OHM SIGN etc. Some of the normalization seems wrong, like \varphi, \varrho etc., where the distinction is the whole point. Some cases are ambiguous, like the mathematical doublestruck symbols, which are clearly visually distinguishable, or the superscript/subscript letters/numbers.
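Concretely, with unicodedata again:

```python
import unicodedata

# U+03D5 GREEK PHI SYMBOL and U+03C6 GREEK SMALL LETTER PHI are the two phi
# variants (which one a font shows as \phi vs \varphi differs by convention);
# NFKC folds the symbol form onto the letter form, so a deliberate distinction
# between them is lost.  The same happens for \varrho.
print(unicodedata.normalize("NFKC", "\u03d5"))   # 'φ' (GREEK SMALL LETTER PHI)
print(unicodedata.normalize("NFKC", "\u03f1"))   # 'ρ' (GREEK SMALL LETTER RHO)

# The ambiguous cases get folded too:
print(unicodedata.normalize("NFKC", "\u2102"))   # 'C'  (from DOUBLE-STRUCK CAPITAL C)
print(unicodedata.normalize("NFKC", "x\u00b2"))  # 'x2' (superscript two becomes '2')
```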
I guess the problem is that NFKC normalization was not built around the needs of programming languages (and from what I understand of the Unicode stability guarantees, this is not something that can be "fixed" in the future). One argument against NFKC is that editor/command-line search tools don't usually support NFKC-equivalent searching, so "find all the uses of the identifier by the name of X" needs specific tool support. UTS55 below gives some examples of the kind of thing that can go wrong there.
A rule of thumb (formalized in the draft UTS55) I've seen is that case-insensitive languages should use NFKC and case-sensitive languages should use NFC.
A final decision is whether to use equivalent normalized comparisons (where the input stream is converted to the normalization form before further processing) or filtered normalization (where having codepoints outside of the normalization form in the input is an error). Choosing the latter avoids the searching-for-identifiers problem from above, but disallows formatting the identifiers the way you would like to, which is more of a problem for some scripts (eg Farsi) than others. It seems pretty obvious that the only 3 combinations that make sense are:
If we chose NFKC, I wonder if it might be friendliest to actually go with a non-standard mixed equivalent/filtered normalization, where the input is first NFC-normalized and any remaining non-NFKC codepoints are treated as an error. Nevertheless...
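A minimal sketch of that mixed behaviour (the helper name and error type are mine):

```python
import unicodedata

def prepare_identifier(name: str) -> str:
    """NFC-normalize, then reject anything that NFKC would still change
    (i.e. remaining compatibility variants are an error)."""
    nfc = unicodedata.normalize("NFC", name)
    if unicodedata.normalize("NFKC", nfc) != nfc:
        raise ValueError(f"identifier {name!r} contains compatibility characters")
    return nfc

print(prepare_identifier("\u2126_total"))   # OHM SIGN quietly becomes Ω via NFC
try:
    prepare_identifier("x\u00b2")           # superscript two survives NFC but not NFKC
except ValueError as err:
    print(err)
```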
General programming language considerations: UTS55
Since I started looking into this, a draft proposal was published for a new document UTS55: UNICODE SOURCE CODE HANDLING, which should be read alongside draft proposed updates to UAX31.
This looks pretty comprehensive, but it will probably take a bit of effort to translate for our needs. Also, it still leaves some decisions for the language designer. Section 4.1.3 on nested languages is likely relevant to defcal blocks.

What do other languages do?
All of these in the latest versions are based on UAX31-R1 default identifiers. They all use UAX31-R4: Equivalent Normalized Identifiers, with Python choosing NFKC and the rest choosing NFC. (Actually it's not 100% clear whether C and C++ mean UAX31-R4 Equivalent Normalized Identifiers or UAX31-R6 Filtered Normalized Identifiers; the compliance statement says Equivalent but the commentary suggests Filtered.)
Here's a Python example showing this in practice:
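For instance, Python NFKC-normalizes identifiers at parse time (PEP 3131), so two visually distinct spellings of phi end up naming the same variable (any NFKC-foldable spelling pair would show the same thing):

```python
import unicodedata

# The GREEK PHI SYMBOL spelling and the GREEK SMALL LETTER PHI spelling are
# the same name as far as the interpreter is concerned.
ϕ = 1          # U+03D5 GREEK PHI SYMBOL
print(φ)       # U+03C6 GREEK SMALL LETTER PHI -> prints 1

# The same equivalence, spelled out explicitly:
print(unicodedata.normalize("NFKC", "\u03d5") == "\u03c6")  # True
```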