Extended unicode rules for identifier and infix operator names

[x] I have read CONTRIBUTING.md.
[x] I have checked that there is no existing PR/issue about my proposal.

Summary

Previously discussed on Discord: https://discord.com/channels/827106007712661524/827110256130916352/916364831047688252

This proposal is to change the rules for names in Idris 2 in order to be less surprising, more friendly to users of non-Latin scripts, and improve mathematical expressiveness. The rules should be unambiguous and clearly specified.

Motivation

The status quo is not great in my opinion. Names get flagged as infix operators when they shouldn't be. This is likely to be an unpleasant surprise for users coming from Agda, as well as other languages with strong Unicode support. Better support for Unicode would let users use familiar math symbols instead of ugly ASCII approximations.

It also prevents users from writing variable names in scripts other than Latin. This is a deficiency of several programming languages, but it is often for legacy compatibility reasons. There is no reason why Idris 2 should be one of these languages, in my opinion.

Note that Idris 2 uses the case of a character to infer implicit parameters in type signatures. Fortunately, Unicode provides an unambiguous list of code points that are considered "upper case", so this should/ not be a problem.

Julia has established Unicode name rules: https://docs.julialang.org/en/v1/manual/variables/#man-allowed-variable-names

Agda doesn't seem to have a formal spec for what is/isn't allowed, but this is their documentation: https://agda.readthedocs.io/en/v2.6.2/language/lexical-structure.html#keywords-and-special-symbols and relevant source https://github.com/agda/agda/blob/master/src/full/Agda/Syntax/Concrete/Glyph.hs

The proposal

I propose that identifiers follow these rules:

All names must be valid Unicode, regardless of file encoding. UTF-8 should be the preferred file encoding, if it isn't already required by the compiler.
All names are normalized to NFC form ("precomposed") during parsing, regardless of how they are written in the file.
A code point is considered "upper case" if it is listed in the leftmost column of this table: http://www.unicode.org/Public/UNIDATA/CaseFolding.txt. Otherwise it is "lower case", no matter how it looks.
If the name starts with any code point that is not in the "Letter" category (L*) or an ASCII underscore _, then it's an "operator name".
- Possible addendum: Operator names may not include any Letter code points.
Disallowed:
- Isolated combining characters (M*). Regex to detect these: https://stackoverflow.com/q/70086015/2954547.
- Whitespace (Z*).
- Any code point in the "Open Punctuation" (Ps) and "Close Punctuation" categories (Pe), due to possible confusion with syntactic ASCII braces, such as (.
- Any code point in the "Other" category (C*).
- Any name or character that is otherwise reserved, such as the variable name =, or a variable name containing @, and the ASCII backtick ` (which I cannot put into code formatting due to Markdown syntax).

The standard library should make all functionality accessible with plain ASCII names. However, Unicode aliases could be introduced, for common mathematical operators, e.g. the minus sign − as an alias for the ASCII "hyphen-minus" -.

Examples

Names starting with lower-case letters:

status
summaryData
lycéeFrançais
temp℉
ფერი
print->serve
print→serve

Names starting with upper-case letters:

STATUS
SummaryData
LycéeFrançais
Temp℉
Ⴔერი
Print->Serve
Print→Serve

Operator names:

+
>>=
⋯=
⨃
+s+ (prohibited under 3.1)

Disallowed:

Contains isolated combining characters: bad\u001F\u0303
Contains whitespace: add 1

Technical implementation

This might necessitate a Unicode compatibility interface for backends to implement. Currently, Unicode and string handling in general is backend-native (and therefore non-portable!).

Probably also relates to https://github.com/idris-lang/Idris2/issues/489.

Alternatives considered

You could just make Idris 2 ASCII-only. Unicode is fun and cosmetically pretty, but it poses compatibility issues for people without access to the right input methods and/or special editor support.

You could also permit a smaller subset of Unicode, e.g. selecting narrower character ranges. This is what Python does: https://docs.python.org/3/reference/lexical_analysis.html#identifiers.

This proposal does not account for the possible ambiguity that might arise if an infix operator happens to also be contained within a name, leading to parsing ambiguity. For example, is a∇b a single identifer, or infix notation for (∇) a b? This is already a problem in some cases relating to operators, but this proposal will only make it worse. You could prohibit specific common code points or ranges from non-operator names, such as ASCII - and +, but this is ad-hoc and doesn't cover cases that involve non-builtin operators.

Conclusion

I believe that this proposal will make Idris 2 more enjoyable to use!

Some discussion is still required, especially related to the infix operator ambiguity described in the prior section.

For reference, here's another similar unicode discussion in the Discord: https://discord.com/channels/827106007712661524/827106088343175180/952957979462893638

As you mention yourself, we could also make Idris2 ASCII-only. I actually think this is a great idea for exactly the reasons you mention! ^^
(Although I'm unsure about stuff like supporting "minor" Latin-script modifications like accents àéîñü and local characters ðþæøå; is that technically unicode support?)

For mathematical symbols, what about the idea of getting rid of \ as notation for lambdas in order to gain the ability to define things like \oplus or \Pi? It would then be up to the editor or documentation renderer how to display these. It might be a bit ugly (and possibly introduces some problems with escape sequences), but it's still ASCII and hence readable when you don't know the terminology. And it doesn't introduce complexity in the compiler for handling unicode symbols (which, as far as I understand, are not completely trivial to deal with internally) : )

Now, I'm biased and generally dislike unicode/maths symbols in code, but I'll try to detail+explain why I dislike them.

Code readability: Code should be pronounceable

The first reason is to do with ease-of-use and friendliness to newcomers/encouraging adoption of the language. While I can definitely see how having unicode support is convenient and nice if you are used to writing maths on paper or typesetting it in some form of TeX, I don't think it improves the readability of your code much (apart from for the people who are already intimately familiar with your research area), at the cost of heavily reducing its readability for people who happen to be interested/want to take a brief look.

I said this when the discussion came up in the Discord, but I'll re-state it here for reference:

When reading through source code, being able to just read it as words (e.g. "then we have f mapped over xs") is much easier and faster to work with than having to figure out what a symbol means, how it should be pronounced, etc. (e.g. "then we have f... weird <$> operator... xs"). If I'm already trying to understand what the code does (and/or walking through it with someone), I don't want to additionally have to worry about the fact that I can't easily refer to half the symbols in the code because I don't know how to pronounce them.

I think here's already plenty space for confusion with [| |] and >>= which are impossible to look up if you don't happen to know that they're called "idiom brackets" and "bind" respectively...

Idiom brackets nest the issue, since the docs explain that they can be desugared to a series of <*>. Unless you know that <*> is pronounced "apply", then you're back at square one in terms of readability.

By contrast, I think writing names out to be readable flows better, at the cost of taking up slightly more space. Which, speaking of writing, brings me to my next point.

Code writeability: How do you write unicode symbols?

I know some people create new keyboard layouts for these, or use Emacs which (as far as I understand) has built-in support for entering LaTeX-style macros and having them replaced with unicode symbols (e.g. typing \oplus and getting ⊕ in the buffer.

The issue with the first solution is that it reduces ease of adoption: If I want to try out a new programming language, I don't want to first have to define a new keyboard layout for all the symbols I'll need. Unless the language I'm looking at is the only language which does what I want, I'll find a similar one which is more easy to use with my regular keyboard.

The second solution ties the language to a specific editor. Which is also not good imo; you should be able to write code in whichever editor you prefer.

Instead, if you need custom operators, then why not just name them as such? If you're defining custom addition over pairs, then call it pairAdd or something similar. It's immediately readable, and (hopefully) also hints as to what it does. You can then define syntactic sugar or ASCII shortcuts for these (e.g. (>>=) = bind), but I believe their initial definition should be pronounceable.

What about opt-in?

You could of course have unicode as an opt-in feature, or use the argument that not everyone has to use unicode symbols in their code.

The problem with this is that if unicode becomes popular for a subset of the packages and you want to use one of these packages, now you suddenly need unicode support. Not because you wanted it, but because of an external dependency. I think this is bound to happen sooner rather than later, and so see the opt-in option as only opt-in until it suddenly isn't.

In general

In general I just think that the costs outweigh the benefits. Especially if we want Idris to (try to) be a general purpose programming language, rather than a maths/proof tool. However, unicode support seems to be a somewhat frequent thing people bring up, so there's probably something I'm missing : )

@edwinb has previously been opposed to adding unicode support to Idris: https://github.com/idris-lang/Idris-dev/pull/694#issuecomment-29559291

But I don't know if he's changed his stance on that since then : )

One immediate thought, I have others, with this proposal: Even if we expand/address Idris2's Unicode support, it might be prudent to establish/codify, and explicitly mention for the reason's already pointed as to why, a convention for packages that we ship with Idris2. That is: identifiers/operators are ASCII and lowercase/uppercase conventions et cetera. I think as a community we should not force people too much in one direction with their own projects, but we can make reasonable restrictions on Idris2's set of libs.

@CodingCellist you make a great point about pronounceability. This is something that really bothers me about Haskell, for example.

If anything, being able to use proper Unicode math symbols should help the pronounceability issue; rather than inventing cute looking sequences of ASCII punctuation, you can use established symbols with existing established names.

I also think that avoiding unpronounceable symbols as identifiers is more of a community cultural issue than something that can be enforced by the language. All of the notorious Haskell symbols are ASCII after all! Reducing the "symbol soup" is one of many reasons why I think Idris has a much nicer prelude and standard library than Haskell.

I also want to emphasize that this proposal is as much about writing expressive and international-friendly identifier names as it is about fancy infix operators. For example, using Greek letters which are common across a variety of math and engineering disciplines, or writing variable names in one's native language.

The point about input methods and text editor/font compatibility is the biggest obstacle in my opinion, and is the biggest point in favor of making Idris ASCII-only. As it stands currently, it is _already) possible to use names in libraries that most people cannot type; but they are restricted to infix operators. So if Idris will not transition to a more mature form of Unicode support, then I agree that it should simply be removed entirely.

As with the point about obscure symbols, using non-ASCII names in library public interfaces rightly should be frowned upon. I think this is the kind of thing where the community will enforce it on their own (as per @jfdm), because nobody will use a library that nobody physically can use because they can't type the names. Maybe the compiler could even omit a stern warning if such a name is exported from a module, or perhaps non-ASCII names should not be exportable at all. This last option could be really interesting when you consider how name-mangling should work in code generation. It also would allow people to use arbitrary Unicode in their internal code, but would prevent people from writing unusable libraries.

@gwerbin replacing unpronounceable unicode with unpronounceable symbol soup is indeed a problem. I guess when I say "ASCII only" what I actually mean is "plaintext only" : )

International/Non-English programming is something I think we should definitely enable, and if I understand you correctly, that requires unicode support (e.g. if I wanted to write my code in Danish, using æ in a name would require unicode support). Interesting!

I disagree that being able to use proper unicode math symbols generally helps with pronounceability. In my mind, all that does is raise the bar to entry: there may well be an agreed-upon pronunciation for a maths symbol, but if you don't have a maths degree (or don't have one in the relevant field), then you suddenly need to learn a bunch of symbols and their pronunciation, when you could've just read the code if it were plaintext. Having A ⊂ B is more difficult to read/pronounce than simply having a 'subset' B (or subset A B). You might not know what the term "subset" means, but at least you'd still be able to refer to it if you were thinking through the code or going over it with a friend or colleague.

The problem with libraries is what happens if someone writes an extremely useful or sought-after (or both!) library that uses unicode? Either the community then has to port it to ASCII, which could have been avoided if unicode wasn't allowed in the first place, or everyone who needs this functionality now has to use unicode. Neither option is great imo : /
The idea of disallowing exported names to contain unicode is really interesting! Although that might also hinder international programming if you can't structure your code into modules when you want to program not in English... Maybe a "library export" annotation or something could be used?... Anyway, I'm just thinking aloud here. I think it could be interesting to explore! : )

As far as I understand, we already have some unicode support, it's just not very... good? formalised? detailed? all of the above? And that should definitely be improved/clarified!
In terms of allowing unicode operators, I spoke with @edwinb recently and he said that unicode operators are out of the question; he hasn't changed his stance from https://github.com/idris-lang/Idris-dev/pull/694#issuecomment-29559291.

So, operators aside, if we want to work towards improving unicode support in "regular" names, what stuff do we need? Do we need a library to deal with parsing and/or handling unicode code points? And I guess we should formalise some rules somewhere (maybe along the lines of what you suggested in the proposal)? : )

International/Non-English programming is something I think we should definitely enable, and if I understand you correctly, that requires unicode support (e.g. if I wanted to write my code in Danish, using æ in a name would require unicode support). Interesting!

So, operators aside, if we want to work towards improving unicode support in "regular" names, what stuff do we need? Do we need a library to deal with parsing and/or handling unicode code points? And I guess we should formalise some rules somewhere (maybe along the lines of what you suggested in the proposal)? : )

Fortunately Idris 2 already considers a Char to be a Unicode code point, so that's already taken care of.

The main problem from a user perspective is that any character outside the ASCII range is currently considered upper case. The problem specifically seems to be here: https://github.com/idris-lang/Idris2/blob/523c0a/src/Parser/Lexer/Common.idr#L25-L26. This means that æ, α, б, and þ are all considered upper-case in type signatures, which means that if they are at the start of a name, the name won't be inferred as an implicit parameter.

You can fix this by pulling in that big Unicode capital letter table, or at least a subset thereof to handle upper-case letters in common scripts (Greek and Cyrillic?), and then assuming that any letter outside the known-upper-case set is lower case.

Alternatively, I think you could just remove the x > chr 160 check from the Capitalized case, which would cause all non-ASCII letters to be considered lowercase instead. This flips the behavior around, but probably is less surprising in general than the current behavior.

Standardizing Unicode string support across compiler backends might be a bigger issue (especially RefC), but fortunately most programming languages support Unicode strings "out of the box". However, the string representation of identifiers I think is mostly a separate problem, and I think these are two separate topics.

@gwerbin replacing unpronounceable unicode with unpronounceable symbol soup is indeed a problem. I guess when I say "ASCII only" what I actually mean is "plaintext only" : )

I disagree that being able to use proper unicode math symbols generally helps with pronounceability. In my mind, all that does is raise the bar to entry: there may well be an agreed-upon pronunciation for a maths symbol, but if you don't have a maths degree (or don't have one in the relevant field), then you suddenly need to learn a bunch of symbols and their pronunciation, when you could've just read the code if it were plaintext. Having A ⊂ B is more difficult to read/pronounce than simply having a 'subset' B (or subset A B). You might not know what the term "subset" means, but at least you'd still be able to refer to it if you were thinking through the code or going over it with a friend or colleague.

The problem with libraries is what happens if someone writes an extremely useful or sought-after (or both!) library that uses unicode? Either the community then has to port it to ASCII, which could have been avoided if unicode wasn't allowed in the first place, or everyone who needs this functionality now has to use unicode. Neither option is great imo : / The idea of disallowing exported names to contain unicode is really interesting! Although that might also hinder international programming if you can't structure your code into modules when you want to program not in English... Maybe a "library export" annotation or something could be used?... Anyway, I'm just thinking aloud here. I think it could be interesting to explore! : )

As far as I understand, we already have some unicode support, it's just not very... good? formalised? detailed? all of the above? And that should definitely be improved/clarified! In terms of allowing unicode operators, I spoke with @edwinb recently and he said that unicode operators are out of the question; he hasn't changed his stance from idris-lang/Idris-dev#694 (comment).

Good points all. I am a lot more interested in better name support anyway, than I am in fancy operators.

Non-letter non-ASCII characters like ⊌ are currently disallowed in names, but I am not sure where this logic is implemented in the code, so I can't comment on how precisely correct (or incorrect) the logic is.

being able to say the symbol

In PureScript all operators are required to have a name. You see both infix and the named operator used in contexts where they make more sense, and honestly you see map more than you do <$>. There’s also the question of what defines “fancy” symbols… a lot of keyboards don’t even have a $ key (and the standard Thai layouts make it a pain to type ```).

idris-lang / Idris2