whiterock commented 3 years ago

Unicode support for infix operators

Abstract

Allow certain Unicode characters to be used as infix operators, akin to what Julia allows but more restricted.

Motivation

Allows for beautiful linear algebra code, can be helpful in using code to show off algorithms in a Computer Science kind of setting, is a very much loved feature of Julia, etc. The initial idea for this came from https://forum.nim-lang.org/t/2968

Description

I propose to allow the following characters to be used as infix operators:

Parsed with the same precedence as `+`

± ⊕ ⊖ ⊞ ⊟ ∪ ∨ ⊔

Parsed with the same precedence as `*`

∙ ∘ × ★ ⊗ ⊘ ⊙ ⊛ ⊠ ⊡ ∩ ∧ ⊓

which is a very small subset of what Julia allows: cf. https://stackoverflow.com/a/60321302/4038300 Note: This RFC should not fail on the grounds of one character here. It's just my initial proposal; feel free to argue why some characters should be added or omitted.

There should be no confusion about precedence here!
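A minimal sketch of how the two precedence levels would play out, assuming a compiler that accepts these operators (Nim 2.0 or later, or Nim 1.6 with --experimental:unicodeOperators). The integer definitions of ⊕ and ⊗ are hypothetical and exist only for illustration:

```nim
# Assumption: the compiler parses ⊕ like `+` and ⊗ like `*`.
func `⊕`(a, b: int): int = a + b   # hypothetical: plain addition
func `⊗`(a, b: int): int = a * b   # hypothetical: plain multiplication

# ⊗ binds tighter than ⊕, so this reads as 2 ⊕ (3 ⊗ 4).
doAssert 2 ⊕ 3 ⊗ 4 == 14
# Parentheses override the precedence, as with `+` and `*`.
doAssert (2 ⊕ 3) ⊗ 4 == 20
```

An expression mixing the two groups groups the same way as its ASCII counterpart with `+` and `*`.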

Potential downsides: Libraries offering, for example, ⊗(x,y) but not kronecker(x,y). This should be discouraged so as not to hinder people from using a library just because they cannot easily type these Unicode glyphs. The operators should always be optional; I guess we cannot enforce this other than by putting it in the docs or a style guide.

Examples

So for example consider the mixed-product property of the Hadamard and Kronecker products:

Before

doAssert hadamard(kronecker(A,B), kronecker(C,D)) == kronecker(hadamard(A,C), hadamard(B,D))

After

doAssert (A ⊗ B) ∘ (C ⊗ D) == (A ∘ C) ⊗ (B ∘ D)

Or for people that love the brevity of math notation for logic consider:

Before

let truth = a and (b or c)

After

let truth = a ∧ (b ∨ c)

Implementation

I am not a Nim contributor yet, but I imagine the implementation would be rather easy :)

Further thoughts?

This makes all code unreadable ??? No it does not. It is optional, we choose a subset of unicode characters that are distinguishable and whose precedence is obvious. Consider also in your argument against it that we have things like the %* macro already in json.nim

@Araq suggested adding additional Unicode parentheses to Nim. It's probably wiser to make a separate RFC about that? In any case it is better if he suggests which symbols he wants and his general thoughts on it, since it's his idea! :)

I could imagine lots of cool mathy things to take life in the sugar.nim package or perhaps more wisely (?) a separate package, since sugar.nim also provides these pretty abbreviations for people that appreciate that.

Considered but rejected

Parsed with the same precedence as `*`

⋆

Araq commented 3 years ago

A couple of remarks:

a ∧ (b ∨ c) won't ever make it into Nim's stdlib/core for the simple reason that and / or are more readable and very special: They are control flow, in a or b the b is not evaluated if a is true. This means they are not the common logical operators.

Unicode based operators should be reserved for other math heavy packages and the stdlib should avoid them. At least for the first couple of years.

Libraries offering for example ⊗(x,y) but not kronecker(x,y). This should be discouraged

IMO it's the opposite. If you offer ⊗, only offer ⊗, aliases are frequently more confusing and don't scale well. "Is there a difference between ⊗ and kronecker?" Now that we have ⊗= we also need kroneckerAsgn...

konsumlamm commented 3 years ago

IMO it's the opposite. If you offer ⊗, only offer ⊗, aliases are frequently more confusing and don't scale well. "Is there a difference between ⊗ and kronecker?" Now that we have ⊗= we also need kroneckerAsgn...

I disagree. Many people (including myself) wouldn't want to use unicode operators, simply because it's not easy to type. You either need some keyboard layout that supports unicode operators, or copy-paste them from the docs every time you wanna use them. So as a result, they probably just wouldn't use libraries that only provide unicode operators (without ASCII alternatives), which could be avoided very easily.

HugoGranstrom commented 3 years ago

I agree with @konsumlamm that libraries should offer ASCII alternatives, at least for the base functions, so that you can write code that performs the same tasks (although in slightly different ways) with either the Unicode or the ASCII alternatives. For example, a ⊗= b is equivalent (although perhaps more efficient) to a = kronecker(a, b). So both kinds of users would be able to use the library.

A note on inputting Unicode characters on Windows: several options, like Windows + . and WinCompose, have been brought up that aren't too convoluted to use, although I haven't checked whether they support all of the symbols proposed in the RFC.

whiterock commented 3 years ago

a ∧ (b ∨ c) won't ever make it into Nim's stdlib/core for the simple reason that and / or are more readable and very special: They are control flow, in a or b the b is not evaluated if a is true. This means they are not the common logical operators.

I personally don't care if it goes into stdlib, but with regards to the special treatment / short circuiting you mention: Why would something like this not do it?

template `∧` (a, b: untyped): untyped =
  a and b
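A minimal sketch of why a template keeps the short-circuit behaviour while an ordinary proc would not, assuming Nim 2.0 or later (or Nim 1.6 with --experimental:unicodeOperators). The helper noisy and the proc strictAnd are hypothetical names introduced only for this illustration:

```nim
var calls = 0

proc noisy(x: bool): bool =
  ## Hypothetical helper that counts how often it is evaluated.
  inc calls
  x

template `∧`(a, b: untyped): untyped =
  a and b          # expanded in place, so `and` still short-circuits

proc strictAnd(a, b: bool): bool =
  a and b          # both arguments are evaluated before the call

discard false ∧ noisy(true)
doAssert calls == 0      # the right operand never ran

discard strictAnd(false, noisy(true))
doAssert calls == 1      # proc arguments are evaluated eagerly
```

Because the template substitutes its arguments as unevaluated expressions, the right-hand side is only evaluated when `and` needs it.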

Thoughts about unicode editor support

One option that comes to mind and which should be easy is replicating what Julia does i.e. expanding e.g. \pm to ± on tab or automatically. ultisnips for nvim comes to mind, or hypersnips for vscode - should be straightforward and useful for the mathy crowd which already knows latex.

EDIT: Just saw, there is still a good discussion about that going on in the forums. Specifically this post and the following https://forum.nim-lang.org/t/2968#52222

metagn commented 3 years ago

Maybe this can become some kind of source code filter?

#? unicodeops
echo a ∧ (b ∨ c)
echo a ± b
echo a ⊗ b

var a, b, c: bool
echo `and`(a, `or`(b, c))
echo plusMinus(a, b)
echo kronecker(a, b)

Each operator would translate to a call to a name, so you don't have to define procs using the operators. I don't know if it's fine for the compiler to decide this operator-to-name table. It's also not really like the other source code filters in the changes it makes, so I don't know if it's the best fit.

a ⊗= b could also get translated to a = kronecker(a, b) instead of kroneckerAsgn(a, b), but I think it should just not be supported since it's not fully unicode. It could cause a lexer error on these operators, as well as ∧∨ and ∨+.

a-mr commented 3 years ago

@konsumlamm , @HugoGranstrom ,

There are at least 2 types of LaTeX-inspired input methods:

1. Various editor plugins for VS Code, Vim, Emacs and other editors. One can also install the Julia plugin for all those editors; it works for non-Julia files too!
2. OS input methods:
- For Windows you can use https://github.com/clarkgrubb/latex-input#windows-install. I've just checked it, it works. It's claimed to work for Mac OS X also (I did not check).
- For Linux there is ibus-table-latex already in the repositories; it has always worked for me: \otimes → ⊗. There is also the similar fcitx-table-latex.

Usually you just start typing sequence \ot, a completion window appears and then you press space or tab and the symbol is input!

Also note that it's called otimes, not kronecker, in LaTeX, since the symbol can be used for other things.

xigoi commented 3 years ago

For Vim/Neovim you don't need any plugins, it has this built-in.

:help digraphs
:help i_ctrl-k

Araq commented 3 years ago

I personally don't care if it goes into stdlib, but with regards to the special treatment / short circuiting you mention: Why would something like this not do it?

Yes but we usually model control flow via keywords (for/while/if/and/or). But even if not, we already have and/or and don't need aliases for existing operators.

aniou commented 3 years ago

I have objections about the readability and distinguishability of the proposed operator characters (I'm sensitive to that topic because in the past I have had enough problems with serial numbers built around the small Latin 'l' and the number '1'). Of course, there is the argument "You should choose a proper font", but that's not always possible: sometimes we don't have full control over how code is presented or which computer is used, or we even have different views on "readability" than the person who prints or publishes the code.

For example, the following sets look the same to me, both in my browser and in my terminal:

⊗ ⊛ ⊕
∙ ∘
⊞ ⊠
∪ ∨

In addition, on my terminal these look almost identical: "u ∪ U", "v∨V" and "x×". [Screenshot: Screenshot_20210620_160237]

Araq commented 3 years ago

∪ ∨ × look too much like u v x indeed. The RFC should exclude those.

Vindaar commented 3 years ago

∪ ∨ × look too much like u v x indeed. The RFC should exclude those.

I don't really see the big issue with these. They are binary operators after all, which the letters u, v and x cannot be. Thus if it appears as an infix one knows it's the unicode character from context.

edit: thinking about this more, there is one issue. Namely for normal infix operators we can write the operation without a space: a + b vs a+b. If one allows the same for these unicode operators it does become rather ambiguous.

Araq commented 3 years ago

Maybe this can become some kind of source code filter?

Sure, maybe, but it doesn't feel like an elegant solution.

whiterock commented 3 years ago

In addition, on my terminal these look almost identical: "u ∪ U", "v∨V" and "x×":

Yes this is indeed a problem for low-res fonts or displays. I guess it boils down to the typical user of nim vs julia, they probably don't care about such things and assume young people with recent hardware and good eyesight. I am not sure this RFC was a good idea after all or rather that it just might not be such a good fit for nim, but my worries aside, ofc I (!) would still love to see it and thankfully I am not in charge to decide! :)

∪ ∨ × look too much like u v x indeed. The RFC should exclude those.

Okay, then we're down to: ± ⊕ ⊖ ⊞ ⊟ ⊔ and for consistency we should remove ∩ ∧ ∙ ∘ ⊗ ⊘ ⊙ ⊛ ⊠ ⊡ ⊓ ⋆

Would you prefer me to update the original post?

Araq commented 3 years ago

Would you prefer me to update the original post?

Yes please. Ideally you add a small "considered but rejected" section.

a-mr commented 3 years ago

Text editors can highlight operators to prevent confusing with identifiers.

BTW example in the description already highlights ∧ with blue for Nim:

let truth = a ∧ (b ∨ c)

However without spaces Github currently does not highlight ∨:

let truth = a∧(b∨c)

whiterock commented 3 years ago

Text editors can highlight operators to prevent confusing with identifiers.

True, but going down the train of thought others have started we could argue not everybody uses syntax highlighting or "real systems programmers often ssh into somewhere where there is no syntax highlighting", but then again these people are probably not the ones wanting to use such operators in the first place. hmm i don't know, it's certainly never easy to please everyone, I for one would be happy with this tiny subset we have come up with now and am certainly very much in favor!

a-mr commented 3 years ago

Any additional restrictions are meaningless because the evil of Unicode is already inside our house! Nim already supports Unicode identifiers, proc names, etc.:

const рус = 7  # guess what language is it?
echo pyc       # it will not work
echo рус       # this will

Hint: lines 1 and 3 are Russian рус (Cyrillic alphabet) and line 2 is English pyc (Latin alphabet). Guess you are like me, cannot spot the difference?

All we can hope for is common sense of library writers (and may be static analyzers).
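A small sketch of how the two look-alike identifiers differ underneath, using escape sequences so the difference is visible in the source. The variable names are hypothetical:

```nim
import std/unicode

let cyrillic = "\u0440\u0443\u0441"   # Cyrillic letters er, u, es
let latin = "pyc"                     # plain ASCII p, y, c

doAssert cyrillic != latin            # different byte sequences
doAssert cyrillic.len == 6            # two UTF-8 bytes per Cyrillic letter
doAssert latin.len == 3
doAssert cyrillic.runeLen == 3        # still three code points
```

A tool that compares bytes or code points can therefore tell the two apart even when a reader cannot.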

Araq commented 3 years ago

⋆ might look too much like *, at least I cannot see the value in "here is another star symbol" so it should be removed from the RFC. Also please discuss potential Unicode normalization issues and how to deal with them.

whiterock commented 3 years ago

Any additional restrictions are meaningless because the evil of Unicode is already inside our house! Nim already supports Unicode identifiers, proc names, etc.:
const рус = 7  # guess what language is it?
echo pyc       # it will not work
echo рус       # this will
Hint: lines 1 and 3 are Russian рус (Cyrillic alphabet) and line 2 is English pyc (Latin alphabet). Guess you are like me, cannot spot the difference?

All we can hope for is common sense of library writers (and may be static analyzers).

This is a good point!

⋆ might look too much like *

Considering the first point - If we keep removing everything then this RFC will become pointless, especially since we already removed a lot from what Julia uses which again is only a tiny tiny fraction of what could be considered unicode operator symbols.

"here is another star symbol"

Well * is an asterisk and ⋆ is a star. But I could offer ★ instead. This cannot be confused with ⋆

xigoi commented 3 years ago

When we're at it, let's also remove three of I, l, 1 and | because they look too similar in some fonts.

haxscramper commented 3 years ago

To be honest I don't see an issue with U being similar to ∪ because well, if I write {val1, val2} ∪ {val1, val2} I would expect it to be obvious to anyone that this is a set union operator, and removing ∩ as well, for "consistency reasons" also does not make much sense to me. Same for cartesian product: [1,2,3] × [2,4,5], which seems pretty normal to me.

Just comparing abstract Unicode characters for similarity without any context whatsoever is probably not the best idea, especially for ones that are similar to letters. As @Vindaar already mentioned, it is not possible to use x as an infix operator, so expr × expr, expr ∪ expr and expr ∩ expr have only one possible interpretation.

whiterock commented 3 years ago

Just comparing abstract Unicode characters for similarity without any context whatsoever is probably not the best idea, especially for ones that are similar to letters. As @Vindaar already mentioned, it is not possible to use x as an infix operator, so expr × expr, expr ∪ expr and expr ∩ expr have only one possible interpretation.

I feel the same. I am especially sad I had to remove expr × expr (and the reason being that it looks awkward on one low res terminal) since this is imho the single most useful operator out of all operators listed, but maybe I am biased since I am writing a raytracer atm and am already so tired of writing cross(n,o) ...

HugoGranstrom commented 3 years ago

When we're at it, let's also remove three of I, l, 1 and | because they look too similar in some fonts.

Really good point! This is equally font-dependent as well as how similar they are!

If we go on with this PR we either go big and add a decent amount of unicode operators or otherwise I don't see much of a point in just including a small set of operators, it will be nearly as limiting as the current situation if the most useful operators like × aren't included.

Araq commented 3 years ago

Considering the first point - If we keep removing everything then this RFC will become pointless,

This would have been my last removal though. :-)

Araq commented 3 years ago

Just comparing abstract Unicode characters for similarity without any context whatsoever is probably not the best idea, especially for ones that are similar to letters. As @Vindaar already mentioned, it is not possible to use x as an infix operator, so expr × expr, expr ∪ expr and expr ∩ expr have only one possible interpretation.

Good points.

whiterock commented 3 years ago

Good points.

So should I readd everything then? :) (including ★ instead of ⋆)

Araq commented 3 years ago

Ok.

Araq commented 3 years ago

Please outline Unicode normalization issues and then feel free to submit a lexer/parser patch as a first implementation.

xigoi commented 3 years ago

How is normalization handled for identifiers? I think it should be the same for operators.

Araq commented 3 years ago

There isn't any Unicode normalization performed by the Nim compiler.
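To illustrate what the absence of normalization means in practice, here is a small sketch with hypothetical variable names. Two canonically equivalent spellings of the same character remain distinct byte strings, so anything that compares source text byte by byte treats them as different. The letter é is used only as a familiar example; whether any of the proposed operators has such alternative spellings is not claimed here.

```nim
import std/unicode

let composed = "\u00E9"       # é as a single code point (NFC form)
let decomposed = "e\u0301"    # e followed by a combining acute accent (NFD form)

doAssert composed != decomposed     # not equal without normalization
doAssert composed.runeLen == 1
doAssert decomposed.runeLen == 2
```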

Araq commented 3 years ago

An implementation for the upcoming 1.6 would be nice to have, 1.6 already introduces user defined numeric literals so all the tooling related lexers need updates, the lexers might as well add support for Unicode operators then.

timotheecour commented 3 years ago

See my counter proposal here: https://github.com/nim-lang/RFCs/issues/390

aniou commented 3 years ago

Nim already supports Unicode identifiers, proc names, etc.:

From my POV: unfortunately. It reminds me of the creation of Internationalized Domain Names, a thing that doesn't provide any useful (or even "usable") feature and nowadays is used almost solely by abusers and phishers. Python has allowed UTF-based identifiers since version 3, if I recall correctly, and so far they are used for deliberate code obfuscation and by clueless newbies who write unmaintainable snippets of code in their native languages.

All we can hope for is common sense of library writers (and may be static analyzers).

In security and reliability areas, the "common sense of users or developers" is the last thing that we should depend on. ;)

Vindaar commented 3 years ago

Sorry in advance for maybe sounding a bit snarky, but I'm a bit tired of this kind of FUD.

Nim already supports Unicode identifiers, proc names, etc.:

From my POV: unfortunately.

Your POV, exactly.

It reminds me of the creation of Internationalized Domain Names, a thing that doesn't provide any useful (or even "usable") feature

I'm sorry, but this is a bad take and I don't even know where to start. The use cases of these two things are so very different.

Python has allowed UTF-based identifiers since version 3, if I recall correctly, and so far they are used for deliberate code obfuscation

I'm pretty sure you're just focusing on what you want. In that line of thought you might as well say "higher level languages" are bad, because they allow bad actors to write code more easily.

and by clueless newbies who write unmaintainable snippets of code in their native languages.

You should be very aware that for people who are just starting to code, the characters used are the least of all problems in understanding what's going on.

All we can hope for is common sense of library writers (and may be static analyzers).

In security and reliability areas, the "common sense of users or developers" is the last thing that we should depend on. ;)

Here, I wrote a super sophisticated analyzer for you. It makes it easy for you to decide which files to use in your projects:

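import os, unicode

proc main(f: string) =
  # Walk the file rune by rune; any rune wider than one byte is non-ASCII.
  for r in runes(f.readFile):
    if r.size > 1:
      echo "I'm a bad, bad file! :("
      quit(1)  # fail on the first non-ASCII code point
  echo "I'm a safe file :)"

# Usage: pass the path of the file to check as the first argument.
main(paramStr(1))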

aniou commented 3 years ago

Here, I wrote a super sophisticated analyzer for you. It makes it easy for you to decide which files to use in your projects:

And you made a simple mistake: your program doesn't recognize identifiers and string literals. It looks like you unintentionally made an example that supports my argument: dealing with Unicode is only seemingly simple.
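As a rough sketch of a slightly more informative check, the hypothetical proc nonAsciiLines below reports which lines contain multi-byte characters instead of rejecting the whole file. It still does not tell identifiers from string literals or comments, which is exactly the limitation raised above; doing that would need a real lexer.

```nim
import std/[strutils, unicode]

proc nonAsciiLines(source: string): seq[int] =
  ## Returns the 1-based numbers of lines containing any multi-byte rune.
  ## Assumption: `source` is valid UTF-8 text.
  var lineNo = 0
  for line in source.splitLines:
    inc lineNo
    for r in line.runes:
      if r.size > 1:
        result.add lineNo
        break              # one hit per line is enough
```

For example, passing the contents of a file read with readFile yields the lines a reviewer would want to look at by hand.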

a-mr commented 3 years ago

@whiterock @Araq

I suggest to enforce that such operators are always surrounded by white spaces.

If such an operator, e.g. (b∨c), is not — then a syntax error will be generated like:

file(line,column) Error: No white space before and after infix unicode operator '∨'

Rationale:

- Not all text editors can detect or highlight Unicode operators. Spaces make it possible to tell what is what by position, as @Vindaar noticed.
- Even in 2021 most terminals/fonts have problems determining the proper width of a symbol, so it had better be surrounded by white space to prevent cropping.

(Symbol widths in Unicode "monospace" fonts are not constant! See e.g. https://github.com/nim-lang/RFCs/issues/279)
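A minimal sketch of the whitespace rule as a standalone check on one line of source text. This is not how the compiler's lexer would do it; the names unicodeOps, isUnicodeOp and unspacedOps are hypothetical, and the operator list is the one from the RFC description:

```nim
import std/unicode

# Operators proposed in the RFC, kept as UTF-8 strings.
const unicodeOps = ["±", "⊕", "⊖", "⊞", "⊟", "∪", "∨", "⊔",
                    "∙", "∘", "×", "★", "⊗", "⊘", "⊙", "⊛",
                    "⊠", "⊡", "∩", "∧", "⊓"]

proc isUnicodeOp(r: Rune): bool =
  ## True if `r` is one of the proposed operator characters.
  for op in unicodeOps:
    if op.runeAt(0) == r: return true

proc unspacedOps(line: string): seq[string] =
  ## Returns every proposed operator in `line` that is not
  ## surrounded by whitespace on both sides.
  let rs = line.toRunes
  for i, r in rs:
    if isUnicodeOp(r):
      let spaceBefore = i > 0 and rs[i - 1].isWhiteSpace
      let spaceAfter = i < rs.high and rs[i + 1].isWhiteSpace
      if not (spaceBefore and spaceAfter):
        result.add $r
```

For the line `let t = a ∧ (b∨c)` this flags only ∨, since ∧ has a space on each side. An operator at the very start or end of a line is also flagged by this sketch.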

Araq commented 3 years ago

In security and reliability areas, the "common sense of users or developers" is the last thing that we should depend on.

Syntax is not semantics. Security and reliability are about semantics. Feel free to ban everything in the name of security and reliability: Dependencies to 3rd party software, exceptions, dynamic memory management, if-statements, loops, recursion, pointers, arrays, immutable data (can lead to memory leaks), mutable data, "magic numbers", identifiers without underscores, single letter identifiers, function overloading, operator overloading, goto statements, dynamic function calls, floating point numbers, unsigned numbers, ...

You can construct good arguments against every language feature from this list. What does this tell us? Your argument is indeed invalid.

I mean, by your logic, https://github.com/Battelle/movfuscator would be heaven for security and reliability: only a single instruction to learn/review/test, only linear control flow! In reality it's pretty close to the worst programming system I can imagine for static analysis...

aniou commented 3 years ago

Syntax is not semantics. Security and reliability are about semantics.

I disagree. Readability and unambiguity have an impact on code quality and maintainability, and thus on security and reliability. For example, a syntax that allows underscores in numeric literals leads to fewer "order-of-magnitude errors", because 1_000_000_000 is more readable than 1000000000.

Another example comes from Perl, where the style guide encourages aligning items vertically, because the human brain is usually good at patterns and able to spot errors (such as duplicated lines after a quick copy/update) in vertically aligned assignments or tables: they show up as regularities or irregularities that can be easily spotted.

I like Nim because of the simplicity and readability of its syntax, which places it at the opposite end from, for example, Rust.

Araq commented 3 years ago

Well, the people here argue that Unicode operators make their code more readable and you repeat the usual irrelevant "but it can be misused" argument. (Since everything can be misused, that means the argument doesn't hold any water.)

I disagree. Readability and unambiguity have an impact on code quality and maintainability, and thus on security and reliability

This connection is very weak. If you care about security and reliability use formal methods and test the heck out of your system. Linux's quality (or lack thereof) is not related to C's quirky syntax (but may be related to C's semantics).

xigoi commented 3 years ago

not all text editors can detect|highlight unicode operators. Spaces allow to understand what is what by the position as @Vindaar noticed

even in 2021 most terminals/fonts have problems determining the proper width of a symbol, so it had better be surrounded by white space to prevent cropping (symbol widths in Unicode "monospace" fonts are not constant! see e.g. #279, "add terminalWidth(a: openArray[char]): int to get number of terminal cells of utf8 string")

If you use tools that don't support Unicode in 2021, that's your problem. Nim is a modern language.

Araq commented 3 years ago

If you use tools that don't support Unicode in 2021, that's your problem. Nim is a modern language.

That's way too harsh. Nim tries to find compromises that don't depend on too specific tooling. However, the Unicode operators would be used by people who have tools that can handle Unicode, the stdlib wouldn't use them. In the past "the standard library sets the guidelines" worked very well for us, we don't nanny people all the time.

akbcode commented 3 years ago

I would suggest that such operators should be required to be annotated with a pragma like operator or something for giving the operator a name.

func `⊗`(a, b: Mat): Mat {.operator: kronecker.} = ...

They can then be invoked as either

a ⊗ b
a.kronecker(b)
kronecker(a, b)
a.kronecker b

This will let people avoid Unicode operators in their code base if they wish. It could also be used by nimsuggest to provide more information about the symbol. It would also be useful for custom user-defined operators like /% which have no inherent meaning.

Araq commented 3 years ago

It's not hard to use this in your code:


template kronecker(a, b: untyped): untyped = a ⊗ b

No additional pragmas required. The language doesn't have to become your nanny.
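A self-contained sketch of that alias approach, assuming Nim 2.0 or later (or Nim 1.6 with --experimental:unicodeOperators). The vector-based ⊗ is a hypothetical toy stand-in for a real matrix type:

```nim
func `⊗`(a, b: seq[int]): seq[int] =
  ## Kronecker product of two vectors (toy stand-in for matrices).
  for x in a:
    for y in b:
      result.add x * y

# ASCII alias written in user code; no pragma or language support needed.
template kronecker(a, b: untyped): untyped = a ⊗ b

let u = @[1, 2]
let v = @[3, 4]
doAssert u ⊗ v == @[3, 4, 6, 8]
doAssert kronecker(u, v) == u ⊗ v
```

The alias lives entirely in the caller's code base, so a project that prefers ASCII names can define it once and never type the Unicode symbol again.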

akbcode commented 3 years ago

Well it can be done with a macro without affecting the language. Support for additional operators would be a welcome addition anyway.

I would hope though that the approved operators are no longer allowed to be a part of identifiers without quoting with ``. This works right now.

let ±x = 10
echo ±x

Araq commented 3 years ago

I would hope though that the approved operators are no longer allowed to be a part of identifiers without quoting with ``. This works right now.

Correct and it's why it would be a breaking change. But a pretty tame one.

Araq commented 3 years ago

Has been implemented.
