roc-lang / roc

A fast, friendly, functional language.
https://roc-lang.org
Universal Permissive License v1.0
4.06k stars, 287 forks

Only allow ASCII in identifiers to prevent homograph attacks? #1862

Open Anton-4 opened 2 years ago

Anton-4 commented 2 years ago

https://en.wikipedia.org/wiki/IDN_homograph_attack

folkertdev commented 2 years ago

This is a serious limitation for any non-English language. E.g. many tutorials for children use the native language for variable names, and if that is an error, that is kinda not great.

Also, we have some code at work that uses Dutch words because that's the domain we're modelling. (Dutch uses mostly ASCII, but French, German, the Scandinavian languages, ... fundamentally don't.)

And this is not even considering Arabic or Japanese.

Anton-4 commented 2 years ago

Good points, I've added a question mark to the issue title to show that it is open to discussion.

rtfeldman commented 2 years ago

Original motivation for looking into this was the appendix note here: https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html

Reading between the lines, it seems that Rust originally supported non-ASCII identifiers and then switched to ASCII only prior to 1.0 - perhaps because you can later add support for non-ASCII (e.g. with security checks in the compiler, such as the ones they're apparently planning to do inside strings) as a nonbreaking change, whereas later disallowing certain identifiers would be a breaking change!

Currently we allow Unicode alphabetic characters (which was a choice I made when writing the original parser, based on wanting to have nice support for non-English languages), but I'm not sure what attacks that permits.

It's possible we should do what Rust did, but I'm not sure!

Anton-4 commented 2 years ago

Encouraging people to learn and use English, by only allowing ASCII in identifiers, also has benefits.

I've reaped a lot of benefits from being encouraged to learn English because a lot of cool stuff was not available in my language.

brian-carroll commented 2 years ago

ASCII might be an overly extreme restriction for this. Seems like the problem is with a category of code points that are about switching text direction, or zero-width spaces, and so on.
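To make that category concrete, here is a minimal Python sketch (not Roc compiler code) that flags every code point in Unicode general category `Cf` ("format"). That category covers the bidirectional controls (e.g. U+202E) and the zero-width characters (e.g. U+200B). The function name `find_format_chars` is made up for illustration, and treating all of `Cf` as suspicious is an assumption, not a rule anyone in this thread has proposed.

```python
import unicodedata


def find_format_chars(source: str) -> list[tuple[int, int, str]]:
    """Return (line, column, name) for each Unicode 'Cf' code point in source."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            # 'Cf' includes bidi overrides/isolates and zero-width characters.
            if unicodedata.category(ch) == "Cf":
                name = unicodedata.name(ch, f"U+{ord(ch):04X}")
                hits.append((line_no, col, name))
    return hits


# Hypothetical input: the second line hides a right-to-left override.
report = find_format_chars("x = 1\ny\u202e = 2")
```

A check like this leaves ordinary letters from any script alone, which is the point of narrowing the restriction.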

Python supports Unicode identifiers. This Python doc lists the character categories they allow at the start and for the remainder, and they have a more detailed spec.
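As a rough illustration of what Python's rule does and does not catch, the sketch below uses the built-in `str.isidentifier()`. The example strings are made up. It suggests that a category-based rule rejects invisible characters but still accepts a mixed-script lookalike, so it is not a complete answer to homographs on its own.

```python
import unicodedata

latin = "paypal"
mixed = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A (U+0430)

# Both spellings pass Python's identifier check ...
both_valid = latin.isidentifier() and mixed.isidentifier()

# ... yet they remain different names, even after NFKC normalization.
still_distinct = unicodedata.normalize("NFKC", mixed) != latin

# The same rule rejects a zero-width space and a leading digit.
zero_width_ok = "a\u200bb".isidentifier()
leading_digit_ok = "1abc".isidentifier()
```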

The Java docs on identifiers say:

> Letters and digits may be drawn from the entire Unicode character set, which supports most writing scripts in use in the world today, including the large sets for Chinese, Japanese, and Korean. This allows programmers to use identifiers in their programs that are written in their native languages.

The C# docs say that they support Unicode and again there's a formal spec for the exact character categories.

If they can do it, I'm sure we can too.

Anton-4 commented 2 years ago

My takeaways are:

TimWhiting commented 2 years ago

Dart is taking the approach of providing a lint that is enabled by default. Dart lints can be disabled on a per-line or per-file basis.

Anton-4 commented 2 years ago

> Dart lints can be disabled on a per line or per file basis.

That seems risky. I'd say it's not that difficult to sneak in a line of code that disables a lint rule.

lawrencejob commented 2 years ago

I think there's some merit to a compiler flag or configuration rule that enables one or more character sets/dictionaries, if this is deemed to be a concern. As @Anton-4 points out, it's either a) ugly to opt out every time you want to use a character that's native to you, or b) too easy to be tricked into opting out of a rule on a single-line basis, especially if you don't know what you're doing (the kind of developer happy to paste from the internet).
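A project-level allowlist of this kind could look roughly like the Python sketch below. Everything here is hypothetical: the names `scripts_used` and `disallowed_scripts` are invented, and the script of a letter is guessed from the first word of its Unicode character name, which is only a heuristic because Python's standard library does not expose the Script property. Unicode Technical Standard #39 (Unicode Security Mechanisms) covers mixed-script and confusable detection properly and would be the more likely basis for a real implementation.

```python
import unicodedata


def scripts_used(identifier: str) -> set[str]:
    """Guess the scripts in an identifier from Unicode character names."""
    scripts = set()
    for ch in identifier:
        if ch.isalpha():  # skip digits, underscores, combining marks
            name = unicodedata.name(ch, "UNKNOWN")
            # Heuristic: "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            scripts.add(name.split(" ", 1)[0])
    return scripts


def disallowed_scripts(identifier: str, allowed: set[str]) -> set[str]:
    """Return the scripts in identifier that the project has not enabled."""
    return scripts_used(identifier) - allowed


# Hypothetical project setting: only Latin letters are enabled.
project_allowed = {"LATIN"}
ok = disallowed_scripts("na\u00efve", project_allowed)
flagged = disallowed_scripts("p\u0430ypal", project_allowed)
```

Because the setting lives at the project level, a Dutch or French codebase would never need a per-line opt-out, and a Cyrillic codebase would enable its script once.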

I don't think a project like Roc should be the one to develop an opinion on character-set dictionaries/taxonomy, so maybe there's prior art or a standard that can be the basis for such a thing.

Personally, I don't see it as a risk, and I think, architecturally, it's the role of an IDE/text editor to protect a user from pasting malicious strings (e.g. those which contain homograph attacks or similar) by highlighting them, even if most IDEs fail to carry out this role today. I think overall most languages don't see this as a threat, because you must take for granted that whatever dependency you're referencing is trusted.
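As a sketch of the kind of highlighting an editor could do on paste, the hypothetical Python helper below rewrites every non-ASCII character as a visible `<U+XXXX>` marker. The name `reveal_non_ascii` and the marker format are assumptions for illustration. A real editor would more likely highlight only characters outside the user's usual scripts than rewrite the text.

```python
def reveal_non_ascii(text: str) -> str:
    """Replace each non-ASCII character with a visible <U+XXXX> marker."""
    return "".join(
        ch if ch.isascii() else f"<U+{ord(ch):04X}>"
        for ch in text
    )


# Hypothetical pasted snippet containing a Cyrillic lookalike letter.
shown = reveal_non_ascii("p\u0430ypal = 1")
```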

I'm trying to think about how an exploit might look:

Naturally this isn't an exhaustive list, but it seems like the takeaway might be that package managers might need a different set of rules for package names (presumably a conversation taking place elsewhere already), and editors should protect people from straying from their familiar character set.

As an aside, I think it would be a good goal for any good programming language to support any human language, and it's a shame that we live in a world where that isn't true yet (transpiling to different human languages in the editor). (I'm new to the project so I haven't had a chance to read about the spirit of the project.)