Only allow ASCII in identifiers to prevent homograph attacks?

Anton-4 commented 2 years ago

https://en.wikipedia.org/wiki/IDN_homograph_attack

folkertdev commented 2 years ago

This is a serious limitation for any non-English language. E.g. many tutorials for children use the native language for variable names, and if that's an error, that's kinda not great.

Also, we have some code at work that uses Dutch words because that's the domain we're modelling. (Dutch uses mostly ASCII, but French, German, the Scandinavian languages, ... fundamentally don't.)

And this is not even considering Arabic or Japanese.

Anton-4 commented 2 years ago

Good points, I've added a question mark to the issue title to show that it is open to discussion.

rtfeldman commented 2 years ago

Original motivation for looking into this was the appendix note here: https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html

Reading between the lines, it seems that Rust originally supported non-ASCII identifiers and then switched to ASCII only prior to 1.0 - perhaps because you can later add support for non-ASCII (e.g. with security checks in the compiler, such as the ones they're apparently planning to do inside strings) as a nonbreaking change, whereas later disallowing certain identifiers would be a breaking change!

Currently we allow Unicode alphabetic characters (which was a choice I made when writing the original parser, based on wanting to have nice support for non-English languages), but I'm not sure what attacks that permits.

It's possible we should do what Rust did, but I'm not sure!
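To make the concern concrete, here is a minimal Python sketch of a homograph pair. It assumes Python's `str.isalpha` as a rough stand-in for "Unicode alphabetic"; Roc's actual parser rule may differ, and the identifier `paypal` is only an illustration.

```python
import unicodedata

latin = "paypal"            # all Latin letters
spoof = "p\u0430yp\u0430l"  # U+0430 CYRILLIC SMALL LETTER A replaces each Latin "a"

def describe(identifier: str) -> list:
    """Return the Unicode name of each character in the identifier."""
    return [unicodedata.name(ch) for ch in identifier]

# Both strings pass an "alphabetic characters only" check...
print(latin.isalpha(), spoof.isalpha())
# ...but they are different strings, so they would be different identifiers.
print(latin == spoof)
# The second character of the spoof is not a Latin letter.
print(describe(spoof)[1])
```

In many fonts the two strings render identically, which is what makes this kind of substitution hard to spot in review.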

Anton-4 commented 2 years ago

Encouraging the use of (learning) English by only allowing ASCII for identifiers also has benefits.

If you're a company building software, you might wish you built your codebase in English as this gives you the ability to hire anyone in the world who knows Roc as opposed to only those who speak the same language and know Roc.
You will have the largest amount of up-to-date documentation available.
It's easy to share your code when you need help.

I've reaped a lot of benefits from being encouraged to learn English because a lot of cool stuff was not available in my language.

brian-carroll commented 2 years ago

ASCII might be an overly extreme restriction for this. Seems like the problem is with a category of code points that are about switching text direction, or zero-width spaces, and so on.
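As a rough illustration of targeting just that category, the Python sketch below flags format characters (general category `Cf`, which includes zero-width spaces) and explicit bidirectional controls. The function name `suspicious_code_points` and the choice to flag all of `Cf` are assumptions for the example, not an established rule set.

```python
import unicodedata

# Bidirectional classes of the explicit embedding, override, and isolate controls.
BIDI_CONTROL_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

def suspicious_code_points(source: str) -> list:
    """Return (index, 'U+XXXX') pairs for format or bidi control characters."""
    hits = []
    for i, ch in enumerate(source):
        is_format = unicodedata.category(ch) == "Cf"
        is_bidi_control = unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES
        if is_format or is_bidi_control:
            hits.append((i, f"U+{ord(ch):04X}"))
    return hits

print(suspicious_code_points("total\u202e = 1"))  # contains a right-to-left override
print(suspicious_code_points("größe = 1"))        # ordinary non-ASCII letters
```

Flagging every `Cf` character is deliberately blunt: some scripts use zero-width joiners and non-joiners legitimately, so a real check would probably need a narrower rule.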

Python supports Unicode identifiers. This Python doc lists the character categories they allow at the start and for the remainder, and they have a more detailed spec.
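Python's rules can be probed directly with `str.isidentifier()`. The sketch below uses made-up sample strings to show what the category-based rules accept and reject; it also shows NFKC normalization, which Python applies to identifiers while parsing.

```python
import unicodedata

candidates = {
    "café": "Latin with an accented letter",
    "変数": "CJK ideographs",
    "1abc": "starts with a digit",
    "a\u200bb": "contains U+200B ZERO WIDTH SPACE",
    "p\u0430ypal": "contains U+0430 CYRILLIC SMALL LETTER A",
}

# Ask Python whether each string would be accepted as an identifier.
results = {text: text.isidentifier() for text in candidates}
for text, note in candidates.items():
    print(repr(text), results[text], note)

# Compatibility characters such as the "fi" ligature (U+FB01) fold to plain letters.
print(unicodedata.normalize("NFKC", "\ufb01le") == "file")
```

The category rules reject the zero-width space but accept the Cyrillic look-alike, so they address invisible characters more than homoglyphs.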

The Java docs on identifiers say:

"Letters and digits may be drawn from the entire Unicode character set, which supports most writing scripts in use in the world today, including the large sets for Chinese, Japanese, and Korean. This allows programmers to use identifiers in their programs that are written in their native languages."

The C# docs say that they support Unicode and again there's a formal spec for the exact character categories.

If they can do it, I'm sure we can too.

Anton-4 commented 2 years ago

My takeaways are:

We should check whether our current Unicode alphabetic characters allow attacks.
We need to investigate differences between our supported characters and those of languages like Python, Java and C#.
How much effort will need to go into making thorough compiler security checks for our limited Unicode set?
If the checks would require a lot of effort, we should probably limit ourselves to ASCII until the security checks are ready.
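One candidate check is mixed-script detection. The Python sketch below is only a heuristic: the standard library does not expose the Unicode Script property, so it assumes the first word of each letter's character name (LATIN, CYRILLIC, ...) is a good enough proxy. The function names are hypothetical.

```python
import unicodedata

def scripts_used(identifier: str) -> set:
    """Approximate the set of scripts in an identifier from character names."""
    scripts = set()
    for ch in identifier:
        # Only letters count; digits and underscores are shared across scripts.
        if unicodedata.category(ch).startswith("L"):
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def is_mixed_script(identifier: str) -> bool:
    """True if the identifier's letters appear to come from more than one script."""
    return len(scripts_used(identifier)) > 1

print(is_mixed_script("p\u0430ypal"))  # Latin plus one Cyrillic letter
print(is_mixed_script("größe"))        # accented Latin only
print(is_mixed_script("переменная"))   # Cyrillic only
```

This would miss an identifier written entirely in one look-alike script, and it would wrongly flag Japanese identifiers that mix kana and kanji, so it gives a sense of the effort involved more than a finished check. Unicode Technical Standard #39 covers confusables and mixed-script detection more thoroughly.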

TimWhiting commented 2 years ago

Dart is taking the approach of providing a lint that is enabled by default. Dart lints can be disabled on a per-line or per-file basis.

Anton-4 commented 2 years ago

"Dart lints can be disabled on a per line or per file basis."

That seems risky. I'd say it's not that difficult to sneak in a line of code that disables a lint rule.

lawrencejob commented 2 years ago

I think there's some merit to a compiler flag or configuration rule that enables one or more character sets/dictionaries, if this is deemed to be a concern. As @Anton-4 points out, a per-line opt-out is either a) ugly, because you have to opt out every time you want to use a character that's native to you, or b) too easy to be tricked into, especially if you don't know what you're doing (the kind of developer happy to paste from the internet).

I don't think a project like Roc should be the one to develop an opinion on character-set dictionaries/taxonomy, so maybe there's prior art or a standard that can be the basis for such a thing.

Personally, I don't see it as a risk, and I think, architecturally, it's the role of an IDE/text editor to protect a user from pasting malicious strings (e.g. those which contain homograph attacks or similar) by highlighting them, even if most IDEs fail to carry out this role today. I think overall most languages don't see this as a threat because you have to take for granted that whatever dependency you're referencing is trusted.

I'm trying to think about how an exploit might look:

User pastes code from an unsafe source, which compiles differently to how one would expect
User pastes URL/some kind of API name/configuration for accessing a third party service
User has a dependency on third party code which has been compromised and pretends to be an API exposed by a different package
User pastes a dependent module name and the package manager downloads the incorrect module, which can masquerade as the intended module

Naturally this isn't an exhaustive list, but it seems like the takeaway might be that package managers might need a different set of rules for package names (presumably a conversation taking place elsewhere already), and editors should protect people from straying from their familiar character set.

As an aside, I think it would be a good goal for any good programming language to support any human language, and it's a shame that we live in a world where that isn't true yet (transpiling to different human languages in the editor). (I'm new to the project so I haven't had a chance to read about the spirit of the project.)

roc-lang / roc

Only allow ASCII in identifiers to prevent homograph attacks? #1862