orc-lang / orc

Orc programming language implementation
https://orc.csres.utexas.edu/
BSD 3-Clause "New" or "Revised" License
41 stars 3 forks source link

The Orc parser allows zero width space ("\u200B") in identifiers #217

Closed arthurp closed 6 years ago

arthurp commented 6 years ago

Zero width space ("\u200B") is other--format not white space. This means it's not treated as a token boundary or as punctuation, so Orc just allows it to be used as part of identifiers. However it's invisible so it's super hard to find when it's preventing a method lookup from succeeding.

We should probably just disallow the use of characters in that category entirely or at least warn about them or something.

This issue appear because of copy-pasting a class name from Java doc, so this is not totally academic.

jthywiss commented 6 years ago

The ref manual section 10.2.2 says:

Orc scans identifiers per Unicode Standard Annex #31, Unicode Identifier and Pattern Syntax, namely:

  • Identifiers start with "Characters having the Unicode General_Category of uppercase letters (Lu), lowercase letters (Ll), titlecase letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers (Nl)".
  • Identifiers can continue with "All of the above, plus characters having the Unicode General_Category of nonspacing marks (Mn), spacing combining marks (Mc), decimal number (Nd), connector punctuations (Pc)", plus Orc's addition of apostrophe as a "prime" mark.
  • All identifiers are normalized to Unicode Normalization Form C as they are parsed.

So, if the lexical scanner is permitting U+200B, that's a bug.

jthywiss commented 6 years ago

Looks like Java's Character.isUnicodeIdentifierPart bizarrely claims that most control characters (including U+200B) "may be part of a Unicode identifier". UAX 31 begs to differ....

OrcLexical.identChar relies on this erroneous method.

arthurp commented 6 years ago

It turns out that Scala has the same behavior with respect to this character. At least we are in illustrious company with this deeply confusing behavior.

arthurp commented 6 years ago

@jthywiss I think I know why isUnicodeIdentifierPart returns true for Cf characters like U+200B: UAX 31 2.3 specifies that control characters should be allowed in cases where they are needed to make the typesetting make sense in the language (to over simplify). This doesn't make it correct, because at least my reading of that section says that even at the most liberal only ZWJ and ZWNJ should be allowed. I also think that isIdentifierIgnorable​ hints that maybe Java sometimes ignores these characters for comparisons (like case in old case preserving but insensitive file systems). Basically, yes this is a bit stupid and we should fix it on our end, but I think I know how this happened.

jthywiss commented 6 years ago

When we wrote that section of the Orc ref manual, Java's Character class was based on Unicode 4.0. UAX 31 at the time said "[control characters] should normally be ignored for the purposes of identifier definition. Implementations that cannot ignore characters in identifiers should exclude these characters." So yes, Java was expecting that we filter out these characters.

The current docs say that Java 8's Character class is now based on Unicode 6.2, but the Character.isUnicodeIdentifierPart implementation hasn't been kept up-to-date. UAX 31 changed significantly in Unicode 4.1.

In any case, the immediate fix is to disallow default-ignorable characters. We're not up-to-date with the latest UAX 31, but we will be consistent with the ref manual.