Add built-in rules for Unicode ID_Start and ID_Continue

pdubroy commented 7 years ago

It would be nice to have some built-in rules for identifiers, so that we have a sensible default for people building new languages in Ohm. For reference, I took a look at a few different languages to see if any of them use the Unicode ID_Start/ID_Continue. Here are the results:

ES6 (spec): Yes
Rust (spec): Yes
- Uses XID_start and XID_continue
Python3 (spec): Yes
- Uses XID_start and XID_continue
F# (spec): No
- identStart: Ltmou | Nl | "_"
- identPart: identStart | "0..9" | Pc | Mn | Mc | Cf
Swift (spec): No
- Just specifies a bunch of Unicode ranges, not sure what they correspond to
Go (spec): No
- identStart: Ltmou
- identPart: identStart | Nd
Scala (spec): No
- identStart: Ltmou | Nl | "$" | "_"
- identPart: identStart | "0..9" | "_"

Not sure this makes things any clearer, but I thought it would be good to have a few data points.

pdubroy commented 7 years ago

One more note: it seems that IDStart is a subset of (Ltmou | Nl), and doesn't include `or$`. So I'm a bit confused that the Rust and Python specs say that identifiers must begin with an XID_start character.

mroeder commented 7 years ago

According to http://unicode.org/reports/tr31/#Specific_Character_Adjustments, _ and $ are optional characters for ID_Start. ES5/ES2015 adds those two characters to ID_Start but Python or Rust might not do so. Since Ohm does not allow to exclude characters when overriding, I would suggest we go with the "clean" ID_Start and ID_Continue implementation and have e.g. ES6/ES2015 extend those with the extra characters.

mroeder commented 7 years ago

One more thing I stumbled upon yesterday, is that ident is already heavily used in grammars (Ohm itself uses it, so do some of @alexwarth teaching grammars). It would be a bigger change (bootstrapping involved) to change this and on top of that I am not sure we would want rule names to span the full spectrum of (Unicode) identifier names (some of them not being valid in ES5).

pdubroy commented 7 years ago

Re: _ and $, I'm a bit confused by that Unicode document, but I think I've figured it out. From http://unicode.org/reports/tr31/#R1:

Another such profile would be to include some set of the optional characters, for example:

Start := XID_Start, plus some characters from Table 3

...where Table 3 contains exactly _ and $. So what this tells me is that _ and $ do NOT have the XID_Start property, however, this standard suggests that some languages may want to explicit add those as valid identStart characters. The unicode-9.0.0 package seems to confirm this interpretation:

> var xid_start = require('unicode-9.0.0/Binary_Property/XID_Start/regex');
undefined
> !!xid_start.exec('a')
true
> !!xid_start.exec('_')
false
> !!xid_start.exec('$')
false

What confuses me about Python and Rust is that they do both allow leading underscores in variable names...yet claim that identStart is just XID_Start. Oh well.

Anyways, now that I feel like I understand this...I agree with what you said, that our identStart rule should be "clean" and not include _ and $.

pdubroy commented 7 years ago

Re: ident: I don't think it's a problem to introduce ident as a built-in rule. It would be a breaking change, but anyone who upgrades would get an error at grammar instantiation time, and could just change their definition to use :=.

But now that I think about it, maybe it's better to not include a built-in ident rule, because of things like $ and _, and other similar exceptions (- and * in Lisp, ' in ML, ? in Ruby, etc.). However I still think unicodeIdStart and unicodeIdContinue would still be useful.

Thoughts?

mroeder commented 7 years ago

Well, we could put in an ident rule and describe it in the documentation. It should use identStart and identPart which just references unicodeIdStart and unicodeIdContinue by default. Any language that needs extra characters, like _ or $, could extend the identStart rule to include those as well.

pdubroy commented 7 years ago

Yes, that would work. However, I'd lean towards not doing it that way because we can't really provide a sensible default, and basically every grammar would need to override identStart.

This would mean that you'd need to understand rule overriding to make almost any useful grammar, which is not the case right now. (Hmmm, though I suppose to support comments you do need to extend the space rule. But ident is a bit trickier since you need to understand that the hidden definition of ident begins with identStart.)

alexwarth commented 7 years ago

I agree. My vote would be to not have a built-in ident rule (because as Pat said, we can't provide a sensible default) but keep unicodeIdStart and unicodeIdContinue.

mroeder commented 7 years ago

Currently, there is no default. Everyone has to implement an identifier rule on his/her own. With ident & co. there would be a default. A good one, that sometimes would need be extended, especially if you try to build a parser for an existing language and obsess over 100% correctness (in which case you should/need to know about overriding/extending). If you want to build your own language, which hopefully is the majority of cases, why wouldn't you go with a standardized default (that is better than letter alnum* you probably would come up with on your own)?

alexwarth commented 7 years ago

While I agree that a built-in ident rule would be convenient, I worry about the problems and confusion that may result from programmers not fully understanding what it does. E.g., suppose I want to use $ as an operator in a new language,

Exp = ident `$` ident

but "for some reason" I can't parse foo$bar as an Exp. WAT!?!?

IMO the small amount of additional work that goes into writing your own ident rule more than pays for itself because it avoids this kind of confusion.

ohmjs / ohm

Add built-in rules for Unicode ID_Start and ID_Continue #180