Open pdubroy opened 7 years ago
One more note: it seems that IDStart is a subset of (Ltmou | Nl), and doesn't include `or
$`. So I'm a bit confused that the Rust and Python specs say that identifiers must begin with an XID_start character.
According to http://unicode.org/reports/tr31/#Specific_Character_Adjustments, _
and $
are optional characters for ID_Start
. ES5/ES2015 adds those two characters to ID_Start
but Python or Rust might not do so. Since Ohm does not allow to exclude characters when overriding, I would suggest we go with the "clean" ID_Start
and ID_Continue
implementation and have e.g. ES6/ES2015 extend those with the extra characters.
One more thing I stumbled upon yesterday, is that ident
is already heavily used in grammars (Ohm itself uses it, so do some of @alexwarth teaching grammars). It would be a bigger change (bootstrapping involved) to change this and on top of that I am not sure we would want rule names to span the full spectrum of (Unicode) identifier names (some of them not being valid in ES5).
Re: _
and $
, I'm a bit confused by that Unicode document, but I think I've figured it out. From http://unicode.org/reports/tr31/#R1:
Another such profile would be to include some set of the optional characters, for example:
- Start := XID_Start, plus some characters from Table 3
...where Table 3 contains exactly _
and $
. So what this tells me is that _
and $
do NOT have the XID_Start property, however, this standard suggests that some languages may want to explicit add those as valid identStart characters. The unicode-9.0.0 package seems to confirm this interpretation:
> var xid_start = require('unicode-9.0.0/Binary_Property/XID_Start/regex');
undefined
> !!xid_start.exec('a')
true
> !!xid_start.exec('_')
false
> !!xid_start.exec('$')
false
What confuses me about Python and Rust is that they do both allow leading underscores in variable names...yet claim that identStart is just XID_Start. Oh well.
Anyways, now that I feel like I understand this...I agree with what you said, that our identStart
rule should be "clean" and not include _
and $
.
Re: ident
: I don't think it's a problem to introduce ident
as a built-in rule. It would be a breaking change, but anyone who upgrades would get an error at grammar instantiation time, and could just change their definition to use :=
.
But now that I think about it, maybe it's better to not include a built-in ident
rule, because of things like $
and _
, and other similar exceptions (-
and *
in Lisp, '
in ML, ?
in Ruby, etc.). However I still think unicodeIdStart
and unicodeIdContinue
would still be useful.
Thoughts?
Well, we could put in an ident
rule and describe it in the documentation. It should use identStart
and identPart
which just references unicodeIdStart
and unicodeIdContinue
by default. Any language that needs extra characters, like _
or $
, could extend the identStart
rule to include those as well.
Yes, that would work. However, I'd lean towards not doing it that way because we can't really provide a sensible default, and basically every grammar would need to override identStart
.
This would mean that you'd need to understand rule overriding to make almost any useful grammar, which is not the case right now. (Hmmm, though I suppose to support comments you do need to extend the space
rule. But ident
is a bit trickier since you need to understand that the hidden definition of ident
begins with identStart
.)
I agree. My vote would be to not have a built-in ident
rule (because as Pat said, we can't provide a sensible default) but keep unicodeIdStart
and unicodeIdContinue
.
Currently, there is no default. Everyone has to implement an identifier rule on his/her own.
With ident
& co. there would be a default. A good one, that sometimes would need be extended, especially if you try to build a parser for an existing language and obsess over 100% correctness (in which case you should/need to know about overriding/extending).
If you want to build your own language, which hopefully is the majority of cases, why wouldn't you go with a standardized default (that is better than letter alnum*
you probably would come up with on your own)?
While I agree that a built-in ident
rule would be convenient, I worry about the problems and confusion that may result from programmers not fully understanding what it does. E.g., suppose I want to use $
as an operator in a new language,
Exp = ident `$` ident
but "for some reason" I can't parse foo$bar
as an Exp
. WAT!?!?
IMO the small amount of additional work that goes into writing your own ident
rule more than pays for itself because it avoids this kind of confusion.
It would be nice to have some built-in rules for identifiers, so that we have a sensible default for people building new languages in Ohm. For reference, I took a look at a few different languages to see if any of them use the Unicode ID_Start/ID_Continue. Here are the results:
Not sure this makes things any clearer, but I thought it would be good to have a few data points.