waxeye-org / waxeye

Waxeye is a parser generator based on parsing expression grammars (PEGs). It supports C, Java, JavaScript, Python, Racket, and Ruby.
https://waxeye-org.github.io/waxeye/index.html

Support Unicode escapes in the grammar #49

Closed glebm closed 7 years ago

glebm commented 7 years ago

The syntax I like most is:

\uXXXXXX

Other alternatives are:

\u{XXXXXX}

And:

\x{XXXXXX}

In all cases, 1 to 6 hex digits are accepted.
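
To make the three candidates concrete, here is a minimal Python sketch of how each might be recognized. The names `HEX`, `CANDIDATES`, and `code_point` are hypothetical and are not taken from the Waxeye source. The only rule assumed is the one stated above: 1 to 6 hex digits.

```python
import re
from typing import Optional

# Hypothetical patterns, one per proposed syntax; not Waxeye's actual grammar.
HEX = r"[0-9A-Fa-f]{1,6}"  # "1 to 6 hex digits are accepted"
CANDIDATES = {
    "bare": re.compile(r"\\u(" + HEX + r")"),          # \uXXXXXX
    "u_braced": re.compile(r"\\u\{(" + HEX + r")\}"),  # \u{XXXXXX}
    "x_braced": re.compile(r"\\x\{(" + HEX + r")\}"),  # \x{XXXXXX}
}

def code_point(syntax: str, text: str) -> Optional[int]:
    """Return the code point if `text` is exactly one escape in `syntax`, else None."""
    m = CANDIDATES[syntax].fullmatch(text)
    return int(m.group(1), 16) if m else None
```

Note that six hex digits can express values above U+10FFFF. This sketch does not reject them, and a real implementation would probably want to.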

Open question: should this be supported in literals, or only in character classes? Decided: Unicode escapes should be allowed in both literals and character classes.

orlandohill commented 7 years ago

I'm fine with \uXXXXXX.

Is there a reason why you wouldn't want to allow Unicode escapes in literals? They're basically already in the grammar with Hex; I just never got around to implementing Unicode support before development stopped.

glebm commented 7 years ago

No reason. They should be allowed, so that settles it. I didn't realize Hex escapes were already allowed.

glebm commented 7 years ago

While implementing this, I realized that with \uXXXXXX there is no way to distinguish between

"\uAAAb" (the escape \uAAA followed by an actual letter b) and "\uAAAb" (the single escape \uAAAb)

While this can be worked around by escaping the b as well, it causes trouble for generated grammar files: everything would need to be escaped, or the generator would need to look behind to decide whether to escape. For this reason, I decided on \u{XXXXXX}.
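
The ambiguity, and why braces remove it, can be illustrated with a small Python sketch. It assumes a decoder that greedily consumes up to 6 hex digits. `BARE`, `BRACED`, `decode`, and `emit` are hypothetical names for illustration and are not Waxeye's implementation.

```python
import re

# Hypothetical decoders for the two syntaxes; not taken from Waxeye.
BARE = re.compile(r"\\u([0-9A-Fa-f]{1,6})")        # greedy: takes every hex digit it can
BRACED = re.compile(r"\\u\{([0-9A-Fa-f]{1,6})\}")  # the closing brace ends the escape

def decode(pattern, text):
    """Replace each escape matched by `pattern` with the character it names."""
    return pattern.sub(lambda m: chr(int(m.group(1), 16)), text)

def emit(text):
    """Generator-side escaping: write every non-ASCII character as \\u{...}.

    No look-behind is needed, because the brace delimits the escape.
    Literal backslashes in the input are not handled in this sketch.
    """
    return "".join(c if ord(c) < 0x80 else "\\u{%X}" % ord(c) for c in text)

# Unbraced: the trailing "b" is swallowed as a fourth hex digit.
assert decode(BARE, r"\uAAAb") == chr(0xAAAB)

# Braced: the two readings are written differently and decode differently.
assert decode(BRACED, r"\u{AAA}b") == chr(0xAAA) + "b"
assert decode(BRACED, r"\u{AAAb}") == chr(0xAAAB)
```

With the braced form, a grammar generator can emit `\u{AAA}` followed by a plain `b` without checking what comes next.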

orlandohill commented 7 years ago

Good, makes sense.