no-context / moo

Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.
BSD 3-Clause "New" or "Revised" License

Unicode identifiers #70

Closed by ghost 7 years ago

ghost commented 7 years ago

What would be a good rule to match Unicode identifiers? I'd like to match Swift-like identifiers: basically any printable Unicode string that does not start with a numeral.

kach commented 7 years ago

A regex like [^0-9\s][^\s]* should work.
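For a quick feel of what this pattern accepts, here is a minimal sketch using a plain sticky (`/y`) RegExp rather than Moo itself. The helper `matchAt` and the sample inputs are hypothetical, chosen only for illustration.

```javascript
// The pattern suggested above, with the sticky flag so it can only match
// exactly at lastIndex (the same kind of anchoring /y gives a lexer).
const ident = /[^0-9\s][^\s]*/y;

// Hypothetical helper: try to match `re` at offset `pos` of `input`.
function matchAt(re, input, pos) {
  re.lastIndex = pos;
  const m = re.exec(input);
  return m ? m[0] : null;
}

console.log(matchAt(ident, 'na\u00efve = 1', 0)); // 'naïve'
console.log(matchAt(ident, '1abc', 0));           // null (starts with a digit)
console.log(matchAt(ident, 'foo+bar', 0));        // 'foo+bar'
```

Note the last case: the pattern runs up to the next whitespace, so operators and punctuation are swallowed into the match.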

ghost commented 7 years ago

That seems too greedy.

I need to match Identifiers, as defined by

https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/LexicalStructure.html

tjvr commented 7 years ago

You should be able to write a RegExp based on the Unicode ranges given.

PS. Since this is really to do with RegExps in general, and not specific to Moo, you can always ask this question on StackOverflow. :)
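A rough sketch of the range-based approach, assuming an ES2015+ engine that supports the `u` and `y` flags. The range tables below are only a small illustrative subset and should be replaced with the full lists from the language reference; the names `headRanges`, `continueRanges`, `toClass`, and `matchAt` are made up for this example.

```javascript
// Illustrative subset of code point ranges, NOT the complete Swift tables.
// Characters allowed at the start of an identifier:
const headRanges = [
  [0x41, 0x5a], [0x61, 0x7a], [0x5f, 0x5f],   // A-Z, a-z, _
  [0xc0, 0xd6], [0xd8, 0xf6], [0xf8, 0xff],   // Latin-1 letters
  [0x100, 0x2ff], [0x370, 0x167f],
];
// Extra characters allowed after the first one:
const continueRanges = [[0x30, 0x39], [0x300, 0x36f]]; // digits, combining marks

// Render one code point as a \u{...} escape (valid with the `u` flag).
const esc = (cp) => '\\u{' + cp.toString(16) + '}';

// Turn a list of [lo, hi] pairs into the body of a character class.
const toClass = (ranges) =>
  ranges.map(([lo, hi]) => (lo === hi ? esc(lo) : esc(lo) + '-' + esc(hi))).join('');

const head = toClass(headRanges);
const identifier = new RegExp(
  '[' + head + '][' + head + toClass(continueRanges) + ']*',
  'uy'
);

// Hypothetical helper: try to match `re` at offset `pos` of `input`.
function matchAt(re, input, pos) {
  re.lastIndex = pos;
  const m = re.exec(input);
  return m ? m[0] : null;
}

console.log(matchAt(identifier, 'caf\u00e9 = 1', 0)); // 'café'
console.log(matchAt(identifier, 'foo+bar', 0));       // 'foo'
console.log(matchAt(identifier, '1abc', 0));          // null
```

Because the pattern is generated from data, extending it is a matter of adding ranges to the tables rather than editing a long character class by hand.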

ghost commented 7 years ago

@tjvr I'll kill myself before asking the karma whores at StackOverflow. I'll figure it out.

bd82 commented 7 years ago

@notsonotso

See how Acorn (an ECMAScript parser) implements complex Unicode identifiers: https://github.com/ternjs/acorn/blob/master/src/identifier.js#L28-L34

The resulting RegExp is kind of horrible, but at least it is generated and not written "by hand".

ghost commented 7 years ago

@bd82 Excellent, thanks