Feature request: Unicode properties

ianh / owl

A parser generator for visibly pushdown languages.

MIT License

Feature request: Unicode properties #35

Open data-man opened 1 year ago

data-man commented 1 year ago

Owl is awesome, thank you!

My proposals:

range(cp1, cp2) or range[cp1, cp2] - cp1 and cp2 are codepoints here (hex or decimal)
block(name) - Unicode's block name (Basic_Latin, Latin-1_Supplement, etc.)
property(name) - Unicode's property name or General_Category value (White_Space, Hyphen, Ps, Mn, etc.)
script(name) - Unicode's script name (Common, Latin, etc.)
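
To make the first two proposals concrete, here is a minimal sketch in C of what range(cp1, cp2) and block(name) could mean as code point tests. The names in_range, in_block, and the blocks table are hypothetical and not part of Owl. The table lists only the two blocks named above; a real implementation would presumably be generated from the Unicode Character Database.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: range(cp1, cp2) read as an inclusive code point test. */
static bool in_range(uint32_t cp, uint32_t lo, uint32_t hi) {
    assert(lo <= hi);
    return lo <= cp && cp <= hi;
}

/* A block is a named, contiguous range of code points. */
struct block {
    const char *name;
    uint32_t first, last;
};

/* Illustrative subset only: the two blocks mentioned in the proposal. */
static const struct block blocks[] = {
    { "Basic_Latin",        0x0000, 0x007F },
    { "Latin-1_Supplement", 0x0080, 0x00FF },
};

/* Hypothetical helper: block(name) as a table lookup followed by a range test. */
static bool in_block(uint32_t cp, const char *name) {
    for (size_t i = 0; i < sizeof blocks / sizeof blocks[0]; i++) {
        if (strcmp(blocks[i].name, name) == 0)
            return in_range(cp, blocks[i].first, blocks[i].last);
    }
    return false; /* unknown block name */
}
```

property(name) and script(name) do not reduce to a single interval each, so they would likely need a sorted table of ranges per name rather than one entry.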

What do you think?

ianh commented 1 year ago

Something like this would be possible, but at the moment, every token can be separated by whitespace. For example, if you had a rule like ident = property(ID_Start) property(ID_Continue)*, identifiers would include things like abc but also a b c d. The best way to make custom identifiers right now is via user-defined tokens, which involves writing a bit of code in a C function and passing it to the generated parser to use during tokenization.
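
As a rough illustration of the "bit of code in a C function" mentioned above, the sketch below shows only the matching logic such a function might contain: it returns the byte length of the identifier at the start of a UTF-8 string. Everything here is an assumption for illustration. match_identifier, decode_utf8, is_id_start, and is_id_continue are hypothetical names, the two predicates cover only ASCII and Latin-1 letters rather than the real ID_Start and ID_Continue tables, and the decoder does not reject overlong encodings or surrogates. How the function is wired into the generated parser is not shown; see Owl's documentation on user-defined tokens for that.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence at s into *cp. Returns the number of bytes
   consumed, or 0 at the end of the string or on a malformed sequence. */
static size_t decode_utf8(const char *s, uint32_t *cp) {
    const unsigned char *p = (const unsigned char *)s;
    size_t n;
    if (p[0] == 0) return 0;
    if (p[0] < 0x80) { *cp = p[0]; return 1; }
    else if ((p[0] & 0xE0) == 0xC0) { *cp = p[0] & 0x1F; n = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { *cp = p[0] & 0x0F; n = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { *cp = p[0] & 0x07; n = 4; }
    else return 0;
    for (size_t i = 1; i < n; i++) {
        /* A NUL terminator fails this check, so we never read past it. */
        if ((p[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (p[i] & 0x3F);
    }
    return n;
}

/* Stand-in for ID_Start: ASCII letters plus Latin-1 letters only. */
static bool is_id_start(uint32_t cp) {
    return (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z') ||
           (cp >= 0xC0 && cp <= 0xFF && cp != 0xD7 && cp != 0xF7);
}

/* Stand-in for ID_Continue: the above plus ASCII digits and underscore. */
static bool is_id_continue(uint32_t cp) {
    return is_id_start(cp) || (cp >= '0' && cp <= '9') || cp == '_';
}

/* Byte length of the identifier at the start of s, or 0 if there is none. */
static size_t match_identifier(const char *s) {
    assert(s != NULL);
    uint32_t cp;
    size_t n = decode_utf8(s, &cp);
    if (n == 0 || !is_id_start(cp)) return 0;
    size_t len = n;
    while ((n = decode_utf8(s + len, &cp)) != 0 && is_id_continue(cp))
        len += n;
    return len;
}
```

Because the identifier is matched as one token, whitespace ends it instead of being skipped between its characters: on the input a b c d this sketch matches only a.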