timtadh / lexmachine

Lex machinery for Go.

Character classes that reference other character classes #34

Open SteelPhase opened 4 years ago

SteelPhase commented 4 years ago

Would it be possible to have the regex parser support character classes like \w within other character classes? I had a regex pattern earlier that used the character class [0-9a-zA-Z_\.-], and I attempted to simplify it to [\w\.\-]. I hadn't noticed that this library doesn't support that, and was wondering how difficult it would be to implement. For the time being I'm just expanding \w to 0-9a-zA-Z_ within the character class.

timtadh commented 4 years ago

Looks like this is supported by re2 (which I mostly follow when adding new support for compatibility with Go regexp). https://github.com/google/re2/wiki/Syntax

It has been a while since I worked on the regexp parser. However, adding this support looks doable by extending the charClassItem function to support the built-in classes. The signature would need to change to support returning a list of ranges instead of just one.
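A rough sketch of the "return a list of ranges" idea, written as standalone Go rather than against the real parser. Everything here is an assumption for illustration: `byteRange`, `builtinRanges`, `classBodyRanges`, and `contains` are hypothetical names, not lexmachine's types or its actual `charClassItem` signature. The class definitions follow RE2's ASCII meanings for \d, \w, and \s.

```go
package main

import "fmt"

// byteRange is a hypothetical inclusive byte range; not a lexmachine type.
type byteRange struct{ lo, hi byte }

// builtinRanges returns the ranges for a Perl-style class escape,
// identified by the letter after the backslash.
func builtinRanges(class byte) ([]byteRange, bool) {
	switch class {
	case 'd':
		return []byteRange{{'0', '9'}}, true
	case 'w':
		return []byteRange{{'0', '9'}, {'A', 'Z'}, {'_', '_'}, {'a', 'z'}}, true
	case 's':
		// RE2 defines \s as [\t\n\f\r ].
		return []byteRange{{'\t', '\n'}, {'\f', '\r'}, {' ', ' '}}, true
	}
	return nil, false
}

// classBodyRanges parses the text between '[' and ']' and returns every
// range it denotes. Each item may now contribute several ranges.
func classBodyRanges(body string) ([]byteRange, error) {
	var out []byteRange
	for i := 0; i < len(body); {
		c := body[i]
		if c == '\\' {
			if i+1 >= len(body) {
				return nil, fmt.Errorf("trailing backslash in class %q", body)
			}
			if rs, ok := builtinRanges(body[i+1]); ok {
				out = append(out, rs...) // built-in class: many ranges
				i += 2
				continue
			}
			c = body[i+1] // escaped literal such as \. or \-
			i += 2
		} else {
			i++
		}
		// A literal followed by '-' and a plain character forms a range.
		if i+1 < len(body) && body[i] == '-' && body[i+1] != '\\' {
			hi := body[i+1]
			if hi < c {
				return nil, fmt.Errorf("invalid range %q-%q", c, hi)
			}
			out = append(out, byteRange{c, hi})
			i += 2
			continue
		}
		out = append(out, byteRange{c, c})
	}
	return out, nil
}

// contains reports whether b falls inside any of the ranges.
func contains(rs []byteRange, b byte) bool {
	for _, r := range rs {
		if r.lo <= b && b <= r.hi {
			return true
		}
	}
	return false
}

func main() {
	rs, err := classBodyRanges(`\w\.\-`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range rs {
		fmt.Printf("%q-%q\n", r.lo, r.hi)
	}
}
```

The sketch ignores negation (`[^...]`), ranges whose upper bound is itself escaped, and non-ASCII input, all of which a real implementation would need to handle.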

Do you have other feature requests for the regexp language? I have mostly followed the principle of implementing the portions people ask for.

SteelPhase commented 4 years ago

This is purely a nice-to-have, as it's easy enough to just do it myself. The only other one I've run into is the need to strip non-capturing group syntax from existing regexes. That is still simple to work around by stripping the ?: at the start of a group.
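A minimal sketch of that workaround in Go, assuming the pattern is preprocessed before being handed to the lexer. `stripNonCapturing` is a hypothetical helper, not part of lexmachine. It scans rather than doing a blind string replace so that an escaped `\(` or a `(?:` inside a bracket class is left alone.

```go
package main

import (
	"fmt"
	"strings"
)

// stripNonCapturing rewrites every non-capturing group opener "(?:" to a
// plain "(". Escaped characters and bracket classes are copied unchanged.
// Hypothetical helper; it does not handle a literal ']' placed first in a
// class (as in "[]a]").
func stripNonCapturing(pattern string) string {
	var b strings.Builder
	inClass := false
	for i := 0; i < len(pattern); i++ {
		c := pattern[i]
		switch {
		case c == '\\' && i+1 < len(pattern):
			// Copy the escape and the character it protects.
			b.WriteByte(c)
			i++
			b.WriteByte(pattern[i])
		case inClass:
			if c == ']' {
				inClass = false
			}
			b.WriteByte(c)
		case c == '[':
			inClass = true
			b.WriteByte(c)
		case c == '(' && strings.HasPrefix(pattern[i:], "(?:"):
			b.WriteByte('(')
			i += 2 // skip the "?:"
		default:
			b.WriteByte(c)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(stripNonCapturing(`(?:ab|cd)+[0-9]`))
}
```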

timtadh commented 4 years ago

OK. I am less likely to add handling for ignoring ?:, since capture groups are not something lexmachine is likely to support. That sort of logic is probably better implemented in a different way, perhaps by having multiple tokens.

Adding support for using the built-in character classes inside a [] character class seems like a good idea.

SteelPhase commented 4 years ago

Thanks for taking a look into this.