qntm / greenery

Regular expression manipulation library
http://qntm.org/greenery
MIT License

Parse escaped characters `\x??`, `\u????` and `\U????????` #100

Open mristin opened 1 year ago

mristin commented 1 year ago

It seems that greenery does not support escaped characters:

import greenery
greenery.parse(
    '^[\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\U00010000-\\U0010FFFF]*$'
)

... throws

greenery.parse.NoMatch: Could not parse '^[\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\U00010000-\\U0010FFFF]*$' beyond index 1

However, Python's re module works with the escapes:

import re
re.compile(
    '^[\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\U00010000-\\U0010FFFF]*$'
)
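
For illustration, here is a minimal standard-library check that re really does decode these escapes itself. The names `PATTERN` and `is_allowed` are made up for this sketch and do not come from either library.

```python
import re

# Same double-escaped pattern as above: the string holds literal backslashes,
# and the re module decodes \xHH, \uHHHH and \UHHHHHHHH on its own.
PATTERN = '^[\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\U00010000-\\U0010FFFF]*$'
compiled = re.compile(PATTERN)


def is_allowed(text: str) -> bool:
    """Return True if every character of text falls in the allowed ranges."""
    return compiled.fullmatch(text) is not None


print(is_allowed("tab\tand newline\n"))  # tab and newline are in the class
print(is_allowed("\x00"))                # NUL is outside every listed range
```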

I would expect greenery to match the built-in re module in this regard. Or is this behavior by design?

qntm commented 1 year ago

The parser is intentionally very simple (note also the mostly useless parsing error it emits) because parsing all possible regular expression syntax wasn't really the core problem greenery was designed to solve. One workaround is to stop double-escaping those characters, so that Python decodes the escapes in the string literal and greenery receives the actual characters:

import greenery
greenery.parse(
    '^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]*$'
)
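
To make the difference between the two literals concrete, this small sketch compares them using plain Python only. The variable names are illustrative.

```python
# Double-escaped: the string contains backslash characters followed by
# letters and hex digits, which a regex parser must decode on its own.
double_escaped = '^[\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\U00010000-\\U0010FFFF]*$'

# Single-escaped: Python decodes the escapes while reading the literal,
# so the string already contains the tab, newline, U+D7FF and so on.
single_escaped = '^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]*$'

print("\\" in double_escaped)   # True: backslashes are part of the string
print("\\" in single_escaped)   # False: no backslashes survive
print("\t" in single_escaped)   # True: \x09 became a real tab character
```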

I will consider enhancing the parser to be able to handle double-escaped characters.
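
Until then, a caller who only has the double-escaped form (for example, a pattern read from a file) could decode the escapes before handing the string to greenery. The sketch below is one possible approach. `decode_escapes` is a hypothetical helper, not part of greenery. It assumes that none of the decoded characters has a special meaning in a regex: `\x5D` would become a bare `]`, for instance, and change the pattern.

```python
import re

# Matches a backslash followed by a hex escape, or by any other single
# character. Consuming "any other character" keeps an escaped backslash
# (two backslashes in a row) from being mistaken for the start of \xHH.
_ESCAPE = re.compile(
    r"\\(?:x([0-9A-Fa-f]{2})|u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|(.))",
    re.DOTALL,
)


def decode_escapes(pattern: str) -> str:
    """Replace \\xHH, \\uHHHH and \\UHHHHHHHH with the characters they name."""

    def replace(match: re.Match) -> str:
        hex_digits = match.group(1) or match.group(2) or match.group(3)
        if hex_digits is None:
            # Some other escape such as \d or an escaped backslash: keep it.
            return match.group(0)
        return chr(int(hex_digits, 16))

    return _ESCAPE.sub(replace, pattern)


raw = '^[\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\U00010000-\\U0010FFFF]*$'
decoded = decode_escapes(raw)
# Assumption: the decoded string would then be passed to greenery exactly
# as in the workaround above, e.g. greenery.parse(decoded).
```

The decoded string is equal to the single-escaped literal from the workaround, so this only automates what Python's own string literal parsing does there.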