qntm / greenery

Regular expression manipulation library
http://qntm.org/greenery
MIT License

Accept unescaped caret in character ranges #102

Open mristin opened 10 months ago

mristin commented 10 months ago

(I am not quite sure what part of the regular expression is problematic for greenery, so please change the title accordingly.)

I can compile the following pattern with re:

import re
re.compile(
    '^([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+/([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+([ \t]*;[ \t]*([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+=(([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+|"(([\t !#-\\[\\]-~]|[\\x80-\\xff])|\\\\([\t !-~]|[\\x80-\\xff]))*"))*$'
)

... but greenery fails:

import greenery

greenery.parse(
    '^([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+/([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+([ \t]*;[ \t]*([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+=(([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+|"(([\t !#-\\[\\]-~]|[\\x80-\\xff])|\\\\([\t !-~]|[\\x80-\\xff]))*"))*$'
)

with the exception:

greenery.parse.NoMatch: Could not parse '^([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+/([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+([ \t]*;[ \t]*([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+=(([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+|"(([\t !#-\\[\\]-~]|[\\x80-\\xff])|\\\\([\t !-~]|[\\x80-\\xff]))*"))*$' beyond index 1

mristin commented 10 months ago

This might be related to #100, though the error message is a bit confusing here (index 1 is the opening parenthesis, I suppose).

mristin commented 10 months ago

When I unescape the special characters (to work around #100), I still get an exception:

greenery.parse(
    '^([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+/([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+([ \t]*;[ \t]*([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+=(([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+|"(([\t !#-\\[\\]-~]|[\x80-ÿ])|\\\\([\t !-~]|[\x80-ÿ]))*"))*$'
)

The exception:

greenery.parse.NoMatch: Could not parse '^([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+/([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+([ \t]*;[ \t]*([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+=(([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+|"(([\t !#-\\[\\]-~]|[\x80-ÿ])|\\\\([\t !-~]|[\x80-ÿ]))*"))*$' beyond index 1

(Note that the \x80 and \xff characters are no longer escaped in this pattern.)

The re module still compiles it without complaint:

import re
re.compile(
     '^([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+/([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+([ \t]*;[ \t]*([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+=(([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+|"(([\t !#-\\[\\]-~]|[\x80-ÿ])|\\\\([\t !-~]|[\x80-ÿ]))*"))*$'
)

mristin commented 10 months ago

I narrowed this down to:

greenery.parse("[!#$%&'*+\\-.^]")

which fails. Escaping ^ fixes the issue:

greenery.parse("[!#$%&'*+\\-.\\^]")

This is probably by design, if I understood the readme correctly?
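
As a sanity check on the workaround, here is a minimal sketch using only the standard re module (greenery is not involved). It suggests that, as far as re is concerned, the escaped and unescaped character classes accept exactly the same ASCII characters, so adding the backslash should not change what the pattern means:

```python
import re

# The two character classes from the narrowed-down example above.
unescaped = re.compile("[!#$%&'*+\\-.^]")
escaped = re.compile("[!#$%&'*+\\-.\\^]")

# Collect every ASCII character that each class accepts.
ascii_chars = [chr(c) for c in range(128)]
accepted_unescaped = [c for c in ascii_chars if unescaped.fullmatch(c)]
accepted_escaped = [c for c in ascii_chars if escaped.fullmatch(c)]

# Both classes accept the same set, including the literal caret.
assert accepted_unescaped == accepted_escaped
print("".join(accepted_escaped))  # prints: !#$%&'*+-.^
```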

qntm commented 10 months ago

Correct, the parser is intentionally very simple, and if you want a literal caret in a character class you need to backslash-escape it. There are a lot of sophisticated bits of syntax for character classes, like [^-] and [^^] and []], which are technically unambiguous but in practice (1) are confusing to read, in my view, and (2) are a total headache to implement when parsing. I will consider enhancing the parser to handle this, but for now the workaround is backslashes.
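
For long patterns, the backslash workaround could be applied mechanically before handing the string to greenery. The following is only a rough sketch: escape_class_carets is a hypothetical helper, not part of greenery or re. It assumes re-style syntax and deliberately does not handle a literal ] placed first in a class (as in []]). It leaves a leading negation caret alone and escapes every other unescaped caret inside a character class:

```python
import re


def escape_class_carets(pattern: str) -> str:
    """Backslash-escape literal carets inside character classes.

    Hypothetical helper. Assumes re-style syntax and does not
    support a literal ']' as the first character of a class.
    """
    out = []
    in_class = False
    class_start = False  # True only for the character right after '['
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Copy an existing escape pair (e.g. \^ or \-) unchanged.
            out.append(pattern[i:i + 2])
            i += 2
            class_start = False
            continue
        if not in_class:
            if ch == "[":
                in_class = True
                class_start = True
            out.append(ch)
            i += 1
            continue
        if ch == "^" and not class_start:
            out.append("\\^")  # literal caret: add the backslash
        else:
            if ch == "]":
                in_class = False
            out.append(ch)  # includes a leading negation caret
        class_start = False
        i += 1
    return "".join(out)


fixed = escape_class_carets("[!#$%&'*+\\-.^]")
assert fixed == "[!#$%&'*+\\-.\\^]"
re.compile(fixed)  # the rewritten pattern is still valid for re
```

The rewritten string is the escaped form shown earlier in this thread, so it is intended to be passed to greenery.parse in place of the original. Anchors outside a class (a leading ^) and negated classes such as [^a] are left untouched.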

MegaIng commented 1 month ago

Note that my project interegular tries quite a bit harder to match the stdlib re syntax, and I am currently reworking it to use greenery.fsm in the background, so that might be a better fit for your use case.