mrabarnett / mrab-regex

Ability to query minimum/maximum length of regex #513

Open MegaIng opened 11 months ago

MegaIng commented 11 months ago

For the lark parsing library we use the (sadly private) stdlib re._parser library to query the minimum and maximum length of a regex:

https://github.com/lark-parser/lark/blob/942366b49247e996e387cb901ed96c7d861382a0/lark/utils.py#L132-L156

As can be seen from the snippet, since we also support using regex instead of re, we need to take special care when encountering regex-specific syntax, like nested sets or category patterns. The only value that needs to be correct is whether the minimum length is 0 or greater than 0, since we depend on regular expressions being non-empty in a few places.
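For readers who don't want to follow the link, the approach is roughly the following. This is a minimal sketch and not the Lark code itself: the helper names `regexp_width` and `can_match_empty` are made up for illustration, and it relies on the private `re._parser` module (`sre_parse` on older Pythons), whose behaviour is not guaranteed across versions. The second helper is only a heuristic for the 0 vs. 1+ distinction when the parser can't handle a pattern.

```python
import re

try:
    import re._parser as sre_parse  # Python 3.11+: private module
except ImportError:
    import sre_parse  # older Pythons: same parser under its old name


def regexp_width(pattern):
    """Return (min_width, max_width) as computed by the stdlib parser.

    Hypothetical helper. The values are converted to plain ints because
    the parser may return special int subclasses for "unbounded".
    """
    lo, hi = sre_parse.parse(pattern).getwidth()
    return int(lo), int(hi)


def can_match_empty(pattern):
    """Heuristic fallback: does the pattern match the empty string?

    Hypothetical helper. It only answers the 0 vs. 1+ question and can be
    wrong for patterns whose zero-width match depends on context.
    """
    return re.compile(pattern).match("") is not None


print(regexp_width("abc"))    # (3, 3)
print(regexp_width("a?b"))    # (1, 2)
print(can_match_empty("a*"))  # True
print(can_match_empty("a+"))  # False
```

The upper bound reported for unbounded repeats such as `a+` is a large sentinel value that differs between Python versions, so it is best treated as "no limit" rather than compared against a specific number.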

It would be nice if there was a way to query the minimum and maximum match size from a compiled regex object. The stdlib re module is lower priority since there is at least a way to access this information reliably, but I am probably also going to make a request there.

mrabarnett commented 11 months ago

Do you expect the minimum and maximum to be accurate? That would be difficult if there were references to capture groups or calls.

MegaIng commented 11 months ago

Accurate in the sense that all possible matches will fall into this range, even if no match with that length actually exists, yes. And preferably, of course, the 0/1+ distinction of the minimum should be fully accurate.

For our use case it would actually be fully fine if this is only correctly supported for purely regular syntax, i.e. no backreferences or non-regular extensions like nested calls.
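To illustrate what "accurate" means here, a small sketch using the stdlib parser (again via the private `re._parser` module, so treat it as an assumption about current CPython behaviour): the reported range covers every possible match, but not every length inside the range is achievable.

```python
import re

try:
    import re._parser as sre_parse  # Python 3.11+: private module
except ImportError:
    import sre_parse  # older Pythons

pattern = "a{2}|b{5}"

# Bounds reported by the parser: min over the branches, max over the branches.
lo, hi = (int(x) for x in sre_parse.parse(pattern).getwidth())

# Lengths that actually occur, found by brute force over runs of "a" or "b".
achievable = sorted(
    {n for n in range(hi + 2) for ch in "ab" if re.fullmatch(pattern, ch * n)}
)

print(lo, hi)      # 2 5
print(achievable)  # [2, 5]
```

Lengths 3 and 4 lie inside the reported range but never occur, which is fine for this use case as long as no match falls outside it.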

mrabarnett commented 11 months ago

The minimum length is already available internally, except that it doesn't include references or calls (it assumes that they have zero length). The maximum length is more of a problem...