Open MegaIng opened 11 months ago
Do you expect the minimum and maximum to be accurate? That would be difficult if there were references to capture groups or calls.
Accurate in the sense that all possible matches will fall into this range, even if no match with that length actually exists, yes. And preferably, of course, the 0/1+ distinction for the minimum should be fully accurate.
For our use case it would actually be perfectly fine if this were only correctly supported for purely regular syntax, i.e. no backreferences or non-regular extensions like nested calls.
The minimum length is already available internally, except that it doesn't include references or calls (it assumes that they have zero length). The maximum length is more of a problem...
For the lark parsing library we use the (sadly private) stdlib `re._parser` module to query the minimum and maximum length of a regex: https://github.com/lark-parser/lark/blob/942366b49247e996e387cb901ed96c7d861382a0/lark/utils.py#L132-L156

As can be seen from the snippet, since we also support using `regex` instead of `re`, we need to take special care when encountering `regex`-specific syntax, such as nested sets or category patterns. The only thing that needs to be correct is whether the minimum length is 0 or greater than 0, since we depend on regular expressions being non-empty in a few places.

It would be nice if there were a way to query the minimum and maximum match size from a compiled regex object. The stdlib `re` module is lower priority, since there is at least a way to access this information reliably there, but I am probably also going to make a request there.
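
For reference, here is a minimal sketch of the stdlib approach described above. It is not the lark code itself; the helper name `regex_width` is made up for illustration, and it relies on the private `re._parser` module (`sre_parse` on older Pythons), so it may break between Python versions. It only understands stdlib `re` syntax, not `regex`-specific extensions.

```python
import re

try:
    # Python 3.11+ exposes the parser as the private module re._parser
    _parser = re._parser
except AttributeError:
    # Older Pythons ship the same parser as the sre_parse module
    import sre_parse as _parser


def regex_width(pattern, flags=0):
    """Return (min_len, max_len) bounds on the length of any match.

    Hypothetical helper: the bounds come from the parsed pattern's
    getwidth(), so every possible match falls inside the range, but a
    match of each length in the range need not exist.
    """
    lo, hi = _parser.parse(pattern, flags).getwidth()
    # Convert to plain ints; the upper bound may be a special int subclass
    return int(lo), int(hi)


print(regex_width(r"ab?c"))  # (2, 3)

# The 0 vs. 1+ distinction on the minimum is the part that matters to us
may_match_empty = regex_width(r"a*")[0] == 0
```

For unbounded patterns such as `a*` the upper bound is a large sentinel value rather than a meaningful length, so only the lower bound should be relied on there.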