python / cpython

The Python programming language
https://www.python.org

Ability to query minimum and maximum length of regular expression #112386

Open MegaIng opened 11 months ago

MegaIng commented 11 months ago

Feature or enhancement

Proposal:

For the lark parsing library we currently use the private re._parser module, as noticed when reorganizing the relevant libraries in #91308. The only information we need is the minimum and maximum width of a match a pattern can have.

My suggestion is to add attributes or properties to the Pattern class, for example named min_width and max_width. max_width could be either None or MAXREPEAT (the constant from re._constants/_sre) when the pattern can match an (essentially) unlimited amount of text.

pattern = re.compile(r"abc?d?e")
assert (pattern.min_width, pattern.max_width) == (3, 5)

pattern = re.compile(r"(a*b+){2,5}")
assert (pattern.min_width, pattern.max_width) == (2, None)
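
For reference, both values can already be read from the private parser today, which is roughly the kind of workaround described above. The following is a minimal sketch, not lark's actual code: the helper name get_width is hypothetical, and it relies on the private re._parser module (sre_parse before Python 3.11) and its SubPattern.getwidth() method, neither of which is a stable API. Treating any upper bound at or above MAXREPEAT as "unbounded" is an assumption of this sketch.

```python
try:
    # Python 3.11+: the parser lives in the private re package
    import re._parser as sre_parse
    from re._constants import MAXREPEAT
except ImportError:
    # Older versions expose the same code as top-level modules
    import sre_parse
    from sre_constants import MAXREPEAT


def get_width(pattern):
    """Return (min_width, max_width) for a pattern string.

    max_width is None when the parser reports an upper bound at or
    above MAXREPEAT, which this sketch treats as unbounded.
    """
    lo, hi = sre_parse.parse(pattern).getwidth()
    return lo, (None if hi >= MAXREPEAT else hi)


print(get_width(r"abc?d?e"))
print(get_width(r"(a*b+){2,5}"))
```

With this helper, get_width(r"abc?d?e") gives (3, 5), the same values the proposed attributes would report for the first example.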

As an alternative, the re._* modules could be made a public and stable API, although this doesn't appear to be a well-liked option from my reading of the above-linked PR. I would like this, primarily for implementing custom regex analyzers (there are a few such users of the re._parser module out there), but I think this would have to be a PEP.

Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere

Links to previous discussion of this feature:

I don't think this is a major enough feature to require widespread discussion. I requested a similar feature in the third-party regex library. Preferably, of course, both would have the same interface.

TheCob11 commented 10 months ago

Shouldn't the bounds for the second example be (2, None) or (2, MAXREPEAT), since bb would match?

MegaIng commented 10 months ago

Yes, I don't know what my thought process there was.