qntm / greenery

Regular expression manipulation library
http://qntm.org/greenery
MIT License
311 stars 40 forks source link

Intersection of a pattern with another one restricting the length not possible #103

Closed mristin closed 10 months ago

mristin commented 10 months ago

The following two patterns merge forever, i.e., the statements do not compute on my computer in a reasonable time:

import greenery

r1 = greenery.parse(
    "^file:(/(((%[0-9A-Fa-f][0-9A-Fa-f]|[!$&-,;=]|[:@]|[\-.0-9A-Z_a-z~]))+(/((%[0-9A-Fa-f][0-9A-Fa-f]|[!$&-,;=]|[:@]|[\-.0-9A-Z_a-z~]))*)*)?|//((((%[0-9A-Fa-f][0-9A-Fa-f]|[!$&-,;=]|[\-.0-9A-Z_a-z~])*|(1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)\.(1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)\.(1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)\.(1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)|\[(((([0-9A-Fa-f]{1,4}:)?[0-9A-Fa-f]{1,4})?::([0-9A-Fa-f]{1,4}:){3}((1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)\.(1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)\.(1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)\.(1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)|[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4})|(([0-9A-Fa-f]{1,4}:){2}[0-9A-Fa-f]{1,4})?::([0-9A-Fa-f]{1,4}:){2}((1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)\.(1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)\.(1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)\.(1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)|[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4})|(([0-9A-Fa-f]{1,4}:){3}[0-9A-Fa-f]{1,4})?::[0-9A-Fa-f]{1,4}:((1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)\.(1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)\.(1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)\.(1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)|[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4})|(([0-9A-Fa-f]{1,4}:){4}[0-9A-Fa-f]{1,4})?::((1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)\.(1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)\.(1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)\.(1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)|[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4})|(([0-9A-Fa-f]{1,4}:){5}[0-9A-Fa-f]{1,4})?::[0-9A-Fa-f]{1,4}|(([0-9A-Fa-f]{1,4}:){6}[0-9A-Fa-f]{1,4})?::|([0-9A-Fa-f]{1,4})?::([0-9A-Fa-f]{1,4}:){4}((1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)\.(1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)\.(1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)\.(1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)|[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4})|([0-9A-Fa-f]{1,4}:){6}((1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)\.(1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)\.(1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)\.(1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)|[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4})|::([0-9A-Fa-f]{1,4}:){5}((1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)\.(1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)\.(1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)\.(1\d{2}|25[0-5]|2[0-4]\d|[1-9]\d|\d)|[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}))|[Vv][0-9A-Fa-f]+\.(:|[!$&-,;=]|[\-.0-9A-Z_a-z~])+)\])|localhost))?/(((%[0-9A-Fa-f][0-9A-Fa-f]|[!$&-,;=]|[:@]|[\-.0-9A-Z_a-z~]))+(/((%[0-9A-Fa-f][0-9A-Fa-f]|[!$&-,;=]|[:@]|[\-.0-9A-Z_a-z~]))*)*)?)$"
)
r2 = greenery.parse("^.{1,2000}$")
print(r1 & r2)

For some context: I need such merges so that I can supply the regular expression to hypothesis.strategies.from_regexes. Since it does not accept length parameter, this was the only way I came up with to restrict the length of the generated string.

I can imagine that this would need some kind of a hack or special case in the code -- not nice, but I believe that it is worth it.

qntm commented 10 months ago

I think it's improbable that something like this is feasible. To represent r2 alone requires a finite state machine with 2000 states which takes a long time currently although it's possible performance here could be improved. r1 meanwhile is perhaps a smaller FSM in terms of number of states, but significantly more complicated. Computing the intersection of the two, though, is likely to be essentially impossible due to the way they interact. Can you imagine what that FSM would look like? I think it would have a number of states comparable with the product of the number of states in each of the original two FSMs.

As a workaround you might consider generating strings from hypothesis.strategies.from_regex using r1 alone, and filtering out the generated strings over 2000 characters long?

mristin commented 10 months ago

As a workaround you might consider generating strings from hypothesis.strategies.from_regex using r1 alone, and filtering out the generated strings over 2000 characters long?

@qntm Thanks, that's exactly what I am doing.

I'll close the issue as this is beyond feasibility even in theory; I was simply not sure.