Open mishok13 opened 1 year ago
Hi Andrii
I am interested in working on a more structured solution to this issue, unless it's unlikely to be merged. :)
It would be great if you do so. I am willing to help.
will not report any issues if the string produced by handle_constrained_string doesn't match the regex
I've encountered a similar problem recently (see #118) so we definitely need to adjust tests or add more tests that would do full "end2end" testing making sure that generated values do not raise ValidationError
s if supplied to constrained fields.
Right on, I'll make sure to get some time to work on this in the coming days, will get back to you once I have a clearer idea of a solution here.
@Goldziher I want to take care of it soon and it seems that we need to change this part
We have to pass min_length
and max_length
to RegexFactory.__call__
in order to be able to generate proper strings for regexes that contain repetitions and patterns like ^..$
. So my idea is to completely get rid of the highlighted part in the handle_constrained_string
and instead alter limit
for repetitions parts iteratively inside.
WDYT?
sure
Is this still not fixed?
Is this still not fixed?
Unfortunately, no.
Well, that's a shame indeed. I'll have to see if I can work on making a PR in the coming weeks, because this project is really useful and it'd be a shame to leave it in a broken state :)
Well, that's a shame indeed. I'll have to see if I can work on making a PR in the coming weeks, because this project is really useful and it'd be a shame to leave it in a broken state :)
That would be awesome if you could raise a PR!
Starting work on the PR for this issue. Does anyone have any extra guidance before I begin?
Wouldn't wrapping the regex into a non-capturing group work? Say your regex is pattern
, you could feed this string to the regex factory: (?:pattern){min,max}
Nope, doesn't work. Didn't know what I was thinking to be honest. I'm working on the solution proposed above (iteratively updating self._limit
inside RegexFactory
)
@nocturn9x sorry for the delayed response. Unfortunately, I don't have any suggestions. I'm probably not the best person when it comes to regex. As long as you have a solution with corresponding tests that pass, I think we'll be good to go :)
The thing I realized is that this is a much more complicated issue than it seems. On a theoretical level, a regex engine is a Deterministic Finite Automaton (or a Nondeterministic one, if it implements PCRE-like features). A regex is basically a recipe for building an automaton that can accept or reject a given input string (it defines a regular language). Now, if we want to add length constraints to it, we need to have another DFA (or NFA) that does that: mathematically, the construction is rather simple (it's just the Cartesian product of the two), and it is known that the intersection of two DFAs (or NFAs) always exists; In practice this means we'd need an automaton that has as many states as the one built for the regex, and the input string would need to be accepted by both to be valid. That is a rather non-trivial problem to solve, especially if performance is a concern
Also, another thing: we have to remember that it may not always be possible to generate a string that can match a regex and stay within some defined length constraints. For example, if we had something as simple as (a|b|c)[0-9]+
and we wanted our generated string to be of length 1, it would be impossible to satisfy both conditions (because the pattern itself mandates at least 2 symbols). This is something that can likely be detected ahead of time and accounted for by raising an exception
@nocturn9x any updates on this?
@nocturn9x any updates on this?
Heh, currently no. It's a much more complicated issue than it seems at first
I think logically there is no way to consistently solve this problem.
Regex pattern complexity is potentially unbounded, and there is no way to know where to restrict the length of an output string without carefully parsing the regex, which can yield very complex results (imagine many nested groups and |
s), which can even conflict! Nothing really stops a user from using a pattern that violates the min_length
or max_length
restrictions, and if a system to verify these restrictions were to be added, it would fall upon Pydantic (in this case) to implement it.
I think Polyfactory shouldn't try to solve such a hard problem. This SO question asks for how to calculate the maximum and minimum length of strings that would match a regex (with no practical or easy solution at the time of writing this) which could be used to simply compare the minimum and maximum with any min_length
and max_length
to fail early if there is a mismatch, but again, I don't think it's Polyfactory's work to do this, and parsing the pattern would be rather expensive.
IMO, when there is a pattern defined, Polyfactory should just respect it, ignoring any other restriction that may conflict (like min_length
or max_length
). Maybe a warning can be fired if regex restrictions are found together with potentially conflicting others. This should be documented in any case.
@berzi I agree with your assessment. This is a highly non-trivial problem to solve, and even a hypothetical solution would probably be quite computationally expensive
@berzi, @nocturn9x thanks for your input! I agree with both of your assessment as well. I was thinking of raising an error instead of a warning. My reasoning is that if the produced string doesn't conform to the constraints, then it will result in a validation error by whichever library they're using for validation. At that point, the users may think that this is a bug from the side of polyfactory
.
I think logically there is no way to consistently solve this problem.
Regex pattern complexity is potentially unbounded, and there is no way to know where to restrict the length of an output string without carefully parsing the regex, which can yield very complex results (imagine many nested groups and
|
s), which can even conflict! Nothing really stops a user from using a pattern that violates themin_length
ormax_length
restrictions, and if a system to verify these restrictions were to be added, it would fall upon Pydantic (in this case) to implement it.I think Polyfactory shouldn't try to solve such a hard problem. This SO question asks for how to calculate the maximum and minimum length of strings that would match a regex (with no practical or easy solution at the time of writing this) which could be used to simply compare the minimum and maximum with any
min_length
andmax_length
to fail early if there is a mismatch, but again, I don't think it's Polyfactory's work to do this, and parsing the pattern would be rather expensive.IMO, when there is a pattern defined, Polyfactory should just respect it, ignoring any other restriction that may conflict (like
min_length
ormax_length
). Maybe a warning can be fired if regex restrictions are found together with potentially conflicting others. This should be documented in any case.
Actually, reading the linked SO answer it doesn't seem like it would be too hard to implement something that can at least tell what the minimum and maximum bounds for a regex are, and error out if they're not within some user-defined conditions. Turn the regex into an NFA, remove reject states, then use Dijkstra (or BF) to find out the shortest path to an accepting state (the resulting path length will be the minimum length that such automaton will accept). For the maximum length, use DFS or a similar algorithm to detect if the graph of the automaton has cycles: if it does, the max length of a match is infinity, if it doesn't then find the longest path that ends in an accept state (probably by brute force, as any algorithms to do so are probably slow anyway). This would at least make sure that the library can detect an invalid combination of pattern/bounds, although it doesn't strictly speaking solve the generation part of the problem. Still, IMHO, better than nothing
I think logically there is no way to consistently solve this problem. Regex pattern complexity is potentially unbounded, and there is no way to know where to restrict the length of an output string without carefully parsing the regex, which can yield very complex results (imagine many nested groups and
|
s), which can even conflict! Nothing really stops a user from using a pattern that violates themin_length
ormax_length
restrictions, and if a system to verify these restrictions were to be added, it would fall upon Pydantic (in this case) to implement it. I think Polyfactory shouldn't try to solve such a hard problem. This SO question asks for how to calculate the maximum and minimum length of strings that would match a regex (with no practical or easy solution at the time of writing this) which could be used to simply compare the minimum and maximum with anymin_length
andmax_length
to fail early if there is a mismatch, but again, I don't think it's Polyfactory's work to do this, and parsing the pattern would be rather expensive. IMO, when there is a pattern defined, Polyfactory should just respect it, ignoring any other restriction that may conflict (likemin_length
ormax_length
). Maybe a warning can be fired if regex restrictions are found together with potentially conflicting others. This should be documented in any case.Actually, reading the linked SO answer it doesn't seem like it would be too hard to implement something that can at least tell what the minimum and maximum bounds for a regex are, and error out if they're not within some user-defined conditions. Turn the regex into an NFA, remove reject states, then use Dijkstra (or BF) to find out the shortest path to an accepting state (the resulting path length will be the minimum length that such automaton will accept). For the maximum length, use DFS or a similar algorithm to detect if the graph of the automaton has cycles: if it does, the max length of a match is infinity, if it doesn't then find the longest path that ends in an accept state (probably by brute force, as any algorithms to do so are probably slow anyway). This would at least make sure that the library can detect an invalid combination of pattern/bounds, although it doesn't strictly speaking solve the generation part of the problem. Still, IMHO, better than nothing
That sounds like a significant performance cost (consider that users may have many fields with arbitrarily large regexes) for not very much, especially if the best goal that can be achieved is guarding for errors: the alternative of simply detecting the presence of quasi-incompatible fields is computationally negligible and achieves the same.
In practice, any user that writes a regex either can encode the maximum length in the regex itself or automatically receives a length requirement. A pattern like abc[0-9]
will only ever accept a length of exactly 4; a pattern like the OP's ^a+b$
could (or should, even) just as easily replace the +
with a length rule. I can't think of any real use case where regex and min/max lengths have a reason to exist together, but allowing this to be done is, as you yourself describe, rather cumbersome and imperfect, even if it's technically possible.
I think logically there is no way to consistently solve this problem. Regex pattern complexity is potentially unbounded, and there is no way to know where to restrict the length of an output string without carefully parsing the regex, which can yield very complex results (imagine many nested groups and
|
s), which can even conflict! Nothing really stops a user from using a pattern that violates themin_length
ormax_length
restrictions, and if a system to verify these restrictions were to be added, it would fall upon Pydantic (in this case) to implement it. I think Polyfactory shouldn't try to solve such a hard problem. This SO question asks for how to calculate the maximum and minimum length of strings that would match a regex (with no practical or easy solution at the time of writing this) which could be used to simply compare the minimum and maximum with anymin_length
andmax_length
to fail early if there is a mismatch, but again, I don't think it's Polyfactory's work to do this, and parsing the pattern would be rather expensive. IMO, when there is a pattern defined, Polyfactory should just respect it, ignoring any other restriction that may conflict (likemin_length
ormax_length
). Maybe a warning can be fired if regex restrictions are found together with potentially conflicting others. This should be documented in any case.Actually, reading the linked SO answer it doesn't seem like it would be too hard to implement something that can at least tell what the minimum and maximum bounds for a regex are, and error out if they're not within some user-defined conditions. Turn the regex into an NFA, remove reject states, then use Dijkstra (or BF) to find out the shortest path to an accepting state (the resulting path length will be the minimum length that such automaton will accept). For the maximum length, use DFS or a similar algorithm to detect if the graph of the automaton has cycles: if it does, the max length of a match is infinity, if it doesn't then find the longest path that ends in an accept state (probably by brute force, as any algorithms to do so are probably slow anyway). This would at least make sure that the library can detect an invalid combination of pattern/bounds, although it doesn't strictly speaking solve the generation part of the problem. Still, IMHO, better than nothing
That sounds like a significant performance cost (consider that users may have many fields with arbitrarily large regexes) for not very much, especially if the best goal that can be achieved is guarding for errors: the alternative of simply detecting the presence of quasi-incompatible fields is computationally negligible and achieves the same.
In practice, any user that writes a regex either can encode the maximum length in the regex itself or automatically receives a length requirement. A pattern like
abc[0-9]
will only ever accept a length of exactly 4; a pattern like the OP's^a+b$
could (or should, even) just as easily replace the+
with a length rule. I can't think of any real use case where regex and min/max lengths have a reason to exist together, but allowing this to be done is, as you yourself describe, rather cumbersome and imperfect, even if it's technically possible.
Heh, wish I could tell that to a client of ours that is hellbent on enforcing both. Alas, the current solution I implemented is the good ol' try until you succeed method (also known as generating the string in a while loop until Pydantic is happy, which is utterly horrible, but it works :')). For what it's worth though, I agree: mixing regular expressions and length constraints constitutes a misuse of the tool
I think logically there is no way to consistently solve this problem. Regex pattern complexity is potentially unbounded, and there is no way to know where to restrict the length of an output string without carefully parsing the regex, which can yield very complex results (imagine many nested groups and
|
s), which can even conflict! Nothing really stops a user from using a pattern that violates themin_length
ormax_length
restrictions, and if a system to verify these restrictions were to be added, it would fall upon Pydantic (in this case) to implement it. I think Polyfactory shouldn't try to solve such a hard problem. This SO question asks for how to calculate the maximum and minimum length of strings that would match a regex (with no practical or easy solution at the time of writing this) which could be used to simply compare the minimum and maximum with anymin_length
andmax_length
to fail early if there is a mismatch, but again, I don't think it's Polyfactory's work to do this, and parsing the pattern would be rather expensive. IMO, when there is a pattern defined, Polyfactory should just respect it, ignoring any other restriction that may conflict (likemin_length
ormax_length
). Maybe a warning can be fired if regex restrictions are found together with potentially conflicting others. This should be documented in any case.Actually, reading the linked SO answer it doesn't seem like it would be too hard to implement something that can at least tell what the minimum and maximum bounds for a regex are, and error out if they're not within some user-defined conditions. Turn the regex into an NFA, remove reject states, then use Dijkstra (or BF) to find out the shortest path to an accepting state (the resulting path length will be the minimum length that such automaton will accept). For the maximum length, use DFS or a similar algorithm to detect if the graph of the automaton has cycles: if it does, the max length of a match is infinity, if it doesn't then find the longest path that ends in an accept state (probably by brute force, as any algorithms to do so are probably slow anyway). This would at least make sure that the library can detect an invalid combination of pattern/bounds, although it doesn't strictly speaking solve the generation part of the problem. Still, IMHO, better than nothing
That sounds like a significant performance cost (consider that users may have many fields with arbitrarily large regexes) for not very much, especially if the best goal that can be achieved is guarding for errors: the alternative of simply detecting the presence of quasi-incompatible fields is computationally negligible and achieves the same. In practice, any user that writes a regex either can encode the maximum length in the regex itself or automatically receives a length requirement. A pattern like
abc[0-9]
will only ever accept a length of exactly 4; a pattern like the OP's^a+b$
could (or should, even) just as easily replace the+
with a length rule. I can't think of any real use case where regex and min/max lengths have a reason to exist together, but allowing this to be done is, as you yourself describe, rather cumbersome and imperfect, even if it's technically possible.Heh, wish I could tell that to a client of ours that is hellbent on enforcing both. Alas, the current solution I implemented is the good ol' try until you succeed method (also known as generating the string in a while loop until Pydantic is happy, which is utterly horrible, but it works :')). For what it's worth though, I agree: mixing regular expressions and length constraints constitutes a misuse of the tool
If you're interested, I have an old (now abandoned) project where I basically made an engine to make regexes generate strings rather than parse them: https://gitlab.com/brzrkr/genex
I abandoned it because it stopped being fun, but it kinda works. Except it explodes if the regex is too complex. But it might serve as inspiration I guess. (Disclaimer: it's old code so it might suck.)
When a
Field
specified bothmax_length
and regex that includes start and end of string tokens and a repeatable pattern can lead to generation of invalid string value that leads toValidationError
. See reproduction:As far as I can tell, this is a result of this piece of code cutting off the end of string after calling
RegexFactory
. I was surprised that the test suite didn't catch this error earlier, but to my surprise the test case that is supposed to verify this behavior will not report any issues if the string produced byhandle_constrained_string
doesn't match the regex. Not sure if that was done on purpose, but addingassert match is not None
leads to this test failing.I can't think of a quick solution to this issue, however I think having a note in the documentation regarding regex fields could be helpful.
I am interested in working on a more structured solution to this issue, unless it's unlikely to be merged. :)