litestar-org / polyfactory

Simple and powerful factories for mock data generation
https://polyfactory.litestar.dev/
MIT License
1.03k stars 81 forks source link

Bug: constrained string regex + min/max length may produce invalid values #124

Open mishok13 opened 1 year ago

mishok13 commented 1 year ago

When a Field specified both max_length and regex that includes start and end of string tokens and a repeatable pattern can lead to generation of invalid string value that leads to ValidationError. See reproduction:

from pydantic import BaseModel, Field
from pydantic_factories import ModelFactory
from pydantic_factories.value_generators.regex import RegexFactory

PATTERN = r'^a+b$'
GOOD_SEED = 0
BAD_SEED = 5

class A(BaseModel):
    a: str = Field(..., regex=pattern, min_length=2, max_length=10)

class AF(ModelFactory[A]):
    __model__=A

AF.seed_random(GOOD_SEED)
print(AF.build()) # a='aaaaaaab'

print(RegexFactory(seed=BAD_SEED)(pattern)) # aaaaaaaaaab
AF.seed_random(BAD_SEED)
print(AF.build()) # this breaks
Traceback (most recent call last):
  File "[redacted]reproduce-bug.py", line 18, in <module>
    print(AF.build()) # This breaks
  File "[redacted]/factory.py", line 724, in build
    return cast("T", cls.__model__(**kwargs))  # pyright: ignore
  File "pydantic/main.py", line 342, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for A
a
  string does not match regex "^a+b$" (type=value_error.str.regex; pattern=^a+b$)

As far as I can tell, this is a result of this piece of code cutting off the end of string after calling RegexFactory. I was surprised that the test suite didn't catch this error earlier, but to my surprise the test case that is supposed to verify this behavior will not report any issues if the string produced by handle_constrained_string doesn't match the regex. Not sure if that was done on purpose, but adding assert match is not None leads to this test failing.

I can't think of a quick solution to this issue, however I think having a note in the documentation regarding regex fields could be helpful.

I am interested in working on a more structured solution to this issue, unless it's unlikely to be merged. :)

Fund with Polar

jtraub commented 1 year ago

Hi Andrii

I am interested in working on a more structured solution to this issue, unless it's unlikely to be merged. :)

It would be great if you do so. I am willing to help.

will not report any issues if the string produced by handle_constrained_string doesn't match the regex

I've encountered a similar problem recently (see #118) so we definitely need to adjust tests or add more tests that would do full "end2end" testing making sure that generated values do not raise ValidationErrors if supplied to constrained fields.

mishok13 commented 1 year ago

Right on, I'll make sure to get some time to work on this in the coming days, will get back to you once I have a clearer idea of a solution here.

jtraub commented 1 year ago

@Goldziher I want to take care of it soon and it seems that we need to change this part

We have to pass min_length and max_length to RegexFactory.__call__ in order to be able to generate proper strings for regexes that contain repetitions and patterns like ^..$. So my idea is to completely get rid of the highlighted part in the handle_constrained_string and instead alter limit for repetitions parts iteratively inside.

WDYT?

Goldziher commented 1 year ago

sure

nocturn9x commented 11 months ago

Is this still not fixed?

guacs commented 11 months ago

Is this still not fixed?

Unfortunately, no.

nocturn9x commented 11 months ago

Well, that's a shame indeed. I'll have to see if I can work on making a PR in the coming weeks, because this project is really useful and it'd be a shame to leave it in a broken state :)

guacs commented 11 months ago

Well, that's a shame indeed. I'll have to see if I can work on making a PR in the coming weeks, because this project is really useful and it'd be a shame to leave it in a broken state :)

That would be awesome if you could raise a PR!

nocturn9x commented 11 months ago

Starting work on the PR for this issue. Does anyone have any extra guidance before I begin?

nocturn9x commented 11 months ago

Wouldn't wrapping the regex into a non-capturing group work? Say your regex is pattern, you could feed this string to the regex factory: (?:pattern){min,max}

Nope, doesn't work. Didn't know what I was thinking to be honest. I'm working on the solution proposed above (iteratively updating self._limit inside RegexFactory)

guacs commented 11 months ago

@nocturn9x sorry for the delayed response. Unfortunately, I don't have any suggestions. I'm probably not the best person when it comes to regex. As long as you have a solution with corresponding tests that pass, I think we'll be good to go :)

nocturn9x commented 11 months ago

The thing I realized is that this is a much more complicated issue than it seems. On a theoretical level, a regex engine is a Deterministic Finite Automaton (or a Nondeterministic one, if it implements PCRE-like features). A regex is basically a recipe for building an automaton that can accept or reject a given input string (it defines a regular language). Now, if we want to add length constraints to it, we need to have another DFA (or NFA) that does that: mathematically, the construction is rather simple (it's just the Cartesian product of the two), and it is known that the intersection of two DFAs (or NFAs) always exists; In practice this means we'd need an automaton that has as many states as the one built for the regex, and the input string would need to be accepted by both to be valid. That is a rather non-trivial problem to solve, especially if performance is a concern

nocturn9x commented 11 months ago

Also, another thing: we have to remember that it may not always be possible to generate a string that can match a regex and stay within some defined length constraints. For example, if we had something as simple as (a|b|c)[0-9]+ and we wanted our generated string to be of length 1, it would be impossible to satisfy both conditions (because the pattern itself mandates at least 2 symbols). This is something that can likely be detected ahead of time and accounted for by raising an exception

guacs commented 10 months ago

@nocturn9x any updates on this?

nocturn9x commented 10 months ago

@nocturn9x any updates on this?

Heh, currently no. It's a much more complicated issue than it seems at first

berzi commented 10 months ago

I think logically there is no way to consistently solve this problem.

Regex pattern complexity is potentially unbounded, and there is no way to know where to restrict the length of an output string without carefully parsing the regex, which can yield very complex results (imagine many nested groups and |s), which can even conflict! Nothing really stops a user from using a pattern that violates the min_length or max_length restrictions, and if a system to verify these restrictions were to be added, it would fall upon Pydantic (in this case) to implement it.

I think Polyfactory shouldn't try to solve such a hard problem. This SO question asks for how to calculate the maximum and minimum length of strings that would match a regex (with no practical or easy solution at the time of writing this) which could be used to simply compare the minimum and maximum with any min_length and max_length to fail early if there is a mismatch, but again, I don't think it's Polyfactory's work to do this, and parsing the pattern would be rather expensive.

IMO, when there is a pattern defined, Polyfactory should just respect it, ignoring any other restriction that may conflict (like min_length or max_length). Maybe a warning can be fired if regex restrictions are found together with potentially conflicting others. This should be documented in any case.

nocturn9x commented 10 months ago

@berzi I agree with your assessment. This is a highly non-trivial problem to solve, and even a hypothetical solution would probably be quite computationally expensive

guacs commented 10 months ago

@berzi, @nocturn9x thanks for your input! I agree with both of your assessment as well. I was thinking of raising an error instead of a warning. My reasoning is that if the produced string doesn't conform to the constraints, then it will result in a validation error by whichever library they're using for validation. At that point, the users may think that this is a bug from the side of polyfactory.

nocturn9x commented 10 months ago

I think logically there is no way to consistently solve this problem.

Regex pattern complexity is potentially unbounded, and there is no way to know where to restrict the length of an output string without carefully parsing the regex, which can yield very complex results (imagine many nested groups and |s), which can even conflict! Nothing really stops a user from using a pattern that violates the min_length or max_length restrictions, and if a system to verify these restrictions were to be added, it would fall upon Pydantic (in this case) to implement it.

I think Polyfactory shouldn't try to solve such a hard problem. This SO question asks for how to calculate the maximum and minimum length of strings that would match a regex (with no practical or easy solution at the time of writing this) which could be used to simply compare the minimum and maximum with any min_length and max_length to fail early if there is a mismatch, but again, I don't think it's Polyfactory's work to do this, and parsing the pattern would be rather expensive.

IMO, when there is a pattern defined, Polyfactory should just respect it, ignoring any other restriction that may conflict (like min_length or max_length). Maybe a warning can be fired if regex restrictions are found together with potentially conflicting others. This should be documented in any case.

Actually, reading the linked SO answer it doesn't seem like it would be too hard to implement something that can at least tell what the minimum and maximum bounds for a regex are, and error out if they're not within some user-defined conditions. Turn the regex into an NFA, remove reject states, then use Dijkstra (or BF) to find out the shortest path to an accepting state (the resulting path length will be the minimum length that such automaton will accept). For the maximum length, use DFS or a similar algorithm to detect if the graph of the automaton has cycles: if it does, the max length of a match is infinity, if it doesn't then find the longest path that ends in an accept state (probably by brute force, as any algorithms to do so are probably slow anyway). This would at least make sure that the library can detect an invalid combination of pattern/bounds, although it doesn't strictly speaking solve the generation part of the problem. Still, IMHO, better than nothing

berzi commented 10 months ago

I think logically there is no way to consistently solve this problem. Regex pattern complexity is potentially unbounded, and there is no way to know where to restrict the length of an output string without carefully parsing the regex, which can yield very complex results (imagine many nested groups and |s), which can even conflict! Nothing really stops a user from using a pattern that violates the min_length or max_length restrictions, and if a system to verify these restrictions were to be added, it would fall upon Pydantic (in this case) to implement it. I think Polyfactory shouldn't try to solve such a hard problem. This SO question asks for how to calculate the maximum and minimum length of strings that would match a regex (with no practical or easy solution at the time of writing this) which could be used to simply compare the minimum and maximum with any min_length and max_length to fail early if there is a mismatch, but again, I don't think it's Polyfactory's work to do this, and parsing the pattern would be rather expensive. IMO, when there is a pattern defined, Polyfactory should just respect it, ignoring any other restriction that may conflict (like min_length or max_length). Maybe a warning can be fired if regex restrictions are found together with potentially conflicting others. This should be documented in any case.

Actually, reading the linked SO answer it doesn't seem like it would be too hard to implement something that can at least tell what the minimum and maximum bounds for a regex are, and error out if they're not within some user-defined conditions. Turn the regex into an NFA, remove reject states, then use Dijkstra (or BF) to find out the shortest path to an accepting state (the resulting path length will be the minimum length that such automaton will accept). For the maximum length, use DFS or a similar algorithm to detect if the graph of the automaton has cycles: if it does, the max length of a match is infinity, if it doesn't then find the longest path that ends in an accept state (probably by brute force, as any algorithms to do so are probably slow anyway). This would at least make sure that the library can detect an invalid combination of pattern/bounds, although it doesn't strictly speaking solve the generation part of the problem. Still, IMHO, better than nothing

That sounds like a significant performance cost (consider that users may have many fields with arbitrarily large regexes) for not very much, especially if the best goal that can be achieved is guarding for errors: the alternative of simply detecting the presence of quasi-incompatible fields is computationally negligible and achieves the same.

In practice, any user that writes a regex either can encode the maximum length in the regex itself or automatically receives a length requirement. A pattern like abc[0-9] will only ever accept a length of exactly 4; a pattern like the OP's ^a+b$ could (or should, even) just as easily replace the + with a length rule. I can't think of any real use case where regex and min/max lengths have a reason to exist together, but allowing this to be done is, as you yourself describe, rather cumbersome and imperfect, even if it's technically possible.

nocturn9x commented 10 months ago

I think logically there is no way to consistently solve this problem. Regex pattern complexity is potentially unbounded, and there is no way to know where to restrict the length of an output string without carefully parsing the regex, which can yield very complex results (imagine many nested groups and |s), which can even conflict! Nothing really stops a user from using a pattern that violates the min_length or max_length restrictions, and if a system to verify these restrictions were to be added, it would fall upon Pydantic (in this case) to implement it. I think Polyfactory shouldn't try to solve such a hard problem. This SO question asks for how to calculate the maximum and minimum length of strings that would match a regex (with no practical or easy solution at the time of writing this) which could be used to simply compare the minimum and maximum with any min_length and max_length to fail early if there is a mismatch, but again, I don't think it's Polyfactory's work to do this, and parsing the pattern would be rather expensive. IMO, when there is a pattern defined, Polyfactory should just respect it, ignoring any other restriction that may conflict (like min_length or max_length). Maybe a warning can be fired if regex restrictions are found together with potentially conflicting others. This should be documented in any case.

Actually, reading the linked SO answer it doesn't seem like it would be too hard to implement something that can at least tell what the minimum and maximum bounds for a regex are, and error out if they're not within some user-defined conditions. Turn the regex into an NFA, remove reject states, then use Dijkstra (or BF) to find out the shortest path to an accepting state (the resulting path length will be the minimum length that such automaton will accept). For the maximum length, use DFS or a similar algorithm to detect if the graph of the automaton has cycles: if it does, the max length of a match is infinity, if it doesn't then find the longest path that ends in an accept state (probably by brute force, as any algorithms to do so are probably slow anyway). This would at least make sure that the library can detect an invalid combination of pattern/bounds, although it doesn't strictly speaking solve the generation part of the problem. Still, IMHO, better than nothing

That sounds like a significant performance cost (consider that users may have many fields with arbitrarily large regexes) for not very much, especially if the best goal that can be achieved is guarding for errors: the alternative of simply detecting the presence of quasi-incompatible fields is computationally negligible and achieves the same.

In practice, any user that writes a regex either can encode the maximum length in the regex itself or automatically receives a length requirement. A pattern like abc[0-9] will only ever accept a length of exactly 4; a pattern like the OP's ^a+b$ could (or should, even) just as easily replace the + with a length rule. I can't think of any real use case where regex and min/max lengths have a reason to exist together, but allowing this to be done is, as you yourself describe, rather cumbersome and imperfect, even if it's technically possible.

Heh, wish I could tell that to a client of ours that is hellbent on enforcing both. Alas, the current solution I implemented is the good ol' try until you succeed method (also known as generating the string in a while loop until Pydantic is happy, which is utterly horrible, but it works :')). For what it's worth though, I agree: mixing regular expressions and length constraints constitutes a misuse of the tool

berzi commented 10 months ago

I think logically there is no way to consistently solve this problem. Regex pattern complexity is potentially unbounded, and there is no way to know where to restrict the length of an output string without carefully parsing the regex, which can yield very complex results (imagine many nested groups and |s), which can even conflict! Nothing really stops a user from using a pattern that violates the min_length or max_length restrictions, and if a system to verify these restrictions were to be added, it would fall upon Pydantic (in this case) to implement it. I think Polyfactory shouldn't try to solve such a hard problem. This SO question asks for how to calculate the maximum and minimum length of strings that would match a regex (with no practical or easy solution at the time of writing this) which could be used to simply compare the minimum and maximum with any min_length and max_length to fail early if there is a mismatch, but again, I don't think it's Polyfactory's work to do this, and parsing the pattern would be rather expensive. IMO, when there is a pattern defined, Polyfactory should just respect it, ignoring any other restriction that may conflict (like min_length or max_length). Maybe a warning can be fired if regex restrictions are found together with potentially conflicting others. This should be documented in any case.

Actually, reading the linked SO answer it doesn't seem like it would be too hard to implement something that can at least tell what the minimum and maximum bounds for a regex are, and error out if they're not within some user-defined conditions. Turn the regex into an NFA, remove reject states, then use Dijkstra (or BF) to find out the shortest path to an accepting state (the resulting path length will be the minimum length that such automaton will accept). For the maximum length, use DFS or a similar algorithm to detect if the graph of the automaton has cycles: if it does, the max length of a match is infinity, if it doesn't then find the longest path that ends in an accept state (probably by brute force, as any algorithms to do so are probably slow anyway). This would at least make sure that the library can detect an invalid combination of pattern/bounds, although it doesn't strictly speaking solve the generation part of the problem. Still, IMHO, better than nothing

That sounds like a significant performance cost (consider that users may have many fields with arbitrarily large regexes) for not very much, especially if the best goal that can be achieved is guarding for errors: the alternative of simply detecting the presence of quasi-incompatible fields is computationally negligible and achieves the same. In practice, any user that writes a regex either can encode the maximum length in the regex itself or automatically receives a length requirement. A pattern like abc[0-9] will only ever accept a length of exactly 4; a pattern like the OP's ^a+b$ could (or should, even) just as easily replace the + with a length rule. I can't think of any real use case where regex and min/max lengths have a reason to exist together, but allowing this to be done is, as you yourself describe, rather cumbersome and imperfect, even if it's technically possible.

Heh, wish I could tell that to a client of ours that is hellbent on enforcing both. Alas, the current solution I implemented is the good ol' try until you succeed method (also known as generating the string in a while loop until Pydantic is happy, which is utterly horrible, but it works :')). For what it's worth though, I agree: mixing regular expressions and length constraints constitutes a misuse of the tool

If you're interested, I have an old (now abandoned) project where I basically made an engine to make regexes generate strings rather than parse them: https://gitlab.com/brzrkr/genex

I abandoned it because it stopped being fun, but it kinda works. Except it explodes if the regex is too complex. But it might serve as inspiration I guess. (Disclaimer: it's old code so it might suck.)