litestar-org / polyfactory

Simple and powerful factories for mock data generation
https://polyfactory.litestar.dev/
MIT License

fix: respect both multiple_of and minimum/maximum constraints #505

Open richardxia opened 8 months ago

richardxia commented 8 months ago


Description

Library Code Changes

Previously, generate_constrained_number() would potentially generate invalid numbers when multiple_of is not None and exactly one of either minimum or maximum is not None, since it would just return multiple_of without respecting the upper or lower bound.
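As a concrete illustration of the failure mode (the numbers below are made up, not taken from the PR's tests): with multiple_of=3 and minimum=10, simply returning multiple_of yields 3, which is a valid multiple but violates the lower bound.

```python
# Hypothetical illustration of the old behavior, not actual library code.
multiple_of = 3
minimum = 10

buggy_result = multiple_of               # old code path: just return multiple_of
assert buggy_result % multiple_of == 0   # the multiple-of constraint holds...
assert not buggy_result >= minimum       # ...but the minimum bound is violated
```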

This significantly changes the implementation to handle that case correctly. The generate_constrained_number() method has been completely removed and replaced with a generate_constrained_multiple_of() function. A major difference between the old function and the new one is that the new one does not accept a method argument for generating random numbers: the new function always uses create_random_integer(), since the problem reduces to generating a random integer multiplier.

The high-level algorithm behind generate_constrained_multiple_of() is to constrain the random integer generator so that the numbers it produces, when multiplied by multiple_of, still fit within the original bounds constraints. This simply involves dividing the original bounds by multiple_of, with some special handling for negative multiple_of values as well as carefully chosen rounding behavior.
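A minimal sketch of that idea (illustrative only; the helper below is not the exact code in the PR, and it ignores the float/Decimal handling):

```python
import math
from random import randint
from typing import Optional

def generate_constrained_multiple_of_sketch(
    multiple_of: int,
    minimum: Optional[int],
    maximum: Optional[int],
) -> int:
    """Pick a random integer multiplier k so that k * multiple_of respects the bounds."""
    # Dividing each bound by multiple_of turns the problem into choosing an
    # integer multiplier.  A negative multiple_of flips the inequalities, so the
    # divided bounds swap roles, and rounding must always go "inwards".
    low: Optional[int] = None
    high: Optional[int] = None
    if minimum is not None:
        if multiple_of > 0:
            low = math.ceil(minimum / multiple_of)    # k * multiple_of >= minimum
        else:
            high = math.floor(minimum / multiple_of)  # inequality flips for negative multiple_of
    if maximum is not None:
        if multiple_of > 0:
            high = math.floor(maximum / multiple_of)  # k * multiple_of <= maximum
        else:
            low = math.ceil(maximum / multiple_of)
    # Fall back to an arbitrary window for any missing bound on the multiplier.
    if low is None and high is None:
        low, high = -1000, 1000
    elif low is None:
        low = high - 1000
    elif high is None:
        high = low + 1000
    return randint(low, high) * multiple_of
```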

We also need to make some changes to other functions.

get_increment() needs to take an additional argument for the value that the increment will be applied to. This is because floating-point numbers can't use a static increment, or it may get rounded away when the numbers are too large. Fortunately, Python provides a math.ulp() function for computing this increment for a given float value, so we make use of that function. We still use the original float_info.epsilon constant as a lower bound on the increment, though, since when the value is too close to zero we still need to make sure the increment doesn't disappear when used against other numbers.
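A rough sketch of that idea for floats (the actual signature in the PR may differ; note that math.ulp() requires Python 3.9+):

```python
import math
from sys import float_info

def float_increment_sketch(value: float) -> float:
    """Smallest increment that survives being added to ``value``.

    math.ulp(value) is the gap between ``value`` and the next representable
    float, so adding it is never rounded away.  Near zero the ulp becomes tiny,
    so float_info.epsilon is kept as a lower bound to make sure the increment
    also survives arithmetic against other, larger numbers.
    """
    return max(math.ulp(value), float_info.epsilon)
```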

Finally, we rename and modify passes_pydantic_multiple_validator() to is_almost_multiple_of(), modifying its implementation to defer the casting of values to float() to minimize rounding errors. This specifically affects Decimal numbers, where casting to float too early causes too much loss of precision.
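A hedged sketch of that check (names and tolerances are illustrative, not the exact code merged in the PR):

```python
import math
from decimal import Decimal
from typing import Union

Numeric = Union[int, float, Decimal]

def is_almost_multiple_of_sketch(value: Numeric, multiple_of: Numeric) -> bool:
    """Return True if ``value`` is approximately an integer multiple of ``multiple_of``.

    The division and modulo are done in the operands' own type -- in particular
    Decimal keeps its full precision -- and only the final fractional part is
    cast to float for the approximate comparison.
    """
    if multiple_of == 0:
        return True
    # Fractional part of value / multiple_of, computed without an early float() cast.
    fraction = float((value / multiple_of) % 1)
    # "Almost" a multiple means the fraction is close to either 0 or 1.
    return math.isclose(fraction, 0.0, abs_tol=1e-9) or math.isclose(fraction, 1.0, abs_tol=1e-9)
```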

Test Changes

A significant number of changes were made to the tests as well, since the original tests missed the bug being fixed here. Each of the integer, floating-point, and decimal tests has been updated to assert that the result actually falls within the minimum and maximum constraints. In addition, we remove some unnecessary sorting of the randomly generated test inputs, since it constrained multiple_of to always be greater than or less than the minimum and maximum values, which caused many of the scenarios involving negative values to be skipped.
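In the spirit of those added assertions, a hypothesis-style property test against the sketch shown earlier might look like this (purely illustrative, not the PR's actual test code):

```python
from hypothesis import given
from hypothesis import strategies as st

@given(
    multiple_of=st.integers(min_value=-100, max_value=100).filter(lambda n: n != 0),
    minimum=st.integers(min_value=-10_000, max_value=10_000),
    span=st.integers(min_value=200, max_value=10_000),
)
def test_respects_bounds_and_multiple_of(multiple_of: int, minimum: int, span: int) -> None:
    maximum = minimum + span  # keep the interval wide enough to always contain a multiple
    result = generate_constrained_multiple_of_sketch(multiple_of, minimum, maximum)
    # The assertions the original tests were missing: the generated value must
    # honor the bounds in addition to the multiple-of constraint.
    assert minimum <= result <= maximum
    assert result % multiple_of == 0
```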

Lastly, the floating-point and decimal tests need additional constraints to keep unrealistically extreme values from hitting precision issues. This was done by constraining the number of significant digits in the input numbers and the relative magnitudes of the input numbers.
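For instance (an illustrative hypothesis strategy, not the PR's actual one), restricting generated decimals to a modest range and a fixed number of decimal places keeps them away from the precision cliffs described above:

```python
from hypothesis import strategies as st

# Bounded range plus a fixed number of decimal places keeps rounding noise
# from pushing generated values outside the constraints under test.
constrained_decimals = st.decimals(
    min_value=-1_000_000,
    max_value=1_000_000,
    places=3,
    allow_nan=False,
    allow_infinity=False,
)
```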


guacs commented 8 months ago

@richardxia thanks for the writeup! I haven't had a chance to look into it in detail yet, but I'll do that soon :)

Also, feel free to change the API if it's necessary, or add a new function to handle these cases. For example, you mentioned passes_pydantic_multiple_validator - you can change that to handle floats and Decimal types differently if that'll make the implementation easier.

guacs commented 7 months ago

@richardxia sorry for the delay, but I've had a chance to look into the PR.

So I think the biggest issue we're facing here is the opaque nature of method in generate_constrained_number. I am comfortable with deprecating that and then using dedicated functions for each numerical type, i.e. generate_constrained_integer, generate_constrained_float, etc. These would have no need to take in a method but could decide how to create the constrained number depending on the constraints and other context. Furthermore, each of them could determine whether the generated number passes the constraints however they wish, without relying on passes_pydantic_multiple_validator (this is a faulty implementation in pydantic as well, as seen in this issue, though I think that's only for pydantic v1). WDYT?

However, these would have to be deterministic since users have to be able to reliably create the same instances with the same values if needed (based on the seeding of the Random instance on the factory).
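A speculative sketch of what such per-type helpers might look like (signatures here are illustrative, not an agreed-upon API); passing the factory's seeded Random instance is what would keep the results deterministic:

```python
from random import Random
from typing import Optional

def generate_constrained_integer(
    random: Random,
    minimum: Optional[int] = None,
    maximum: Optional[int] = None,
    multiple_of: Optional[int] = None,
) -> int:
    """Decide internally how to generate the value from the constraints; no ``method`` argument."""
    ...

def generate_constrained_float(
    random: Random,
    minimum: Optional[float] = None,
    maximum: Optional[float] = None,
    multiple_of: Optional[float] = None,
) -> float:
    """Float-specific constraint checks can live here instead of a shared pydantic-style validator."""
    ...
```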

I think we have room for making whatever changes we need here, and deprecate accordingly, without too many issues since most, if not all, users wouldn't be using these functions but only the APIs directly related to the factories.

guacs commented 7 months ago

Also, I think we should be fine as long as reasonable scenarios are working. If hypothesis keeps coming up with extreme cases that won't work, then we can just restrict hypothesis or something along those lines and document that such extreme cases may fail.

richardxia commented 7 months ago

@guacs, thanks for taking the time to review my PR and leaving comments. What you said sounds good to me, in terms of deprecating the method argument to generate_constrained_number(). I think I'll probably be busy for the rest of the work week, but I may have some time over the weekend to revisit this and hopefully come up with a simpler change.

richardxia commented 7 months ago

I pushed a completely rewritten solution to this branch, but it looks like I'm currently failing the Python 3.8 tests because I accidentally used a function that was only implemented in Python 3.9. I'll have to revisit this tomorrow.

In the meantime, please let me know if the new changes I made look like they're in the right direction. I made some breaking changes to the functions in constrained_numbers.py. I wasn't sure if it was important to keep these functions backwards-compatible, but let me know if it is, and I can think about how to do that. Preserving backwards compatibility for get_increment() will be tricky, since it's technically impossible for that function to do its job correctly with its current signature.

github-actions[bot] commented 7 months ago

Documentation preview will be available shortly at https://litestar-org.github.io/polyfactory-docs-preview/505