unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Integrate with Factory Boy? Or other solution? #470

Open schlich opened 3 years ago

schlich commented 3 years ago

Mostly for the purposes of testing and example generation, I would like to see Pandera's schema.example function incorporate patterns similar to Factory Boy's:

Instead of building an exhaustive test setup with every possible combination of corner cases, factory_boy allows you to **use objects customized for the current test, while only declaring the test-specific fields**.

While at first this might seem like a less-thorough version of what Hypothesis does, the bolded part above (emphasis mine) outlines the functionality I am looking for -- the ability to further constrain properties of a DataFrame in a manner appropriate for testing. While I could conceivably create a new schema with further restrictions, that seems like it would quickly get out of hand, and it does not incorporate the advantages offered by Factory Boy's pattern.

Here's what the author of Hypothesis has to say about using Hypothesis with Factory Boy:

Both Factory Boy and Hypothesis are designed along a “we’re a library, not a framework” approach. Further, factory boy is set up to take arbitrary values, Hypothesis is set up to provide them, so you can easily feed the latter into the former. For example, the following defines a strategy that uses a factory boy UserFactory object to parametrize over unsaved user objects with an arbitrary first name:

from hypothesis import given
from hypothesis.strategies import builds, text

class TestUser(TestCase):
    @given(builds(UserFactory.build, first_name=text(max_size=50)))
    def test_can_save_a_user(self, user):
        user.save()

Essentially, I would like to see UserFactory.build replaced with schema.example. It might be feasible to allow example to take extra kwargs that set values for dataframe columns and indexes, as sketched below.
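
To make the request concrete, here is a purely hypothetical sketch of the kwargs-based example API being asked for (this does not exist in pandera today; the column names are only illustrative):

import pandera as pa

schema = pa.DataFrameSchema({
    "user_id": pa.Column(int, pa.Check.ge(0)),
    "first_name": pa.Column(str),
    "score": pa.Column(float),
})

# proposed/hypothetical: synthesize a schema-valid frame while pinning
# test-specific columns, the way factory_boy pins fields
# df = schema.example(size=3, first_name="alice", score=0.5)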

As it is, I've had to resort to constructing my factories without Pandera -- my DataFrame has a lot of columns, so I had to deal with a lot of code duplication that could easily have been handled by the schema.

If there is an existing pattern that allows for overriding of fields during example generation (preferably without needing to modify the schema), please let me know!

cosmicBboy commented 3 years ago

hi @schlich, thanks for opening up this discussion.

Before going any further, are you aware of the strategy method? Calling it basically creates a hypothesis strategy object that generates data. In line with what hypothesis recommends, example isn't the recommended way of synthesizing data; it's basically for debugging (example just wraps strategy). Here's an example of using the strategy method.

I understand the gist of what you're going for, though I'm still a little unclear on what an integration would look like. Since DataFrameSchema.strategy outputs a hypothesis strategy object, you can further constrain it with hypothesis, and even use it with Factory Boy in your own test methods.

It also might help to provide a small reproducible code snippet illustrating (a) the problem and what you're currently doing to solve it, and (b) your ideal solution.

If there is an existing pattern that allows for overriding of fields during example generation (preferably without needing to modify the schema), please let me know!

Something like this would work:

import pandera as pa
from pandera.typing import Series

class Schema(pa.SchemaModel):
    column1: Series[int]
    column2: Series[float]
    column3: Series[str]

strategy = Schema.strategy(size=5)

# replace column1 with 100s
new_strategy = strategy.map(lambda df: df.assign(column1=100))
print(new_strategy.example())

# use new_strategy in tests, etc.
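
For completeness, a strategy like this can be fed straight into a hypothesis-driven test; a minimal sketch, with an illustrative test body and example count:

from hypothesis import given, settings

@given(new_strategy)
@settings(max_examples=10)  # keep runs fast for tight feedback loops
def test_constrained_column(df):
    Schema.validate(df)                   # still satisfies the schema
    assert (df["column1"] == 100).all()   # and the override is applied
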
schlich commented 3 years ago

I think the problem might come with the complexity of my dataframes -- I have upwards of 20 columns/index levels, many with dependencies on each other. I began trying to approach my problem with strategies, but not only did the strategies end up being very complex (for example, there is no existing check that multiindex rows are unique, or a check to guarantee that each element from a list of choices appears at least once, among other things), they were also prohibitively expensive for my TDD workflow, taking up to 30 seconds for each test even after reducing the number of examples (reducing the size to about 5-10 worked fine, but that seems to defeat the purpose of hypothesis). I tried everything mentioned here but nothing seemed to fit. Additionally, registering all these custom checks and strategies with Pandera was not a simple task. Since simple is better than complex, I got frustrated and ditched the strategies for the factory boy approach.

I would much rather be able to unit-test by example and integration-test by strategy. Happy to be proven wrong, but I need my test runs to be quick!

When I have time, I will try the map/assign pattern above and see if I make any headway. I'll work on producing some reduced examples as well.

cosmicBboy commented 3 years ago

Thanks for continuing the discussion @schlich!

One thing I want to emphasize is that pandera's data synthesis capabilities are still in their early days, but there is one constraint that I think is important to preserve: the symmetry between defining a schema and generating data directly from the constraints defined in the schema itself. Of course the user can always decide to adapt a DataFrameSchema.strategy, but I think there are a couple of things we can do here to make the UX better using pandera.

i have upwards of 20 columns/index levels, many with dependencies on each other

Yeah, unfortunately pandera doesn't really lend itself to generating data with a lot of interdependencies between columns. There is this issue #371, but specifics haven't been figured out yet. Will need to think about how to do that. If you don't mind, could you expand on what kinds of dependencies you rely on? It would be helpful to inform the solution for #371.

for example, there is no existing check that multiindex rows are unique

After #390 is implemented, adding this functionality to strategies would be pretty straightforward.

or a check to guarantee that each element from a list of choices appears at least once

I think that could be implemented as a built-in Check (with an associated strategy), since isin doesn't guarantee this.

I also wanted to point you to the extensions module that lets you register custom checks in the pa.Check namespace. It still doesn't get at conditional dependencies, but it would allow you to implement the above check, as sketched below.
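
For reference, a minimal sketch of registering such a check with the extensions API (the check name, statistic, and column are illustrative; a data synthesis strategy for the check would still need to be registered separately):

import pandera as pa
import pandera.extensions as extensions

@extensions.register_check_method(statistics=["choices"])
def covers_all_choices(pandas_obj, *, choices):
    # every element must come from ``choices`` and every choice must
    # appear at least once (which a plain ``isin`` check can't guarantee)
    return pandas_obj.isin(choices).all() and set(choices) <= set(pandas_obj.unique())

schema = pa.DataFrameSchema({
    "category": pa.Column(str, pa.Check.covers_all_choices(choices=["a", "b", "c"])),
})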

but the strategies were prohibitively expensive for my TDD workflow, taking up to 30 seconds for each test, even after reducing the number of examples (reducing the size to about 5-10 worked fine, but that seems to defeat the purpose of hypothesis)

how many samples do you need for your tests to be meaningful?

Since simple is better than complex, i got frustrated and ditched the strategies for the factory boy approach.

I wouldn't be opposed to a factory boy integration, but I think we can first further optimize the functionality (and API) of the hypothesis integration and see how far we can get, then consider another integration if we still find performance lacking.

antonl commented 3 years ago

As a side note, I've been using the dataframe checks for dependencies between columns. You can add your own using the extensions API as @cosmicBboy said.

That said, I think there's a fundamental mismatch between the bottom-up, filter-based approach that pandera takes and the top-down approach that would be required to generate dataframes with complex dependencies efficiently. That's not a bad thing; it's an optimization for the simpler, but by far the most common, case.

I wonder if we could abstract out the current strategy implementation as one high-level approach, but allow users to replace that with their own top-level dataframe strategy if they want? This proposed FactoryBoy integration could be implemented as a coarse-grained sampling strategy. Was that what you were thinking in #371?

schlich commented 3 years ago

As a side note, I've been using the dataframe checks for dependencies between columns. You can add your own using the extensions API as @cosmicBboy said.

Yes, I'm trying to see if I can accomplish what I need to with the "group" options in the checks (the documentation here could use a little bit of a facelift IMO), but my guess is it will only take me about 50% of the way there... stay tuned!
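
For context, the "group" option referred to here is the groupby argument on Check, which passes the check a dict mapping each group to its Series; a minimal sketch with illustrative columns:

import pandera as pa

schema = pa.DataFrameSchema({
    "height": pa.Column(
        float,
        # the check receives {"M": Series, "F": Series}, grouped by the "sex" column
        pa.Check(lambda groups: groups["M"].mean() > groups["F"].mean(), groupby="sex"),
    ),
    "sex": pa.Column(str, pa.Check.isin(["M", "F"])),
})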

cosmicBboy commented 3 years ago

great discussion!

okay, there are a few points that were brought up here worth decomposing:

  1. the schema constraints should only provide a default data synthesis strategy that is potentially inefficient for more complex use cases.
  2. for those complex use cases, there should be an elegant way for users to "bring their own strategy" at multiple levels of the pandera API.
    • the current extensions API only provides customization at the Check level
    • we might consider adding a strategy kwarg to the schema and schema_component objects, e.g. DataFrameSchema and Column, which would override the default strategy provided by pandera; the user would then be expected to implement all of the data generation logic needed to fulfill the constraints in the schema.
  3. native support for conditional checks: there has been some discussion of this, but the details are a little challenging to figure out because conditional checks need to (i) be well-designed for both the object- and class-based APIs and (ii) have a reasonable implementation with respect to data synthesis strategies.

Yes, I'm trying to see if I can accomplish what I need to with the "group" options in the checks (the documentation here could use a little bit of a facelift IMO)

Yes, the groupby check options are not ideal, and I'm working on some updates to make this more intuitive... that said, native support for conditional checks is something that might be useful both in general and for this particular use case.

Re: Factory Boy integration, I think the fact that hypothesis and factory boy seem to play well together, according to the post referenced by @schlich, tells me that for now pandera's entrypoint to synthesizing data should stick with the hypothesis nomenclature and semantics (i.e. the DataFrameSchema.strategy() and example() methods).

This proposed FactoryBoy integration could be implemented as a coarse-grained sampling strategy. Was that what you were thinking in #371?

Not quite; I think the coarse-grained sampling strategy would be more in line with (2). For #371 I was thinking of native support for conditional checks with a default implementation for the hypothesis strategy. For (2) I'm thinking something like:

import pandera as pa
from hypothesis.strategies import SearchStrategy, builds
from factories import SomeFactory

# takes one argument, the schema object that this strategy is applied to,
# and returns a hypothesis strategy
def custom_strategy(schema: pa.DataFrameSchema) -> SearchStrategy:
    # e.g. using factory boy, but this could also be a plain hypothesis strategy
    return builds(SomeFactory.build, ...)

schema = pa.DataFrameSchema(
    ...,
    strategy=custom_strategy
)

strategy = schema.strategy()  # to use in test suite
example = schema.example()  # to generate examples on the fly for debugging
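
A column-level variant of the same idea might look something like this (again purely hypothetical, since the strategy kwarg does not exist yet; the strategy function's signature is assumed):

from hypothesis.strategies import SearchStrategy, sampled_from

# hypothetical: a per-column strategy that overrides pandera's default
def status_strategy(column: pa.Column) -> SearchStrategy:
    return sampled_from(["new", "active", "closed"])

schema = pa.DataFrameSchema({
    "status": pa.Column(str, strategy=status_strategy),  # hypothetical kwarg
    "score": pa.Column(float),  # falls back to the default schema-driven strategy
})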