Open schlich opened 3 years ago
hi @schlich, thanks for opening up this discussion.
Before going any further, are you aware of the strategy method? Calling it basically creates a hypothesis strategy object that generates data. In line with what hypothesis recommends, example isn't the recommended way of synthesizing data; it's basically for debugging (example just wraps strategy). Here's an example of using the strategy method.
I understand the gist of what you're going for, though I'm still a little unclear on what an integration would look like. Since DataFrameSchema.strategy outputs a hypothesis strategy object, you can further constrain it with hypothesis, and even use it with Factory Boy in your own test methods.
It also might help to provide a small reproducible code snippet illustrating (a) the problem and what you're currently doing to solve it and (b) your ideal solution.
If there is an existing pattern that allows for overriding of fields during example generation (preferably without needing to modify the schema), please let me know!
Something like this would work:
import pandera as pa
from pandera.typing import Series

class Schema(pa.SchemaModel):
    column1: Series[int]
    column2: Series[float]
    column3: Series[str]

strategy = Schema.strategy(size=5)
# replace column1 with 100s
new_strategy = strategy.map(lambda df: df.assign(column1=100))
print(new_strategy.example())
# use new_strategy in tests, etc.
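The map/assign pattern above can be wrapped in a small helper that takes the overrides as keyword arguments, which is close to the Factory Boy style of pinning a few fields per test. This is only a sketch: with_overrides is a hypothetical name, not part of pandera or hypothesis, and it assumes the strategy yields pandas DataFrames.

```python
from typing import Any


def with_overrides(strategy: Any, **overrides: Any) -> Any:
    """Return a new strategy whose generated dataframes have the given
    columns replaced by the override values.

    `strategy` is assumed to be a hypothesis strategy that yields pandas
    DataFrames (for example the result of a schema's `strategy()` call).
    `with_overrides` is a hypothetical helper, not a pandera API.
    """
    # .map() applies the function to every generated dataframe;
    # DataFrame.assign returns a copy with the named columns replaced.
    return strategy.map(lambda df: df.assign(**overrides))
```

With the Schema above, usage would look like with_overrides(Schema.strategy(size=5), column1=100). Note that the overridden values bypass the schema's constraints, so it may be worth validating the result against the schema again if the overrides could violate a check.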
I think the problem might come with the complexity of my dataframes -- i have upwards of 20 columns/index levels, many with dependencies on each other. I began trying to approach my problem with strategies but not only did the strategies end up being very complex (for example, there is no existing check that multiindex rows are unique, or a check to guarantee that each element from a list of choices appears at least once, among other things), but the strategies were prohibitively expensive for my TDD workflow, taking up to 30 seconds for each test, even after reducing the number of examples (reducing the size to about 5-10 worked fine, but that seems to defeat the purpose of hypothesis). I tried everything mentioned here but nothing seemed to fit. Additionally, registering all these custom checks and strategies with Pandera was not a simple task. Since simple is better than complex, i got frustrated and ditched the strategies for the factory boy approach.
I would much rather be able to unit-test by example and integration test by strategy. Happy to be proven wrong, but i need my test runs to be quick!
When I have time, I will try the map/assign pattern above and see if i make any headway. I'll work on producing some reduced examples as well.
Thanks for continuing the discussion @schlich!
One thing I want to emphasize is that pandera's data synthesis capabilities are still in the early days, but there is one constraint that I think is important to preserve: the symmetry between defining a schema and generating data directly from the constraints defined in the schema itself. Of course the user can always decide to adapt a DataFrameSchema.strategy, but I think there are a couple of things here to make the UX better using pandera.
i have upwards of 20 columns/index levels, many with dependencies on each other
Yeah, unfortunately pandera doesn't really lend itself to generating data with a lot of interdependencies between columns. There is this issue #371, but specifics haven't been figured out yet. Will need to think about how to do that. If you don't mind, could you expand on what kinds of dependencies you rely on? It would be helpful to inform the solution for #371.
for example, there is no existing check that multiindex rows are unique
After #390 is implemented, adding this functionality to strategies would be pretty straightforward.
or a check to guarantee that each element from a list of choices appears at least once
I think that could be implemented as a built-in Check (with an associated strategy), since isin doesn't guarantee this.
I also wanted to point you to the extensions module that lets you register checks into the pa.Check namespace. It still doesn't get at conditional dependencies, but it would allow you to implement the above check.
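As a rough sketch of the two checks discussed above (unique multiindex rows, and every choice appearing at least once), the underlying predicates can be written as plain functions. The names rows_are_unique and covers_all_choices are made up for illustration. They accept any iterable, so a pandas MultiIndex (which iterates as tuples) or a Series should work as input, and they could then be wrapped in a pa.Check or registered through the extensions module.

```python
from typing import Hashable, Iterable


def rows_are_unique(rows: Iterable[Hashable]) -> bool:
    """True if no row key (e.g. a MultiIndex tuple) appears more than once."""
    seen = set()
    for row in rows:
        if row in seen:
            return False
        seen.add(row)
    return True


def covers_all_choices(
    values: Iterable[Hashable], choices: Iterable[Hashable]
) -> bool:
    """True if every element of `choices` appears at least once in `values`.

    This is stricter than an isin-style check, which only requires that
    each value is one of the choices.
    """
    return set(choices) <= set(values)
```

These only cover validation. Generating data that satisfies them efficiently would still need an associated strategy.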
but the strategies were prohibitively expensive for my TDD workflow, taking up to 30 seconds for each test, even after reducing the number of examples (reducing the size to about 5-10 worked fine, but that seems to defeat the purpose of hypothesis)
how many samples do you need for your tests to be meaningful?
Since simple is better than complex, i got frustrated and ditched the strategies for the factory boy approach.
I wouldn't be opposed to a Factory Boy integration. I think we can further optimize the functionality (and API) of the hypothesis integration and see how far we can get, then consider another integration if we still find performance lacking.
As a side note, I've been using the dataframe checks for dependencies between columns. You can add your own using the extensions API as @cosmicBboy said.
That said, I think there's a fundamental mismatch between the bottom-up and filter approach that pandera takes and the top-down approach that would be required to generate dataframes with complex dependencies efficiently. That's not a bad thing; it's an optimization for the simpler, and by far the most common, case.
I wonder if we could abstract out the current strategy implementation as one high-level approach, but allow users to replace that with their own top-level dataframe strategy if they want? This proposed FactoryBoy integration could be implemented as a coarse-grained sampling strategy. Was that what you were thinking in #371?
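To make the top-down idea concrete, here is a minimal sketch using hypothesis's composite decorator: the independent columns are drawn first and the dependent column is derived from them, so the dependency holds by construction instead of being enforced by filtering. The function name order_frames and the column names (quantity, unit_price, total) are invented for illustration and do not come from this thread or from pandera.

```python
import pandas as pd
from hypothesis import strategies as st


@st.composite
def order_frames(draw, n_rows: int = 5) -> pd.DataFrame:
    """Hypothetical top-down dataframe strategy with a derived column."""
    # Draw the independent columns first, with a fixed number of rows.
    quantity = draw(
        st.lists(st.integers(min_value=1, max_value=10),
                 min_size=n_rows, max_size=n_rows)
    )
    unit_price = draw(
        st.lists(st.integers(min_value=1, max_value=100),
                 min_size=n_rows, max_size=n_rows)
    )
    df = pd.DataFrame({"quantity": quantity, "unit_price": unit_price})
    # Derive the dependent column so that total == quantity * unit_price
    # is true for every generated example; nothing is rejected or filtered.
    df["total"] = df["quantity"] * df["unit_price"]
    return df
```

A strategy like this never discards candidates, which is usually where the cost of the filter-based approach comes from. The trade-off is that the generation logic is written by hand rather than derived from the schema.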
As a side note, I've been using the dataframe checks for dependencies between columns. You can add your own using the extensions API as @cosmicBboy said.
Yes, I'm trying to see if I can accomplish what I need to with the "group" options in the checks (the documentation here could use a little bit of a facelift IMO) but my guess is it will only take me about 50% of the way there... stay tuned!
great discussion!
okay, there are a few points that were brought up here worth decomposing:
1. The extensions API only provides customization at the Check level.
2. A strategy kwarg to the schema and schema_component objects, e.g. DataFrameSchema and Column, which will override the default strategy provided by pandera, where the user is expected to implement all of the data generation logic to fulfill the constraints in the schema.

Yes, I'm trying to see if I can accomplish what I need to with the "group" options in the checks (the documentation here could use a little bit of a facelift IMO)
Yes, the groupby check options are not ideal, and I'm working on some updates to make this more intuitive... that said the native support for conditional checks is something that might be generally useful but also for this particular use case.
Re: Factory Boy integration, the fact that hypothesis and Factory Boy seem to play well together, according to the post referenced by @schlich, tells me that for now pandera's entrypoint to synthesizing data should stick with the hypothesis nomenclature and semantics (i.e. the DataFrameSchema.strategy() and example() methods).
This proposed FactoryBoy integration could be implemented as a coarse-grained sampling strategy. Was that what you were thinking in #371?
Not quite, I think the coarse-grained sampling strategy would be more in line with (2), for #371 I was thinking of native support for conditional checks with a default implementation for the hypothesis strategy. For (2) I'm thinking something like:
import pandera as pa
from hypothesis.strategies import SearchStrategy, builds

from factories import SomeFactory  # a user-defined Factory Boy factory


# takes one argument, the schema object that this strategy is applied to,
# and returns a hypothesis strategy
def custom_strategy(schema: pa.DataFrameSchema) -> SearchStrategy:
    # e.g. using factory boy, but this could also be a hypothesis strategy
    return builds(SomeFactory.build, ...)


# proposed API: the strategy kwarg is the idea under discussion in (2)
schema = pa.DataFrameSchema(
    ...,
    strategy=custom_strategy
)
strategy = schema.strategy()  # to use in test suite
example = schema.example()  # to generate examples on the fly for debugging
Mostly for the purposes of testing and example generation, I would like to see Pandera's schema.example function incorporate patterns similar to Factory Boy:

While this might seem at first like a less-thorough version of what Hypothesis does, the bolded part above (emphasis mine) outlines the functionality I am looking for -- the ability to further constrain properties of a DataFrame in a manner appropriate for testing. While conceivably I can create a new schema with further restrictions, this seems like it would get quickly out of hand, and does not incorporate the advantages offered by Factory Boy's pattern.
Here's what the author of Hypothesis has to say about using Hypothesis with Factory Boy:
Essentially I would like to see UserFactory.build replaced with schema.example. It might be feasible to allow example to take extra kwargs that may then set values for dataframe columns and indexes.

As is, I've had to resort to constructing my factories without Pandera -- my DF has a lot of columns, and thus I had to deal with a lot of code duplication that could have easily been handled by the schema.
If there is an existing pattern that allows for overriding of fields during example generation (preferably without needing to modify the schema), please let me know!