unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

hypothesis.errors.Unsatisfiable on Schema.to_schema().example() when SchemaModel has more than 38 fields #838

Closed g-simmons closed 1 year ago

g-simmons commented 2 years ago

Describe the bug

If a SchemaModel contains more than 38 fields, SchemaModel.to_schema().example() raises an error:

hypothesis.errors.Unsatisfiable: Unable to satisfy assumptions of example_generating_inner_function

Code Sample, a copy-pastable example

import pandera as pa
from pandera.typing import Series

class MyBaseSchema(pa.SchemaModel):
    field1: Series[str] = pa.Field()
    field2: Series[str] = pa.Field()
    field3: Series[str] = pa.Field()
    field4: Series[str] = pa.Field()
    field5: Series[str] = pa.Field()
    field6: Series[str] = pa.Field()
    field7: Series[str] = pa.Field()
    field8: Series[str] = pa.Field()
    field9: Series[str] = pa.Field()
    field10: Series[str] = pa.Field()
    field11: Series[str] = pa.Field()
    field12: Series[str] = pa.Field()
    field13: Series[str] = pa.Field()
    field14: Series[str] = pa.Field()
    field15: Series[str] = pa.Field()
    field16: Series[str] = pa.Field()
    field17: Series[str] = pa.Field()
    field18: Series[str] = pa.Field()
    field19: Series[str] = pa.Field()
    field20: Series[str] = pa.Field()
    field21: Series[str] = pa.Field()
    field22: Series[str] = pa.Field()
    field23: Series[str] = pa.Field()
    field24: Series[str] = pa.Field()
    field25: Series[str] = pa.Field()
    field26: Series[str] = pa.Field()
    field27: Series[str] = pa.Field()
    field28: Series[str] = pa.Field()
    field29: Series[str] = pa.Field()
    field30: Series[str] = pa.Field()
    field31: Series[str] = pa.Field()
    field32: Series[str] = pa.Field()
    field33: Series[str] = pa.Field()
    field34: Series[str] = pa.Field()
    field35: Series[str] = pa.Field()
    field36: Series[str] = pa.Field()
    field37: Series[str] = pa.Field()
    field38: Series[str] = pa.Field()
    field39: Series[str] = pa.Field()
    field40: Series[str] = pa.Field()

if __name__ == "__main__":
    dataframe = MyBaseSchema.to_schema().example(1)
    print(dataframe)

Expected behavior

No error should be raised; example() should generate an example DataFrame for the SchemaModel.

cosmicBboy commented 2 years ago

Hi @g-simmons, this looks like a performance issue in the dataframe strategy generation function. I suspect it has something to do with this: https://github.com/pandera-dev/pandera/blob/master/pandera/strategies.py#L1104-L1112

        for col_name, col_dtype in col_dtypes.items():
            if col_dtype in {"object", "str"} or col_dtype.startswith(
                "string"
            ):
                # pylint: disable=cell-var-from-loop,undefined-loop-variable
                strategy = strategy.map(
                    lambda df: df.assign(**{col_name: df[col_name].map(str)})
                )
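A side note on the pylint suppression in the snippet above: cell-var-from-loop flags lambdas that read a loop variable, because a Python closure looks the variable up when the lambda is called, not when it is defined. The sketch below shows that language rule with a plain dict standing in for a DataFrame. The helper names are hypothetical and none of this is pandera code, so it illustrates the general behavior rather than diagnosing this issue.

```python
# Hypothetical stand-in: a plain dict plays the role of one DataFrame row.

def build_transforms_late_binding(col_names):
    transforms = []
    for col_name in col_names:
        # col_name is looked up when the lambda runs, so every lambda
        # ends up seeing the value from the final loop iteration.
        transforms.append(lambda row: {**row, col_name: str(row[col_name])})
    return transforms


def build_transforms_bound(col_names):
    transforms = []
    for col_name in col_names:
        # A default argument captures the current value of col_name.
        transforms.append(
            lambda row, name=col_name: {**row, name: str(row[name])}
        )
    return transforms


def apply_all(row, transforms):
    for transform in transforms:
        row = transform(row)
    return row


row = {"a": 1, "b": 2, "c": 3}
late = apply_all(row, build_transforms_late_binding(["a", "b", "c"]))
bound = apply_all(row, build_transforms_bound(["a", "b", "c"]))

print(late)   # {'a': 1, 'b': 2, 'c': '3'}
print(bound)  # {'a': '1', 'b': '2', 'c': '3'}
```

With late binding only the last column is converted, three times over; binding the name explicitly converts each column once.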

It would be better to collect the string columns first and then convert them all in a single strategy.map call:

col_names = []
for col_name, col_dtype in col_dtypes.items():
    if col_dtype in {"object", "str"} or col_dtype.startswith(
        "string"
    ):
        col_names.append(col_name)

strategy = strategy.map(
    lambda df: df.assign(**{col_name: df[col_name].map(str) for col_name in col_names})
)
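To make the difference concrete without depending on hypothesis or pandas, here is a rough sketch that counts how many map layers each approach builds. FakeStrategy is a hypothetical stand-in that models only .map() and .example(), and a dict stands in for the DataFrame; the 40 string columns mirror the reproduction above. It is an assumption-laden illustration, not how pandera or hypothesis are implemented.

```python
class FakeStrategy:
    """Hypothetical stand-in for a strategy; only map/example are modelled."""

    def __init__(self, base, funcs=()):
        self._base = base
        self._funcs = tuple(funcs)

    def map(self, func):
        # Every call adds one more transformation layer.
        return FakeStrategy(self._base, self._funcs + (func,))

    @property
    def depth(self):
        return len(self._funcs)

    def example(self):
        value = self._base()
        for func in self._funcs:
            value = func(value)
        return value


def is_string_dtype(dtype):
    return dtype in {"object", "str"} or dtype.startswith("string")


# 40 string columns plus one integer column (assumed for illustration).
col_dtypes = {f"field{i}": "str" for i in range(1, 41)}
col_dtypes["count"] = "int64"


def base():
    # Every column starts as the integer 0.
    return {name: 0 for name in col_dtypes}


# Approach 1: one .map() per string column.
chained = FakeStrategy(base)
for col_name, col_dtype in col_dtypes.items():
    if is_string_dtype(col_dtype):
        chained = chained.map(
            lambda df, name=col_name: {**df, name: str(df[name])}
        )

# Approach 2: collect the string columns, then a single .map().
string_cols = [n for n, d in col_dtypes.items() if is_string_dtype(d)]
single = FakeStrategy(base).map(
    lambda df: {**df, **{n: str(df[n]) for n in string_cols}}
)

print(chained.depth, single.depth)  # 40 1
```

Both versions produce the same converted data; the single-map version just does it in one layer instead of one per column.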

I don't have the bandwidth to tackle this right now, but please feel free to make a PR for this! (also adding the "help wanted" tag)

g-simmons commented 2 years ago

@cosmicBboy Great, thanks for the input! I also probably don't have bandwidth to work on it now but will come back later if I do. Thanks!

cosmicBboy commented 1 year ago

this was fixed by https://github.com/unionai-oss/pandera/pull/989