unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

pa.dataframe_check causes generated data to be invalid #524

Closed (ghost closed this issue 3 years ago)

ghost commented 3 years ago

Hi, below is a minimal example of the question / possible bug. Is it expected that, after adding a dataframe_check in SchemaWithDFCheck, the schema no longer generates data that is valid according to the field column checks? The example is still somewhat non-deterministic; please let me know if there is a better way than @seed(10) to get reproducible results.

I do get a warning when running this example, but I would have expected the generated data to still be valid.

UserWarning: Dataframe check doesn't have a defined strategy. Falling back to filtering drawn values based on the check definition. This can considerably slow down data-generation.
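
To make the expectation concrete, here is a minimal standard-library sketch of what "falling back to filtering drawn values" could look like. It is an illustration of the general rejection-sampling idea only, not pandera's or hypothesis's actual implementation, and every name in it (draw_positive, example_with_filter, and so on) is hypothetical. The point is that a table-level filter applied on top of a constrained element-level draw should never loosen the element-level constraint.

```python
import random


def draw_positive(rng):
    """Hypothetical element-level strategy: only yields values > 0 (like gt=0)."""
    return rng.uniform(1e-9, 1.0)


def draw_column(rng, size):
    """Draw a column of `size` values, each from the constrained strategy."""
    return [draw_positive(rng) for _ in range(size)]


def non_empty(column):
    """Hypothetical table-level check, mirroring `not df.empty`."""
    return len(column) > 0


def example_with_filter(rng, size, table_check, max_tries=100):
    """Redraw whole columns until the table-level check passes.

    Each candidate still comes from the constrained element strategy,
    so the filter can only reject candidates, never relax gt=0.
    """
    for _ in range(max_tries):
        column = draw_column(rng, size)
        if table_check(column):
            return column
    raise RuntimeError("no draw satisfied the table-level check")


rng = random.Random(10)
column = example_with_filter(rng, size=1, table_check=non_empty)
```

Under this model the filtered example keeps every value strictly positive, which is the behavior I expected from SchemaWithDFCheck.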

Versions:

import pandera as pa
import pandas as pd
from pandera.typing import Series
from hypothesis import seed

class Schema(pa.SchemaModel):
    field: Series[float] = pa.Field(gt=0)

class SchemaWithDFCheck(Schema):
    @pa.dataframe_check
    def non_empty(self, df: pd.DataFrame) -> bool:
        return not df.empty

@seed(10)
def test():
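    # Without the dataframe check, the generated value satisfies gt=0.
    print(Schema.example(size=1))
    '''
    >>> field
    0   4.940656e-324
    '''
    # With the dataframe check, the generated value 0.0 violates gt=0.
    print(SchemaWithDFCheck.example(size=1))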
    '''
    >>> field
    0   0.0
    '''

if __name__ == '__main__':
    test()

cosmicBboy commented 3 years ago

this is certainly a bug, thanks for catching @bphillips-exos! looking into it

cosmicBboy commented 3 years ago

fixed by #550. @bphillips-exos, this should be out in the next release, 0.6.5!
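
For anyone wanting to confirm the behavior after upgrading, a small regression check along these lines may help. It is only a sketch: the helper name assert_examples_valid is hypothetical, and it assumes nothing about the schema beyond the example(size=...) and validate(df) methods already used in this thread, so it is written against any object that exposes those two methods.

```python
def assert_examples_valid(schema, n_examples=5, size=1):
    """Draw several examples from `schema` and validate each against it.

    `schema` is any object with `example(size=...)` and `validate(df)`;
    the helper itself is hypothetical and not part of pandera.
    Validation errors raised by `schema.validate` propagate to the caller.
    Returns the number of examples that were checked.
    """
    for _ in range(n_examples):
        df = schema.example(size=size)
        schema.validate(df)
    return n_examples


# Intended usage with the schemas from the report above
# (requires pandera and hypothesis to be installed):
#
#     assert_examples_valid(Schema)
#     assert_examples_valid(SchemaWithDFCheck)
```

On an affected version one would expect the second call to fail with a schema validation error whenever a value such as 0.0 is drawn; with the fix, both calls should pass.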