unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License
3.05k stars 281 forks

Hypothesis examples are all the same #1579

Closed tmcclintock closed 1 week ago

tmcclintock commented 1 month ago

Describe the bug

Calling schema.example() generates size identical rows. This is not desirable, since the whole purpose of pandera + hypothesis is to create rich examples with a lot of variety across the rows.

Code Sample, a copy-pastable example

import hypothesis
import pandera
from pandera import Check, Column, DataFrameSchema

print(hypothesis.__version__, pandera.__version__)

schema = DataFrameSchema(
    {
        "column1": Column(int, Check.ge(0)),
        "column2": Column(int, [Check.ge(1), Check.le(100)]),
        "column3": Column(float, Check.ge(0)),
        "column4": Column(str, Check.isin(["AAA", "BBB", "CCC"])),
    }
)

print(schema.example(size=20))

yields

6.100.1 0.18.3
    column1  column2  column3 column4
0         0        1      0.0     AAA
1         0        1      0.0     AAA
2         0        1      0.0     AAA
3         0        1      0.0     AAA
4         0        1      0.0     AAA
5         0        1      0.0     AAA
6         0        1      0.0     AAA
7         0        1      0.0     AAA
8         0        1      0.0     AAA
9         0        1      0.0     AAA
10        0        1      0.0     AAA
11        0        1      0.0     AAA
12        0        1      0.0     AAA
13        0        1      0.0     AAA
14        0        1      0.0     AAA
15        0        1      0.0     AAA
16        0        1      0.0     AAA
17        0        1      0.0     AAA
18        0        1      0.0     AAA
19        0        1      0.0     AAA

Note that both pandera and hypothesis are at their latest versions.

Expected behavior

All 20 rows should look very different.

Additional context

I suspected this was related to #1503, but after doing pip install pandera==0.18.0 hypothesis the issue still persisted. The same thing happened when I dropped down to pandera 0.17.2.

Also the same behavior occurs when using DataFrameModels.

The only time I can get high-entropy rows is when checks is a single Check rather than a list[Check], like this:

schema = DataFrameSchema(
    {
        "column1": Column(int, Check.ge(0)),
        "column2": Column(int, Check.le(100)),
        "column3": Column(float, Check.ge(0)),
        "column4": Column(str, Check.isin(["AAA", "BBB", "CCC"])),
    }
)

The output of this is as expected:

              column1              column2       column3 column4
0               57854                   42  1.401298e-45     BBB
1               40006          -1198404347  1.192093e-07     BBB
2               44174                   55  2.220446e-16     CCC
3         12935430764 -4986092864707543051  1.000000e-05     AAA

However, this is far less useful than having all the checks apply.

cosmicBboy commented 1 month ago

Looks like this is an issue with the way pandera strategies try to chain together multiple checks. For example:

schema = DataFrameSchema(
    {
        "column1": Column(int, Check.ge(0)),
        "column2": Column(int, [Check.in_range(1, 100)]),  # 👈 use a single in_range check instead of ge and le
        "column3": Column(float, Check.ge(0)),
        "column4": Column(str, Check.isin(["AAA", "BBB", "CCC"])),
    }
)

produces

6.100.1 0.0.0+dev0
                column1  column2        column3 column4
0                     0        1   3.402823e+38     AAA
1                     0        1   2.882304e+16     CCC
2                     0        6   2.000010e+00     BBB
3                   247       47   9.999900e-01     BBB
4                 19526       50  1.390036e+164     AAA
5                 56223       63  2.225074e-308     AAA
6                    42       15   7.357397e+15     BBB
7                    97       62   9.999900e-01     CCC
8                     0       69   3.293796e+09     AAA
9   9216616637413720064        4   1.000000e+07     AAA
10    23090105669335094       14   5.397605e-78     CCC
11                    0       50   1.192093e-07     CCC
12           1260840409       98   1.500000e+00     AAA
13                21966       68   1.100000e+00     AAA
14                23289       21   3.333333e-01     CCC
15   912854047966763290       27   6.519203e+16     BBB
16  8876389219764502267        9  5.706631e-178     CCC
17                40004       40   1.500000e+00     CCC
18                  247       77   5.742309e+16     BBB
19                47285       17   1.175494e-38     AAA

cosmicBboy commented 1 month ago

Okay, so it seems like generating smaller dataframes yields higher entropy results:

print(schema.example(size=5))

# generates different datasets
               column1  column2        column3 column4
0                  152        1   9.007199e+15     BBB
1  9223372036854775807        1   1.192093e-07     CCC
2  4148323564460896226       56   6.189641e+16     BBB
3                  123       83   6.103516e-05     CCC
4                32240        2  1.112537e-308     BBB

print(schema.example(size=10))

# we see this consistently
   column1  column2  column3 column4
0    31078        1      0.0     AAA
1        0        1      0.0     AAA
2        0        1      0.0     AAA
3        0        1      0.0     AAA
4        0        1      0.0     AAA
5        0        1      0.0     AAA
6        0        1      0.0     AAA
7        0        1      0.0     AAA
8        0        1      0.0     AAA
9        0        1      0.0     AAA

@tmcclintock my recommendations for now would be: combine multiple checks into a single check where possible (e.g. Check.in_range instead of separate ge and le checks), and generate smaller dataframes.

@Zac-HD any ideas on how to address this? On the pandera side, it would make sense to collect all the schema statistics and combine them all into a single element strategy so we don't have to rely on filter, but that'll require a larger refactoring project.
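
The core of the problem can be illustrated without pandera or hypothesis at all: chaining multiple checks as filters amounts to rejection sampling over the full value space, while a single combined strategy pushes the bounds into generation itself. A minimal pure-Python sketch of that difference (the function names here are made up for illustration):

```python
import random

def filtered_draw(rng, lo, hi, attempts=100):
    # Rejection sampling: draw from a huge range, then filter afterwards --
    # analogous to chaining ge/le checks as .filter() calls on a strategy.
    for _ in range(attempts):
        x = rng.randrange(-10**9, 10**9)
        if lo <= x <= hi:
            return x
    return None  # the filter almost never succeeds for a narrow range

def combined_draw(rng, lo, hi):
    # Combined strategy: the bounds are part of generation itself --
    # analogous to a single in_range check.
    return rng.randrange(lo, hi + 1)

rng = random.Random(0)
print([filtered_draw(rng, 1, 100) for _ in range(5)])  # typically all None
print([combined_draw(rng, 1, 100) for _ in range(5)])  # always in [1, 100]
```

When rejection fails this often, a generator has to fall back to whatever trivial values it can produce, which is consistent with the low-entropy output above.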

Zac-HD commented 1 month ago

1. Check whether you see more diverse outputs if you actually run the test. Strategies' .example() method often biases towards simpler outputs (for complicated internal reasons), and dataframes are typically 'sparse' as well, so you might get a fill-value and then few or no other values.
2. Eventually you're going to have to do that project, yeah. The filter-rewriting should be able to handle this case though, so I suspect there's a simpler fix for this specific issue somewhere in Pandera.
tmcclintock commented 1 week ago

Thanks, both. Feel free to close this issue if you feel like it. FWIW, IMO the .example() API of pandera is one of its strongest features. I'd love for it to be performant one day!

cosmicBboy commented 1 week ago

It might make sense to bring back the warning that hypothesis raises with example. It's really meant more for interactively debugging and examining strategies, and not for any serious production context. The intended use of it really is as demonstrated here https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html#usage-in-unit-tests.

tmcclintock commented 1 week ago

Yup, that's what I use it for. Pandera lets me mock entire machine learning pipelines. The issue is, usually I want more than 5 rows of mock data :).

cosmicBboy commented 1 week ago

Closing this issue. @tmcclintock, FYI I created https://github.com/unionai-oss/pandera/issues/1625 to articulate what would be needed to improve the performance of pandera strategies overall.