Closed · tmcclintock closed this issue 1 week ago
Looks like this is an issue with the way pandera strategies try to chain together multiple checks, e.g.
```python
schema = DataFrameSchema(
    {
        "column1": Column(int, Check.ge(0)),
        "column2": Column(int, [Check.in_range(1, 100)]),  # 👈 use a single in_range check instead of ge and le
        "column3": Column(float, Check.ge(0)),
        "column4": Column(str, Check.isin(["AAA", "BBB", "CCC"])),
    }
)
```
produces
```
6.100.1 0.0.0+dev0
column1 column2 column3 column4
0 0 1 3.402823e+38 AAA
1 0 1 2.882304e+16 CCC
2 0 6 2.000010e+00 BBB
3 247 47 9.999900e-01 BBB
4 19526 50 1.390036e+164 AAA
5 56223 63 2.225074e-308 AAA
6 42 15 7.357397e+15 BBB
7 97 62 9.999900e-01 CCC
8 0 69 3.293796e+09 AAA
9 9216616637413720064 4 1.000000e+07 AAA
10 23090105669335094 14 5.397605e-78 CCC
11 0 50 1.192093e-07 CCC
12 1260840409 98 1.500000e+00 AAA
13 21966 68 1.100000e+00 AAA
14 23289 21 3.333333e-01 CCC
15 912854047966763290 27 6.519203e+16 BBB
16 8876389219764502267 9 5.706631e-178 CCC
17 40004 40 1.500000e+00 CCC
18 247 77 5.742309e+16 BBB
19 47285 17 1.175494e-38 AAA
```
Okay, so it seems like generating smaller dataframes yields higher-entropy results:
```python
print(schema.example(size=5))
# generates different datasets
```

```
column1 column2 column3 column4
0 152 1 9.007199e+15 BBB
1 9223372036854775807 1 1.192093e-07 CCC
2 4148323564460896226 56 6.189641e+16 BBB
3 123 83 6.103516e-05 CCC
4 32240 2 1.112537e-308 BBB
```
```python
print(schema.example(size=10))
# we see this consistently
```

```
column1 column2 column3 column4
0 31078 1 0.0 AAA
1 0 1 0.0 AAA
2 0 1 0.0 AAA
3 0 1 0.0 AAA
4 0 1 0.0 AAA
5 0 1 0.0 AAA
6 0 1 0.0 AAA
7 0 1 0.0 AAA
8 0 1 0.0 AAA
9 0 1 0.0 AAA
```
@tmcclintock, recommendations would be:
@Zac-HD any ideas on how to address this? On the pandera side, it would make sense to collect all the schema statistics and combine them into a single element strategy so we don't have to rely on `filter`, but that'll require a larger refactoring project.
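The filter-vs-combined distinction can be sketched with plain `hypothesis` strategies (an illustration of the idea only, not pandera's actual internals):

```python
import hypothesis.strategies as st

# Roughly how chained checks behave today: each check becomes a
# predicate applied with .filter(), so hypothesis generates candidates
# and then rejects the failing ones, hurting both speed and variety.
filtered = st.integers().filter(lambda x: x >= 1).filter(lambda x: x <= 100)

# The proposed alternative: fold all the check statistics into one
# constrained strategy up front, so no rejection sampling is needed.
combined = st.integers(min_value=1, max_value=100)
```

Both strategies accept the same values, but `combined` never has to discard a draw.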
The `.example()` method often biases towards simpler values (for complicated internal reasons), and dataframes are typically 'sparse' as well - so you might get a fill-value and then few-or-no other values.

Thanks, both. Feel free to close this issue if you feel like it. FWIW, IMO the `.example()` API of pandera is one of its strongest features. I'd love for it to be performant one day!
It might make sense to bring back the warning that `hypothesis` raises with `example`. It's really meant more for interactively debugging and examining strategies, and not for any serious production context. The intended use of it really is as demonstrated here: https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html#usage-in-unit-tests
Yup, that's what I use it for. Pandera lets me mock entire machine learning pipelines. The issue is, usually I want more than 5 rows of mock data :).
Closing this issue. @tmcclintock FYI, I created https://github.com/unionai-oss/pandera/issues/1625 to articulate what would be needed to improve the performance of pandera strategies overall.
**Describe the bug**

Calling `schema.example()` generates `size` identical rows. This is not desirable, since the whole purpose of `pandera` + `hypothesis` is to create rich examples with a lot of variety in the rows.

**Code Sample, a copy-pastable example**

yields

Note that both the pandera and hypothesis versions are the latest.

**Expected behavior**

All 20 rows look very different.

**Desktop (please complete the following information):**

`python --version --version` yields

**Additional context**

I suspected this is related to #1503, but after doing `pip install pandera==0.18.0 hypothesis` the issue still persisted. Same thing when I dropped down to pandera 0.17.2.

Also, the same behavior occurs when using `DataFrameModel`s.

The only time I can get high-entropy rows is when `checks` is a `Check` and not a `list[Check]`, like this:

The output of this is as expected:

However, this is wayyy less useful than having all the checks apply.