Closed mattharrison closed 2 years ago
hey @mattharrison, is there a particular reason you're generating more and more examples (.example(i)) as you include more and more columns? hypothesis is doing all the heavy lifting generating the dataframes, and the more examples it has to generate the more time it needs. I believe one can increase the deadline setting, which is basically a timeout for generating examples, to give it more time to generate examples: https://hypothesis.readthedocs.io/en/latest/settings.html#hypothesis.settings.deadline
It'd also be worth documenting the recommendation that generating more than 50 rows of data is a lot for pandera/hypothesis to handle. The purpose of this synthetic data is unit testing, which typically won't involve large datasets.
Good catch. I changed the i to a 1:
import pandas as pd
import pandera as pa
import time
url = 'https://github.com/mattharrison/datasets/blob/master/data/ames-housing-dataset.zip?raw=true'
ames = pd.read_csv(url, compression='zip')
s = pa.infer_schema(ames)
for i in range(80, 81):
    start = time.time()
    s.select_columns(list(s.columns.keys())[:i]).example(1)
    print(f'{i} took {time.time()-start} seconds {list(s.columns.keys())[:i]}')
I also changed for i in range(1, 80): to for i in range(80, 81): and it failed to generate an example:
------------------------------------------------------------------
Unsatisfiable Traceback (most recent call last)
<ipython-input-106-b6da2b8d41e0> in <module>
8 for i in range(80, 81):
9 start = time.time()
---> 10 s.select_columns(list(s.columns.keys())[:i]).example(1)
11 print(f'{i} took {time.time()-start} seconds {list(s.columns.keys())[:i]}')
~/envs/menv/lib/python3.8/site-packages/pandera/schemas.py in example(self, size, n_regex_columns)
759 category=hypothesis.errors.NonInteractiveExampleWarning,
760 )
--> 761 return self.strategy(
762 size=size, n_regex_columns=n_regex_columns
763 ).example()
~/envs/menv/lib/python3.8/site-packages/hypothesis/strategies/_internal/strategies.py in example(self)
322
323 examples: List[Ex] = []
--> 324 example_generating_inner_function()
325 return random_choice(examples)
326
~/envs/menv/lib/python3.8/site-packages/hypothesis/strategies/_internal/strategies.py in example_generating_inner_function()
310 # tracebacks, and we want users to know that they can ignore it.
311 @given(self)
--> 312 @settings(
313 database=None,
314 max_examples=10,
[... skipping hidden 2 frame]
~/envs/menv/lib/python3.8/site-packages/hypothesis/core.py in run_engine(self)
770 else:
771 if runner.valid_examples == 0:
--> 772 raise Unsatisfiable(
773 "Unable to satisfy assumptions of hypothesis %s."
774 % (get_pretty_function_description(self.test),)
Unsatisfiable: Unable to satisfy assumptions of hypothesis example_generating_inner_function.
I also tried inferring from a sample (100 rows) of the data and got a different error:
import pandas as pd
import pandera as pa
import time
url = 'https://github.com/mattharrison/datasets/blob/master/data/ames-housing-dataset.zip?raw=true'
ames = pd.read_csv(url, compression='zip')
s = pa.infer_schema(ames.sample(100, random_state=42))
#for i in range(1, 20):
for i in range(80, 81):
    start = time.time()
    s.select_columns(list(s.columns.keys())[:i]).example(1)
    print(f'{i} took {time.time()-start} seconds {list(s.columns.keys())[:i]}')
------------------------------------------------------------------
TypeError Traceback (most recent call last)
~/envs/menv/lib/python3.8/site-packages/pandera/engines/pandas_engine.py in dtype(cls, data_type)
122 try:
--> 123 return engine.Engine.dtype(cls, data_type)
124 except TypeError:
~/envs/menv/lib/python3.8/site-packages/pandera/engines/engine.py in dtype(cls, data_type)
210 except (KeyError, ValueError):
--> 211 raise TypeError(
212 f"Data type '{data_type}' not understood by {cls.__name__}."
TypeError: Data type 'empty' not understood by Engine.
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-110-9c448b1fadcb> in <module>
3 url = 'https://github.com/mattharrison/datasets/blob/master/data/ames-housing-dataset.zip?raw=true'
4 ames = pd.read_csv(url, compression='zip')
----> 5 s = pa.infer_schema(ames.sample(100, random_state=42))
6 import time
7 #for i in range(1, 20):
~/envs/menv/lib/python3.8/site-packages/pandera/schema_inference.py in infer_schema(pandas_obj)
24 """
25 if isinstance(pandas_obj, pd.DataFrame):
---> 26 return infer_dataframe_schema(pandas_obj)
27 elif isinstance(pandas_obj, pd.Series):
28 return infer_series_schema(pandas_obj)
~/envs/menv/lib/python3.8/site-packages/pandera/schema_inference.py in infer_dataframe_schema(df)
58 :returns: DataFrameSchema
59 """
---> 60 df_statistics = infer_dataframe_statistics(df)
61 schema = DataFrameSchema(
62 columns={
~/envs/menv/lib/python3.8/site-packages/pandera/schema_statistics.py in infer_dataframe_statistics(df)
13 """Infer column and index statistics from a pandas DataFrame."""
14 nullable_columns = df.isna().any()
---> 15 inferred_column_dtypes = {col: _get_array_type(df[col]) for col in df}
16 column_statistics = {
17 col: {
~/envs/menv/lib/python3.8/site-packages/pandera/schema_statistics.py in <dictcomp>(.0)
13 """Infer column and index statistics from a pandas DataFrame."""
14 nullable_columns = df.isna().any()
---> 15 inferred_column_dtypes = {col: _get_array_type(df[col]) for col in df}
16 column_statistics = {
17 col: {
~/envs/menv/lib/python3.8/site-packages/pandera/schema_statistics.py in _get_array_type(x)
182 inferred_alias = pd.api.types.infer_dtype(x, skipna=True)
183 if inferred_alias != "string":
--> 184 data_type = pandas_engine.Engine.dtype(inferred_alias)
185 return data_type
186
~/envs/menv/lib/python3.8/site-packages/pandera/engines/pandas_engine.py in dtype(cls, data_type)
139 # let pandas transform any acceptable value
140 # into a numpy or pandas dtype.
--> 141 np_or_pd_dtype = pd.api.types.pandas_dtype(data_type)
142 if isinstance(np_or_pd_dtype, np.dtype):
143 np_or_pd_dtype = np_or_pd_dtype.type
~/envs/menv/lib/python3.8/site-packages/pandas/core/dtypes/common.py in pandas_dtype(dtype)
1779 # raise a consistent TypeError if failed
1780 try:
-> 1781 npdtype = np.dtype(dtype)
1782 except SyntaxError as err:
1783 # np.dtype uses `eval` which can raise SyntaxError
TypeError: data type 'empty' not understood
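The traceback shows `pd.api.types.infer_dtype(x, skipna=True)` returning 'empty', which pandera then fails to map to a dtype. A likely cause, though this is an assumption and not confirmed in the thread, is that the 100-row sample contains an object column whose sampled values are all missing. A small sketch with a made-up dataframe (the column names are hypothetical) reproduces that label:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a 100-row sample: "pool_qc" is an object column
# whose sampled rows all happen to be missing.
sample = pd.DataFrame(
    {
        "lot_area": [8450, 9600, 11250],
        "pool_qc": pd.Series([np.nan, np.nan, np.nan], dtype=object),
    }
)

# Same call that appears in the traceback, applied column by column.
inferred = {
    col: pd.api.types.infer_dtype(sample[col], skipna=True) for col in sample
}
print(inferred)

# Possible workaround (an assumption, not the fix from the linked PR):
# drop columns with no observed values before inferring a schema.
cleaned = sample.dropna(axis=1, how="all")
print(list(cleaned.columns))
```

Dropping all-missing columns changes the schema that gets inferred, so it only makes sense when those columns are not needed in the generated data.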
okay, I identified the proximal issue here: https://github.com/unionai-oss/pandera/pull/989
Generating an example on the entire schema works up to 16-ish examples (it craps out at 32):
import pandas as pd
import pandera as pa
import time
from datetime import timedelta
from hypothesis import settings
url = 'https://github.com/mattharrison/datasets/blob/master/data/ames-housing-dataset.zip?raw=true'
ames = pd.read_csv(url, compression='zip')
s = pa.infer_schema(ames)
for i in [0, 1, 2, 4, 8, 16, 32]:
    start = time.time()
    s.example(i)
    print(f'{i} examples took {time.time()-start} seconds')
Output:
0 examples took 0.049291133880615234 seconds
1 examples took 42.040018796920776 seconds
2 examples took 42.64928913116455 seconds
4 examples took 42.541467905044556 seconds
8 examples took 40.94924283027649 seconds
16 examples took 41.02593684196472 seconds
Traceback (most recent call last):
File "/Users/nielsbantilan/git/pandera/foo.py", line 15, in <module>
s.example(i)
File "/Users/nielsbantilan/git/pandera/pandera/schemas.py", line 945, in example
return self.strategy(
File "/Users/nielsbantilan/miniconda3/envs/pandera/lib/python3.9/site-packages/hypothesis/strategies/_internal/strategies.py", line 335, in example
example_generating_inner_function()
File "/Users/nielsbantilan/miniconda3/envs/pandera/lib/python3.9/site-packages/hypothesis/strategies/_internal/strategies.py", line 324, in example_generating_inner_function
@settings(
File "/Users/nielsbantilan/miniconda3/envs/pandera/lib/python3.9/site-packages/hypothesis/core.py", line 1235, in wrapped_test
raise the_error_hypothesis_found
File "/Users/nielsbantilan/miniconda3/envs/pandera/lib/python3.9/site-packages/hypothesis/core.py", line 815, in run_engine
raise Unsatisfiable(f"Unable to satisfy assumptions of {rep}")
hypothesis.errors.Unsatisfiable: Unable to satisfy assumptions of example_generating_inner_function
Cutting a bugfix release 0.13.4 next week, this should be included in there!
Ok, trying this again with another dataset and running into issues.
This fails. Note I'm not even creating all of the columns (though I would like to), just two floating point columns.
Code:
import pandas as pd
import pandera as pa
raw = pd.read_csv('https://github.com/mattharrison/datasets/raw/master/data/alta-noaa-1980-2019.csv',
                  parse_dates=['DATE'])
pa.infer_schema(raw.iloc[:,3:4]).example(size=5)
Should I open another bug?
Describe the bug
I'm inferring the schema from a CSV with 83 columns. When I try to generate an example it fails.
Unsatisfiable: Unable to satisfy assumptions of hypothesis example_generating_inner_function.
Code Sample, a copy-pastable example
Expected behavior
I would expect this to generate an example. I made a simple script to measure timing when adding columns (of int, str, and float) and it works with 80 columns: