Inferred schema fails to generate example

mattharrison commented 2 years ago

Describe the bug

I'm inferring the schema from a CSV with 83 columns. When I try to generate an example, it fails with:

Unsatisfiable: Unable to satisfy assumptions of hypothesis example_generating_inner_function.

Code Sample, a copy-pastable example

import pandas as pd
import pandera as pa
import time

url = 'https://github.com/mattharrison/datasets/blob/master/data/ames-housing-dataset.zip?raw=true'
ames = pd.read_csv(url, compression='zip')
s = pa.infer_schema(ames)

for i in range(1, 80):
    start = time.time()
    s.select_columns(list(s.columns.keys())[:i]).example(i)
    print(f'{i} took {time.time()-start} seconds {list(s.columns.keys())[:i]}')

Expected behavior

I would expect this to generate an example. I made a simple script to measure timing when adding columns (of int, str, and float) and it works with 80 columns:

for i in range(80):
    # Build a schema with i columns, cycling through int, float, and str dtypes.
    cols = {}
    for y in range(i):
        if y % 3 == 0:
            cols[f'col{y}'] = pa.Column(int)
        elif y % 3 == 1:
            cols[f'col{y}'] = pa.Column(float)
        else:
            cols[f'col{y}'] = pa.Column(str)
    schema = pa.DataFrameSchema(cols)
    start = time.time()
    schema.example()
    print(f'{i} took {time.time()-start} seconds')

Desktop (please complete the following information):

OS: WSL2 Ubuntu 20.04
Python Version: 3.8
Pandera Version: 0.7.0

cosmicBboy commented 2 years ago

Hey @mattharrison, is there a particular reason you're generating more and more examples with .example(i) as you include more columns?

hypothesis is doing all the heavy lifting of generating the dataframes, and the more examples it has to generate, the more time it needs. I believe you can increase the deadline setting, which is basically a timeout for generating examples, to give it more time: https://hypothesis.readthedocs.io/en/latest/settings.html#hypothesis.settings.deadline

It'd also be worth documenting that generating more than 50 rows of data is a lot for pandera/hypothesis to handle. The purpose of this synthetic data is unit testing, which typically doesn't involve large datasets.

mattharrison commented 2 years ago

Good catch. I changed the i in .example(i) to a 1:

import pandas as pd
import pandera as pa
import time

url = 'https://github.com/mattharrison/datasets/blob/master/data/ames-housing-dataset.zip?raw=true'
ames = pd.read_csv(url, compression='zip')
s = pa.infer_schema(ames)

for i in range(80, 81):
    start = time.time()
    s.select_columns(list(s.columns.keys())[:i]).example(1)
    print(f'{i} took {time.time()-start} seconds {list(s.columns.keys())[:i]}')

I also changed

for i in range(1, 80):

to

for i in range(80, 81):

And it failed to generate an example:

------------------------------------------------------------------
Unsatisfiable                    Traceback (most recent call last)
<ipython-input-106-b6da2b8d41e0> in <module>
      8 for i in range(80, 81):
      9     start = time.time()
---> 10     s.select_columns(list(s.columns.keys())[:i]).example(1)
     11     print(f'{i} took {time.time()-start} seconds {list(s.columns.keys())[:i]}')

~/envs/menv/lib/python3.8/site-packages/pandera/schemas.py in example(self, size, n_regex_columns)
    759                 category=hypothesis.errors.NonInteractiveExampleWarning,
    760             )
--> 761             return self.strategy(
    762                 size=size, n_regex_columns=n_regex_columns
    763             ).example()

~/envs/menv/lib/python3.8/site-packages/hypothesis/strategies/_internal/strategies.py in example(self)
    322 
    323         examples: List[Ex] = []
--> 324         example_generating_inner_function()
    325         return random_choice(examples)
    326 

~/envs/menv/lib/python3.8/site-packages/hypothesis/strategies/_internal/strategies.py in example_generating_inner_function()
    310         # tracebacks, and we want users to know that they can ignore it.
    311         @given(self)
--> 312         @settings(
    313             database=None,
    314             max_examples=10,

    [... skipping hidden 2 frame]

~/envs/menv/lib/python3.8/site-packages/hypothesis/core.py in run_engine(self)
    770         else:
    771             if runner.valid_examples == 0:
--> 772                 raise Unsatisfiable(
    773                     "Unable to satisfy assumptions of hypothesis %s."
    774                     % (get_pretty_function_description(self.test),)

Unsatisfiable: Unable to satisfy assumptions of hypothesis example_generating_inner_function.

mattharrison commented 2 years ago

I also tried inferring from a sample (100 rows) of the data and got a different error:

import pandas as pd
import pandera as pa
import time

url = 'https://github.com/mattharrison/datasets/blob/master/data/ames-housing-dataset.zip?raw=true'
ames = pd.read_csv(url, compression='zip')
s = pa.infer_schema(ames.sample(100, random_state=42))

#for i in range(1, 20):
for i in range(80, 81):
    start = time.time()
    s.select_columns(list(s.columns.keys())[:i]).example(1)
    print(f'{i} took {time.time()-start} seconds {list(s.columns.keys())[:i]}')

------------------------------------------------------------------
TypeError                        Traceback (most recent call last)
~/envs/menv/lib/python3.8/site-packages/pandera/engines/pandas_engine.py in dtype(cls, data_type)
    122         try:
--> 123             return engine.Engine.dtype(cls, data_type)
    124         except TypeError:

~/envs/menv/lib/python3.8/site-packages/pandera/engines/engine.py in dtype(cls, data_type)
    210         except (KeyError, ValueError):
--> 211             raise TypeError(
    212                 f"Data type '{data_type}' not understood by {cls.__name__}."

TypeError: Data type 'empty' not understood by Engine.

During handling of the above exception, another exception occurred:

TypeError                        Traceback (most recent call last)
<ipython-input-110-9c448b1fadcb> in <module>
      3 url = 'https://github.com/mattharrison/datasets/blob/master/data/ames-housing-dataset.zip?raw=true'
      4 ames = pd.read_csv(url, compression='zip')
----> 5 s = pa.infer_schema(ames.sample(100, random_state=42))
      6 import time
      7 #for i in range(1, 20):

~/envs/menv/lib/python3.8/site-packages/pandera/schema_inference.py in infer_schema(pandas_obj)
     24     """
     25     if isinstance(pandas_obj, pd.DataFrame):
---> 26         return infer_dataframe_schema(pandas_obj)
     27     elif isinstance(pandas_obj, pd.Series):
     28         return infer_series_schema(pandas_obj)

~/envs/menv/lib/python3.8/site-packages/pandera/schema_inference.py in infer_dataframe_schema(df)
     58     :returns: DataFrameSchema
     59     """
---> 60     df_statistics = infer_dataframe_statistics(df)
     61     schema = DataFrameSchema(
     62         columns={

~/envs/menv/lib/python3.8/site-packages/pandera/schema_statistics.py in infer_dataframe_statistics(df)
     13     """Infer column and index statistics from a pandas DataFrame."""
     14     nullable_columns = df.isna().any()
---> 15     inferred_column_dtypes = {col: _get_array_type(df[col]) for col in df}
     16     column_statistics = {
     17         col: {

~/envs/menv/lib/python3.8/site-packages/pandera/schema_statistics.py in <dictcomp>(.0)
     13     """Infer column and index statistics from a pandas DataFrame."""
     14     nullable_columns = df.isna().any()
---> 15     inferred_column_dtypes = {col: _get_array_type(df[col]) for col in df}
     16     column_statistics = {
     17         col: {

~/envs/menv/lib/python3.8/site-packages/pandera/schema_statistics.py in _get_array_type(x)
    182         inferred_alias = pd.api.types.infer_dtype(x, skipna=True)
    183         if inferred_alias != "string":
--> 184             data_type = pandas_engine.Engine.dtype(inferred_alias)
    185     return data_type
    186 

~/envs/menv/lib/python3.8/site-packages/pandera/engines/pandas_engine.py in dtype(cls, data_type)
    139                 # let pandas transform any acceptable value
    140                 # into a numpy or pandas dtype.
--> 141                 np_or_pd_dtype = pd.api.types.pandas_dtype(data_type)
    142                 if isinstance(np_or_pd_dtype, np.dtype):
    143                     np_or_pd_dtype = np_or_pd_dtype.type

~/envs/menv/lib/python3.8/site-packages/pandas/core/dtypes/common.py in pandas_dtype(dtype)
   1779     # raise a consistent TypeError if failed
   1780     try:
-> 1781         npdtype = np.dtype(dtype)
   1782     except SyntaxError as err:
   1783         # np.dtype uses `eval` which can raise SyntaxError

TypeError: data type 'empty' not understood
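
The 'empty' in this error looks like the label that pandas' infer_dtype returns when, with skipna=True, a column has no non-null values left to inspect. That can happen when a 100-row sample misses every value in a sparse column. Below is a minimal sketch of that behavior and one possible workaround (dropping all-null columns before inference). It uses a made-up two-column frame rather than the Ames data, and whether this is the exact cause here is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "sparse" is an object column whose values are all missing.
df = pd.DataFrame({
    "price": [100.0, 200.0, 300.0],
    "sparse": pd.Series([np.nan, np.nan, np.nan], dtype=object),
})

# With skipna=True there is nothing left to inspect, so pandas reports "empty".
label = pd.api.types.infer_dtype(df["sparse"], skipna=True)
print(label)

# Possible workaround: drop columns that are entirely null before inferring a schema.
cleaned = df.dropna(axis=1, how="all")
print(list(cleaned.columns))
```

The cleaned frame could then be passed to pa.infer_schema in place of the raw sample, at the cost of losing the dropped columns from the inferred schema.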

cosmicBboy commented 2 years ago

Okay, I identified the proximal issue here: https://github.com/unionai-oss/pandera/pull/989

Generating an example from the entire schema works for sizes up to 16 or so (it fails at 32):

import pandas as pd
import pandera as pa
import time
from datetime import timedelta

from hypothesis import settings

url = 'https://github.com/mattharrison/datasets/blob/master/data/ames-housing-dataset.zip?raw=true'
ames = pd.read_csv(url, compression='zip')
s = pa.infer_schema(ames)

for i in [0, 1, 2, 4, 8, 16, 32]:
    start = time.time()
    s.example(i)
    print(f'{i} examples took {time.time()-start} seconds')

Output:

0 examples took 0.049291133880615234 seconds
1 examples took 42.040018796920776 seconds
2 examples took 42.64928913116455 seconds
4 examples took 42.541467905044556 seconds
8 examples took 40.94924283027649 seconds
16 examples took 41.02593684196472 seconds
Traceback (most recent call last):
  File "/Users/nielsbantilan/git/pandera/foo.py", line 15, in <module>
    s.example(i)
  File "/Users/nielsbantilan/git/pandera/pandera/schemas.py", line 945, in example
    return self.strategy(
  File "/Users/nielsbantilan/miniconda3/envs/pandera/lib/python3.9/site-packages/hypothesis/strategies/_internal/strategies.py", line 335, in example
    example_generating_inner_function()
  File "/Users/nielsbantilan/miniconda3/envs/pandera/lib/python3.9/site-packages/hypothesis/strategies/_internal/strategies.py", line 324, in example_generating_inner_function
    @settings(
  File "/Users/nielsbantilan/miniconda3/envs/pandera/lib/python3.9/site-packages/hypothesis/core.py", line 1235, in wrapped_test
    raise the_error_hypothesis_found
  File "/Users/nielsbantilan/miniconda3/envs/pandera/lib/python3.9/site-packages/hypothesis/core.py", line 815, in run_engine
    raise Unsatisfiable(f"Unable to satisfy assumptions of {rep}")
hypothesis.errors.Unsatisfiable: Unable to satisfy assumptions of example_generating_inner_function

I'm cutting a bugfix release, 0.13.4, next week, and this should be included in it!

mattharrison commented 1 year ago

Ok, trying this again with another dataset and running into issues.

This fails. Note I'm not even creating all of the columns (though I would like to), just two floating point columns.

Code:

import pandas as pd
import pandera as pa

raw = pd.read_csv('https://github.com/mattharrison/datasets/raw/master/data/alta-noaa-1980-2019.csv',
                  parse_dates=['DATE'])

pa.infer_schema(raw.iloc[:,3:4]).example(size=5)

Should I open another bug?
