unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

pandas Dataframe with 1 row results in AssertionError during call to validate #625

Closed tfwillems closed 3 years ago

tfwillems commented 3 years ago

Describe the bug

When calling validate for a SchemaModel on a pandas DataFrame with only 1 row, an AssertionError is raised. This appears to be triggered when coerce=True is included in the field definition, but it seems like a simple fix, as detailed below.


Code Sample, a copy-pastable example

from pandera.typing import Series
import pandas as pd
import pandera as pa

class Model(pa.SchemaModel):
    x: Series[int] = pa.Field(coerce=True)

Model.validate(pd.DataFrame({"x": [None]}))

# Results in the following error
# ...
# File "/cluster/home/willems/MethodsDev/vcgt/garbnewenv/pandera-fix/lib/python3.9/site-packages/pandera/engines/numpy_engine.py", line 57, in coerce
#    failure_cases=utils.numpy_pandas_coerce_failure_cases(
#  File "/cluster/home/willems/MethodsDev/vcgt/garbnewenv/pandera-fix/lib/python3.9/site-packages/pandera/engines/utils.py", line 88, in numpy_pandas_coerce_failure_cases
#    check_output = numpy_pandas_coercible(data_container, type_)
#  File "/cluster/home/willems/MethodsDev/vcgt/garbnewenv/pandera-fix/lib/python3.9/site-packages/pandera/engines/utils.py", line 33, in numpy_pandas_coercible
#    search_list = _bisect(series)
#  File "/cluster/home/willems/MethodsDev/vcgt/garbnewenv/pandera-fix/lib/python3.9/site-packages/pandera/engines/utils.py", line 20, in _bisect
#    assert (
# AssertionError: cannot bisect a pandas Series of length < 2

Expected behavior

The above validation should fail with a SchemaError instead of an AssertionError

Desktop (please complete the following information):

python 3.9.1, pandas 1.3.1, pandera 0.7.1

Additional context

It seems the numpy_pandas_coercible() function is missing a base case for a series that contains only 1 element on line 33:

# Current call, which triggers the AssertionError because _bisect immediately checks that series.size >= 2
search_list = _bisect(series)

# Proposed fix: treat a single-element series as its own search candidate
search_list = [series.iloc[[0]]] if series.size == 1 else _bisect(series)

# I'm not sure whether a check before line 33 for an empty series is also necessary, in case this function is (mis)called on an empty series
if series.empty:
    return pd.Series(dtype="bool")
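
To show the missing base case in isolation, here is a minimal sketch of a bisection search for uncoercible values. It uses plain Python lists instead of pandas Series, and the names `bisect`, `coercible`, and `failure_indexes` are hypothetical stand-ins, not pandera's actual helpers. It assumes the search works by repeatedly splitting the data in half until each failing element is isolated, which is what the traceback suggests.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for a Series: a sequence of (position, value) pairs.
Item = Tuple[int, object]


def bisect(items: Sequence[Item]) -> List[Sequence[Item]]:
    """Split a chunk into two halves; mirrors the assertion seen in the traceback."""
    assert len(items) >= 2, "cannot bisect a sequence of length < 2"
    mid = len(items) // 2
    return [items[:mid], items[mid:]]


def coercible(items: Sequence[Item], cast: Callable[[object], object]) -> bool:
    """Return True if every value in the chunk can be cast without raising."""
    try:
        for _, value in items:
            cast(value)
        return True
    except (TypeError, ValueError):
        return False


def failure_indexes(
    values: Sequence[object], cast: Callable[[object], object]
) -> List[int]:
    """Find the positions of values that cannot be cast, by repeated bisection."""
    items = list(enumerate(values))
    if not items:
        return []  # empty input: nothing can fail
    # Base case from this issue: a single element is its own search candidate,
    # so bisect() is never called on a chunk of length 1.
    search_list = [items] if len(items) == 1 else bisect(items)
    failures = []
    while search_list:
        candidate = search_list.pop()
        if coercible(candidate, cast):
            continue
        if len(candidate) == 1:
            failures.append(candidate[0][0])
        else:
            search_list.extend(bisect(candidate))
    return sorted(failures)
```

With this guard, `failure_indexes([None], int)` reports position 0 as a failure instead of tripping the assertion, which is the list-based analogue of getting a SchemaError rather than an AssertionError.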

cosmicBboy commented 3 years ago

doh! nice catch

cosmicBboy commented 3 years ago

#626 should address this

tfwillems commented 3 years ago

Great! Feel free to close this whenever #626 is merged