unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

str_length check raises a DispatchError exception if min_value or max_value is not set #1406

Open tfwillems opened 11 months ago

tfwillems commented 11 months ago

Describe the bug

import pandas as pd
import pytest
from pandera import Column, DataFrameSchema
from pandera.api.checks import Check
from pandera.errors import SchemaError

# We wish to validate column x of each data frame and check that each value's length is >= 2
pass_df = pd.DataFrame({"x": ["ab", "acde", "afg"], "y": [5, 6, 7]}) # Should pass validation
fail_df = pd.DataFrame({"x": ["ab", "abc", "d"]}) # Should fail validation

schema = DataFrameSchema(
    columns={
        "x": Column(str, checks=[Check.str_length(min_value=2)]),
    },
)

# Either of these will raise the exception:
# multimethod.DispatchError: ('str_length: 0 methods found', (<class 'pandas.core.series.Series'>, <class 'int'>, <class 'NoneType'>), [])
# This indicates a failure to identify the correct check method, likely because max_value defaults to None
schema.validate(pass_df)
schema.validate(fail_df)

# The code below works because both min_value and max_value are set and dispatch succeeds
schema = DataFrameSchema(
    columns={
        "x": Column(str, checks=[Check.str_length(min_value=2, max_value=100)]),
    }
)
schema.validate(pass_df)  # This works as expected, because max_value is specified

with pytest.raises(SchemaError):  # This correctly triggers a schema error
    schema.validate(fail_df)

Expected behavior

I expected the schema validation to either pass or raise a SchemaError, instead of the multimethod DispatchError.

Additional context

I was able to troubleshoot and fix the issue by changing the type annotations for the str_length method in pandera.api.checks, pandera.backends.base.builtin_checks and pandera.backends.pandas.builtin_checks.

In all three places, min_value and max_value are annotated as int, but they should be annotated as Optional[int] because both default to None. With that change, multimethod correctly identifies the check function when min_value or max_value is left unset.
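
To illustrate why the annotation matters, here is a minimal, standard-library-only sketch of annotation-based matching. It is not multimethod's actual implementation; `matches`, `accepts`, `str_length_strict` and `str_length_optional` are hypothetical names made up for this example. It only shows that None is not an instance of int, so a signature annotated with a plain int cannot match a call where max_value is None, while Optional[int] can.

```python
import inspect
from typing import Optional, Union, get_args, get_origin


def matches(value, annotation) -> bool:
    """Return True if value satisfies a plain class or an Optional/Union annotation."""
    if get_origin(annotation) is Union:
        # Optional[int] is Union[int, None]; any member may match.
        return any(matches(value, arg) for arg in get_args(annotation))
    return isinstance(value, annotation)


def accepts(func, *args) -> bool:
    """Toy stand-in for a dispatcher: check positional args against raw annotations."""
    params = inspect.signature(func).parameters.values()
    return all(
        p.annotation is inspect.Parameter.empty or matches(arg, p.annotation)
        for p, arg in zip(params, args)
    )


# Hypothetical signatures mirroring the annotations described above.
def str_length_strict(data: list, min_value: int = None, max_value: int = None):
    ...


def str_length_optional(
    data: list,
    min_value: Optional[int] = None,
    max_value: Optional[int] = None,
):
    ...


# Same argument types as in the reported error: (data, int, NoneType).
print(accepts(str_length_strict, ["ab"], 2, None))    # no match for max_value=None
print(accepts(str_length_optional, ["ab"], 2, None))  # Optional[int] matches None
```

In this sketch the strict signature only matches once both bounds are ints, which lines up with the behaviour reported above.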

I will try to submit an MR if time allows in the next few weeks, but I currently don't have the bandwidth to do so. Thanks for maintaining such a great library!

dwinski commented 6 months ago

I am getting this same DispatchError for str_length with pandas after installing pandera 0.18.0, in a situation where min_value=1 and max_value is null.
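
A possible stopgap, going by the report above that dispatch succeeds when both bounds are ints, is to always pass both values explicitly. The helper below is a hypothetical sketch (the name `explicit_bounds` and the choice of 0 and sys.maxsize as permissive defaults are assumptions, not part of pandera), and it has only been reasoned about, not tested against every pandera version.

```python
import sys


def explicit_bounds(min_value=None, max_value=None) -> dict:
    """Replace a missing bound with a permissive int so both arguments are ints."""
    return {
        "min_value": 0 if min_value is None else min_value,
        "max_value": sys.maxsize if max_value is None else max_value,
    }


# Hypothetical usage with pandera (not executed here):
# Column(str, checks=[Check.str_length(**explicit_bounds(min_value=1))])
print(explicit_bounds(min_value=1))
```

Explicitly supplied bounds are passed through unchanged, so this only affects the case where one side was left unset.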