**Describe the bug**
A clear and concise description of what the bug is.
- [ ] I have checked that this issue has not already been reported. There are other issues related to `str_length` reported, but all in the context of pyspark. The issue I'm encountering is just for pandas.
- [x] I have confirmed this bug exists on the latest version of pandera.
- [ ] (optional) I have confirmed this bug exists on the master branch of pandera.
Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
```python
import pandas as pd
import pytest
from pandera import Column, DataFrameSchema
from pandera.api.checks import Check
from pandera.errors import SchemaError

# We wish to validate data frames with one column (x) and check that each value's length >= 2
pass_df = pd.DataFrame({"x": ["ab", "acde", "afg"], "y": [5, 6, 7]})  # Should pass validation
fail_df = pd.DataFrame({"x": ["ab", "abc", "d"]})  # Should fail validation

schema = DataFrameSchema(
    columns={
        "x": Column(str, checks=[Check.str_length(min_value=2)]),
    },
)

# Either of these will raise the exception:
# multimethod.DispatchError: ('str_length: 0 methods found', (<class 'pandas.core.series.Series'>, <class 'int'>, <class 'NoneType'>), [])
# This indicates a failure to identify the correct check method, likely because of the default None value for max_value
schema.validate(pass_df)
schema.validate(fail_df)

# The code below works because both min_value and max_value are set and dispatch succeeds
schema = DataFrameSchema(
    columns={
        "x": Column(str, checks=[Check.str_length(min_value=2, max_value=100)]),
    }
)
schema.validate(pass_df)  # This works as expected, because max_value is specified
with pytest.raises(SchemaError):  # This correctly triggers a schema error
    schema.validate(fail_df)
```
**Expected behavior**
I expected the schema validation to either pass or raise a `SchemaError`, instead of the multimethod `DispatchError`.
**Desktop (please complete the following information):**
- OS: macOS Monterey
- Browser: Chrome
**Additional context**
I was able to troubleshoot and fix the issue by changing the type annotations for the `str_length` method in `pandera.api.checks`, `pandera.backends.base.builtin_checks`, and `pandera/backends/pandas/builtin_checks`. In all three places, `min_value` and `max_value` are annotated as `int`, but should be annotated as `Optional[int]` given the default value of `None`. Making these changes enabled multimethod to correctly identify the check function when no value was specified for `min_value` or `max_value`.

I will try to submit an MR if time allows in the next few weeks, but I currently don't have the bandwidth to do so. Thanks for maintaining such a great library!
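To illustrate why the annotation change matters, here is a minimal self-contained sketch (not pandera's or multimethod's actual code) of annotation-based dispatch: a `None` argument can never satisfy a bare `int` annotation, but it does satisfy `Optional[int]`, which is why the stricter annotations caused the "0 methods found" error.

```python
# Hypothetical sketch of annotation matching, assuming dispatch checks
# each argument against the registered parameter annotations.
from typing import Optional, Union, get_args, get_origin

def matches(value, annotation) -> bool:
    """Return True if `value` is acceptable for `annotation`."""
    if get_origin(annotation) is Union:  # Optional[int] is Union[int, None]
        return any(matches(value, arg) for arg in get_args(annotation))
    if annotation is type(None):
        return value is None
    return isinstance(value, annotation)

# Buggy annotation: None does not match `int`, so no method is found
assert not matches(None, int)
# Fixed annotation: None matches Optional[int], so dispatch can succeed
assert matches(None, Optional[int])
assert matches(2, Optional[int])
```

This mirrors the observed dispatch signature `(Series, int, NoneType)`: the `NoneType` argument fails to match an `int`-annotated parameter, leaving zero candidate methods.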
I am getting this same `DispatchError` for `str_length` with pandas after installing pandera 0.18.0, in a situation where `min_value=1` and `max_value` is None.