unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Custom check fails with `pl.DataFrame` #1565

Closed: philiporlando closed this issue 1 month ago

philiporlando commented 1 month ago

Code Sample, a copy-pastable example

I'm seeing an error when validating a polars DataFrame against a schema that combines a built-in check with a custom element-wise check: pandera.errors.SchemaError: ComputeError("custom python function failed: cannot unpack series of type `str` into `bool`")

import polars as pl
import pandera.polars as pa

# Custom check function
def check_len(v: str) -> bool:
    return len(v) <= 20

schema = pa.DataFrameSchema(
    {
        "phone": pa.Column(
            dtype=str,
            checks=pa.Check.str_matches(r"\(\d{3}\) \d{3}-\d{4}"),
        ),
        "fruit": pa.Column(
            dtype=str,
            checks=pa.Check(check_len, element_wise=True),
        ),
    }
)

df = pl.DataFrame(
    {
        "phone": ["(123) 234-2342", "(213) 234-2345", "(213) 234-2345"],
        "fruit": ["apple", "pear", "banana"],
    }
)

df.pipe(schema.validate)
# C:\local\.venv\Lib\site-packages\pandera\backends\polars\base.py:74: MapWithoutReturnDtypeWarning: Calling `map_elements` without specifying `return_dtype` can lead to unpredictable results. Specify `return_dtype` to silence this warning.
#   passed = check_result.check_passed.collect().item()
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "C:\local\.venv\Lib\site-packages\polars\dataframe\frame.py", line 5150, in pipe
#     return function(self, *args, **kwargs)
#            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "C:\local\.venv\Lib\site-packages\pandera\api\polars\container.py", line 58, in validate
#     output = self.get_backend(check_obj).validate(
#              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "C:\local\.venv\Lib\site-packages\pandera\backends\polars\container.py", line 114, in validate
#     error_handler.collect_error(
#   File "C:\local\.venv\Lib\site-packages\pandera\api\base\error_handler.py", line 54, in collect_error
#     raise schema_error from original_exc
#   File "C:\local\.venv\Lib\site-packages\pandera\backends\polars\container.py", line 182, in run_schema_component_checks
#     result = schema_component.validate(check_obj, lazy=lazy)
#              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "C:\local\.venv\Lib\site-packages\pandera\api\polars\components.py", line 141, in validate
#     output = self.get_backend(check_obj).validate(
#              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "C:\local\.venv\Lib\site-packages\pandera\backends\polars\components.py", line 81, in validate
#     error_handler = self.run_checks_and_handle_errors(
#                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "C:\local\.venv\Lib\site-packages\pandera\backends\polars\components.py", line 147, in run_checks_and_handle_errors
#     error_handler.collect_error(
#   File "C:\local\.venv\Lib\site-packages\pandera\api\base\error_handler.py", line 54, in collect_error
#     raise schema_error from original_exc
# pandera.errors.SchemaError: ComputeError("custom python function failed: cannot unpack series of type `str` into `bool`")

Removing the phone check avoids this error, but validation then emits a warning about a missing `return_dtype`:

import polars as pl
import pandera.polars as pa

# Custom check function
def check_len(v: str) -> bool:
    return len(v) <= 20

schema = pa.DataFrameSchema(
    {
        "fruit": pa.Column(
            dtype=str,
            checks=pa.Check(check_len, element_wise=True),
        ),
    }
)

df = pl.DataFrame(
    {
        "fruit": ["apple", "pear", "banana"],
    }
)

df.pipe(schema.validate)
# C:\local\.venv\Lib\site-packages\pandera\backends\polars\base.py:74: MapWithoutReturnDtypeWarning: Calling `map_elements` without specifying `return_dtype` can lead to unpredictable results. Specify `return_dtype` to silence this warning.
#   passed = check_result.check_passed.collect().item()
# C:\local\.venv\Lib\site-packages\pandera\backends\polars\base.py:112: MapWithoutReturnDtypeWarning: Calling `map_elements` without specifying `return_dtype` can lead to unpredictable results. Specify `return_dtype` to silence this warning.
#   check_output=check_result.check_output.collect(),
# shape: (3, 1)
# ┌────────┐
# │ fruit  │
# │ ---    │
# │ str    │
# ╞════════╡
# │ apple  │
# │ pear   │
# │ banana │
# └────────┘

Expected behavior

Schema validation should run successfully for this DataFrame when custom checks are included alongside built-in check methods.

philiporlando commented 1 month ago

~It's difficult to trust that custom checks are actually working with the LazyFrame in this reprex after discovering #1566.~

Edit: By design, validating a LazyFrame does not run data-level checks.

cosmicBboy commented 1 month ago

@philiporlando see #1572

philiporlando commented 1 month ago

@cosmicBboy, many thanks for resolving this so quickly!