unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Custom check erroneously passes when validating `pl.LazyFrame` #1566

Closed philiporlando closed 1 month ago

philiporlando commented 1 month ago

Code Sample, a copy-pastable example

I've created a custom check function that should never return True based on my sample data. However, pandera does not raise an error when validating the fruit column. This may be related to #1565.

import polars as pl
import pandera.polars as pa

# Custom check function
def check_len(v: str) -> bool:
    return len(v) == 20

schema = pa.DataFrameSchema(
    {
        "fruit": pa.Column(
            dtype=str,
            checks=pa.Check(check_len, element_wise=True),
        ),
    }
)

lf = pl.LazyFrame(
    {
        "fruit": ["apple", "pear", "banana"],
    }
)

lf.pipe(schema.validate).collect()
# shape: (3, 1)
# ┌────────┐
# │ fruit  │
# │ ---    │
# │ str    │
# ╞════════╡
# │ apple  │
# │ pear   │
# │ banana │
# └────────┘

Converting from LazyFrame to DataFrame before performing the schema validation appears to raise the expected error:

import polars as pl
import pandera.polars as pa

# Custom check function
def check_len(v: str) -> bool:
    return len(v) == 20

schema = pa.DataFrameSchema(
    {
        "fruit": pa.Column(
            dtype=str,
            checks=pa.Check(check_len, element_wise=True),
        ),
    }
)

lf = pl.LazyFrame(
    {
        "fruit": ["apple", "pear", "banana"],
    }
)

df = lf.collect()
df.pipe(schema.validate)
# C:\local\.venv\Lib\site-packages\pandera\backends\polars\base.py:74: MapWithoutReturnDtypeWarning: Calling `map_elements` without specifying `return_dtype` can lead to unpredictable results. Specify `return_dtype` to silence this warning.
#   passed = check_result.check_passed.collect().item()
# C:\local\.venv\Lib\site-packages\pandera\backends\polars\base.py:88: MapWithoutReturnDtypeWarning: Calling `map_elements` without specifying `return_dtype` can lead to unpredictable results. Specify `return_dtype` to silence this warning.
#   failure_cases = check_result.failure_cases.collect()
# C:\local\.venv\Lib\site-packages\pandera\backends\polars\base.py:112: MapWithoutReturnDtypeWarning: Calling `map_elements` without specifying `return_dtype` can lead to unpredictable results. Specify `return_dtype` to silence this warning.
#   check_output=check_result.check_output.collect(),
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "C:\local\.venv\Lib\site-packages\polars\dataframe\frame.py", line 5150, in pipe
#     return function(self, *args, **kwargs)
#            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "C:\local\.venv\Lib\site-packages\pandera\api\polars\container.py", line 58, in validate
#     output = self.get_backend(check_obj).validate(
#              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "C:\local\.venv\Lib\site-packages\pandera\backends\polars\container.py", line 114, in validate
#     error_handler.collect_error(
#   File "C:\local\.venv\Lib\site-packages\pandera\api\base\error_handler.py", line 54, in collect_error
#     raise schema_error from original_exc
#   File "C:\local\.venv\Lib\site-packages\pandera\backends\polars\container.py", line 182, in run_schema_component_checks
#     result = schema_component.validate(check_obj, lazy=lazy)
#              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "C:\local\.venv\Lib\site-packages\pandera\api\polars\components.py", line 141, in validate
#     output = self.get_backend(check_obj).validate(
#              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "C:\local\.venv\Lib\site-packages\pandera\backends\polars\components.py", line 81, in validate
#     error_handler = self.run_checks_and_handle_errors(
#                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "C:\local\.venv\Lib\site-packages\pandera\backends\polars\components.py", line 147, in run_checks_and_handle_errors
#     error_handler.collect_error(
#   File "C:\local\.venv\Lib\site-packages\pandera\api\base\error_handler.py", line 54, in collect_error
#     raise schema_error from original_exc
# pandera.errors.SchemaError: Column 'fruit' failed validator number 0: <Check check_len> failure case examples: [{'fruit': 'apple'}, {'fruit': 'pear'}, {'fruit': 'banana'}]

Expected behavior

I would expect to see a schema validation error raised with the LazyFrame here since none of the fruit values have a string length of 20 characters.

cosmicBboy commented 1 month ago

See the docs here https://pandera.readthedocs.io/en/latest/polars.html#error-reporting

This is intended behavior: LazyFrame validation only runs schema-level checks (so as not to materialize the data in a lazy method chain). Currently, pandera assumes that all custom checks operate on data. You can force data-level checks by explicitly setting export PANDERA_VALIDATION_DEPTH=SCHEMA_AND_DATA.
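
The same setting can be applied from Python instead of the shell. This is a minimal sketch, assuming the variable is named PANDERA_VALIDATION_DEPTH (the name used in pandera's configuration docs) and that pandera reads it at import time, so it has to be set before the first pandera import. The function name validate_lazy is hypothetical.

```python
import os

# Assumption: pandera reads this variable when it is first imported,
# so it must be set before any pandera import in the process.
os.environ["PANDERA_VALIDATION_DEPTH"] = "SCHEMA_AND_DATA"


def validate_lazy():
    """Validate the LazyFrame from the report with data-level checks forced on."""
    # Imported here so the environment variable above is already in place.
    import polars as pl
    import pandera.polars as pa

    # Custom check function from the original report
    def check_len(v: str) -> bool:
        return len(v) == 20

    schema = pa.DataFrameSchema(
        {
            "fruit": pa.Column(
                dtype=str,
                checks=pa.Check(check_len, element_wise=True),
            ),
        }
    )
    lf = pl.LazyFrame({"fruit": ["apple", "pear", "banana"]})
    # With data-level checks enabled, this is expected to raise
    # pandera.errors.SchemaError rather than pass silently.
    return lf.pipe(schema.validate).collect()
```

Running export PANDERA_VALIDATION_DEPTH=SCHEMA_AND_DATA in the shell before launching Python should have the same effect without touching the code.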

cosmicBboy commented 1 month ago

Is this a duplicate of #1565?

philiporlando commented 1 month ago

See the docs here https://pandera.readthedocs.io/en/latest/polars.html#error-reporting

This is intended behavior: LazyFrame validation only runs schema-level checks (so as not to materialize the data in a lazy method chain). Currently, pandera assumes that all custom checks operate on data. You can force data-level checks by explicitly setting export PANDERA_VALIDATION_DEPTH=SCHEMA_AND_DATA.

This is super helpful and makes total sense. Thanks for the feedback.

philiporlando commented 1 month ago

Is this a duplicate of #1565?

I don't think so. The error that I'm experiencing in #1565 is specific to pl.DataFrame.

cosmicBboy commented 1 month ago

Gotcha, yeah looks like a bug, looking.

cosmicBboy commented 1 month ago

@philiporlando would it make sense to add some logging at validation time to explicitly say what types of checks are being run? If so, would it make sense as logging.info, debug or something else?

philiporlando commented 1 month ago

@philiporlando would it make sense to add some logging at validation time to explicitly say what types of checks are being run? If so, would it make sense as logging.info, debug or something else?

I'm in favor of this! At the very least, I think it would be helpful to communicate which data-level checks are ignored whenever a LazyFrame is validated instead of a DataFrame. It might even make sense to log a warning here?
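
A rough sketch of what such a warning could look like, using only the standard library. Nothing here is pandera API: report_skipped_checks, the logger name, and the message wording are all hypothetical, and a real implementation would live inside pandera's own validation backends.

```python
import logging

# Hypothetical logger name, chosen only for this illustration.
logger = logging.getLogger("validation_sketch")


def report_skipped_checks(is_lazy: bool, data_check_names: list[str]) -> list[str]:
    """Warn about each data-level check skipped for a lazy input.

    Returns the names of the skipped checks (empty for eager inputs).
    """
    if not is_lazy:
        # Eager frames run every check, so nothing is skipped.
        return []
    for name in data_check_names:
        logger.warning(
            "Skipping data-level check %r: input is lazy and only "
            "schema-level checks are run",
            name,
        )
    return list(data_check_names)


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    report_skipped_checks(True, ["check_len"])
```

Using the warning level rather than info or debug means the message shows up under Python's default logging configuration, which matters when a check is silently not being run.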

philiporlando commented 1 month ago

Gotcha, yeah looks like a bug, looking.

Thank you for looking into it!