unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Lazy schema validation does not raise expected errors with polars dataframes #1583

Closed: philiporlando closed this issue 1 week ago

philiporlando commented 1 week ago

Describe the bug

I have a polars dataframe that should raise multiple schema validation errors. I want to see all of the errors at once, so I'm setting lazy=True when performing schema validation. Currently, none of the expected errors are raised. Additionally, an unexpected AttributeError is raised when switching from a LazyFrame to a DataFrame.

Code Sample, a copy-pastable example

This code sample should raise schema validation errors on multiple columns, but nothing is returned.

import pandera.polars as pa
from pandera.polars import Check, Column
import polars as pl

x = pl.LazyFrame(
    {
        "foo": ["bar", "baz", "test", "tester"],
        "fruit": ["strawberry", "pear", "banana", "apple"],
        "fruit2": ["strawberry", "pear", "banana", None],
    }
)

s = pa.DataFrameSchema(
    {
        "foo": Column(str, Check.str_length(max_value=4), required=True),
        "fruit": Column(
            str,
            checks=Check.isin(["apple", "strawberry", "pear"]),
            nullable=False,
        ),
        "fruit2": Column(
            str,
            checks=Check.isin(["apple", "strawberry", "pear"]),
            nullable=False,
        ),
    }
)

try:
    s.validate(x, lazy=True).collect() # should raise errors on all three columns 
except pa.errors.SchemaErrors as e:
    print(e.failure_cases)

# Nothing is returned...

When I switch from using a LazyFrame to a DataFrame, I see this error: AttributeError: 'NoneType' object has no attribute 'with_row_count'

import pandera.polars as pa
from pandera.polars import Check, Column
import polars as pl

x = pl.DataFrame(
    {
        "foo": ["bar", "baz", "test", "tester"],
        "fruit": ["strawberry", "pear", "banana", "apple"],
        "fruit2": ["strawberry", "pear", "banana", None],
    }
)

s = pa.DataFrameSchema(
    {
        "foo": Column(str, Check.str_length(max_value=4), required=True),
        "fruit": Column(
            str,
            checks=Check.isin(["apple", "strawberry", "pear"]),
            nullable=False,
        ),
        "fruit2": Column(
            str,
            checks=Check.isin(["apple", "strawberry", "pear"]),
            nullable=False,
        ),
    }
)

try:
    s.validate(x, lazy=True).collect() # should raise errors on all three columns 
except pa.errors.SchemaErrors as e:
    print(e.failure_cases)

# Traceback (most recent call last):
#   File "C:\local\project\test_polars_validate_lazy_true.py", line 26, in <module>
#     s.validate(x, lazy=True).collect() # should raise errors on all three columns
#     ^^^^^^^^^^^^^^^^^^^^^^^^
#   File "C:\local\project\.venv\Lib\site-packages\pandera\api\polars\container.py", line 58, in validate
#     output = self.get_backend(check_obj).validate(
#              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "C:\local\project\.venv\Lib\site-packages\pandera\backends\polars\container.py", line 92, in validate
#     results = check(*args)  # type: ignore[operator]
#               ^^^^^^^^^^^^
#   File "C:\local\project\.venv\Lib\site-packages\pandera\backends\polars\container.py", line 182, in run_schema_component_checks
#     result = schema_component.validate(check_obj, lazy=lazy)
#              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "C:\local\project\.venv\Lib\site-packages\pandera\api\polars\components.py", line 143, in validate
#     output = self.get_backend(check_obj).validate(
#              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "C:\local\project\.venv\Lib\site-packages\pandera\backends\polars\components.py", line 95, in validate
#     raise SchemaErrors(
#           ^^^^^^^^^^^^^
#   File "C:\local\project\.venv\Lib\site-packages\pandera\errors.py", line 183, in __init__
#     ).failure_cases_metadata(schema.name, schema_errors)
#       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "C:\local\project\.venv\Lib\site-packages\pandera\backends\polars\base.py", line 151, in failure_cases_metadata
#     index = err.check_output.with_row_count("index").filter(
#             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# AttributeError: 'NoneType' object has no attribute 'with_row_count'
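
The last frame of the traceback shows that `err.check_output` was `None` when `failure_cases_metadata` called `.with_row_count("index")` on it. As a rough illustration of that failure mode, here is a plain-Python sketch of a guard for errors that carry no per-row check output. `FakeSchemaError` and `failing_indexes` are hypothetical names invented for this example; this is not pandera's code, and it is not necessarily how the library handles the case.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeSchemaError:
    """Hypothetical stand-in for one collected schema error."""

    check: str
    # Per-row pass/fail flags; None when the failure is not row-based.
    check_output: Optional[List[bool]] = None


def failing_indexes(err: FakeSchemaError) -> List[int]:
    """Return the row indexes that failed, or [] when there is no check output."""
    if err.check_output is None:
        # Without this guard, calling a method on None raises AttributeError.
        return []
    return [i for i, passed in enumerate(err.check_output) if not passed]


row_level = FakeSchemaError("isin", [True, True, False, True])
no_output = FakeSchemaError("not_nullable")

print(failing_indexes(row_level))
print(failing_indexes(no_output))
```

The point of the sketch is only that an error without row-level output needs its own branch before any row-index lookup is attempted.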

Expected behavior

I would expect the output to match what the same schema returns when validating the equivalent pandas dataframe:

import pandas as pd
import pandera as pa
from pandera import Check, Column

x = pd.DataFrame(
    {
        "foo": ["bar", "baz", "test", "tester"],
        "fruit": ["strawberry", "pear", "banana", "apple"],
        "fruit2": ["strawberry", "pear", "banana", None],
    }
)

s = pa.DataFrameSchema(
    {
        "foo": Column(str, Check.str_length(max_value=4), required=True),
        "fruit": Column(
            str,
            checks=Check.isin(["apple", "strawberry", "pear"]),
            nullable=False,
        ),
        "fruit2": Column(
            str,
            checks=Check.isin(["apple", "strawberry", "pear"]),
            nullable=False,
        ),
    }
)

try:
    s.validate(x, lazy=True)
except pa.errors.SchemaErrors as e:
    print(e.failure_cases)

#   schema_context  column                                  check check_number failure_case  index
# 0         Column     foo                    str_length(None, 4)            0       tester      3
# 1         Column   fruit  isin(['apple', 'strawberry', 'pear'])            0       banana      2
# 2         Column  fruit2                           not_nullable         None         None      3
# 3         Column  fruit2  isin(['apple', 'strawberry', 'pear'])            0       banana      2
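
To make the expected rows easier to check by hand, the following plain-Python sketch walks the same data and collects every failure instead of stopping at the first one, which is the behavior `lazy=True` is meant to provide. It needs neither pandera nor polars. The names `rules` and `collect_failures` are hypothetical, and the sketch assumes that value checks skip nulls, which is consistent with the table above (the null in `fruit2` appears only under `not_nullable`).

```python
# Plain-Python sketch of lazy error collection; not pandera's implementation.
data = {
    "foo": ["bar", "baz", "test", "tester"],
    "fruit": ["strawberry", "pear", "banana", "apple"],
    "fruit2": ["strawberry", "pear", "banana", None],
}

ALLOWED = ["apple", "strawberry", "pear"]
ISIN_NAME = f"isin({ALLOWED})"

# Hypothetical rule table: column -> list of (check name, predicate).
rules = {
    "foo": [("str_length(None, 4)", lambda v: len(v) <= 4)],
    "fruit": [(ISIN_NAME, lambda v: v in ALLOWED)],
    "fruit2": [(ISIN_NAME, lambda v: v in ALLOWED)],
}


def collect_failures(data, rules):
    """Return every failure as (column, check, failure_case, index)."""
    failures = []
    for column, checks in rules.items():
        values = data[column]
        # Nullability comes first; every column in this schema is non-nullable.
        for index, value in enumerate(values):
            if value is None:
                failures.append((column, "not_nullable", None, index))
        # Assumption: value checks skip nulls, so a null is reported only once.
        for name, predicate in checks:
            for index, value in enumerate(values):
                if value is not None and not predicate(value):
                    failures.append((column, name, value, index))
    return failures


failures = collect_failures(data, rules)
for row in failures:
    print(row)
```

The four tuples it prints correspond to the four rows of the pandas failure-case table: one for `foo`, one for `fruit`, and two for `fruit2`.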

cosmicBboy commented 1 week ago

Thanks for finding this one! See: https://github.com/unionai-oss/pandera/pull/1586

cosmicBboy commented 1 week ago

Please keep these bug reports coming! It helps to iron these out before the stable 0.19.0 release.

philiporlando commented 1 week ago

Just pulled the changes within #1586 and can confirm that the expected output is returned! Thanks for addressing these so quickly!
