unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Polars checks not being evaluated correctly #1662

Closed mxblsdl closed 2 months ago

mxblsdl commented 4 months ago

Describe the bug

The column checks on polars LazyFrames are not registering errors when they should. Values outside of a defined range pass validation with no warnings or errors. This is not the case for a polars DataFrame, which does register an error.

It looks like this was addressed in a recent PR but I am still seeing the bug in the 0.19.3 release.

Code Sample

# This code is taken from the examples page [here](https://pandera--1373.org.readthedocs.build/en/1373/polars.html)
# with values changed to fall outside the defined range.

import pandera.polars as pa
import polars as pl

schema = pa.DataFrameSchema(
    {
        "state": pa.Column(str),
        "city": pa.Column(str),
        "price": pa.Column(int, pa.Check.in_range(min_value=5, max_value=20)), # check is defined
    }
)

lf = pl.LazyFrame(
    {
        "state": ["FL", "FL", "FL", "CA", "CA", "CA"],
        "city": [
            "Orlando",
            "Miami",
            "Tampa",
            "San Francisco",
            "Los Angeles",
            "San Diego",
        ],
        "price": [2, 12, 10, 16, 20, 180], # values outside of defined range are passed
    }
)
print(schema.validate(lf).collect()) # no errors are raised

Expected behavior

I would expect a pandera.errors.SchemaError to be raised. Note that the polars.DataFrame version of this code does raise an error.

import pandera.polars as pa
import polars as pl

schema = pa.DataFrameSchema(
    {
        "state": pa.Column(str),
        "city": pa.Column(str),
        "price": pa.Column(int, pa.Check.in_range(min_value=5, max_value=20)),
    }
)

df = pl.DataFrame(  # eager DataFrame, so data-level checks run
    {
        "state": ["FL", "FL", "FL", "CA", "CA", "CA"],
        "city": [
            "Orlando",
            "Miami",
            "Tampa",
            "San Francisco",
            "Los Angeles",
            "San Diego",
        ],
        "price": [2, 12, 10, 16, 20, 180],
    }
)
print(schema.validate(df))

kacper-sellforte commented 3 months ago

https://pandera.readthedocs.io/en/stable/polars.html#how-it-works

I think this behaviour is expected. pa.Check.in_range(min_value=5, max_value=20) cannot be performed on a pl.LazyFrame object, as it requires reading the data.

mxblsdl commented 3 months ago

So are checks never assessed for LazyFrame objects?

I feel like the documentation should make this more explicit or a warning should be issued. The top example comes directly from Pandera documentation and having a check that is never assessed creates a false sense of coverage.

kacper-sellforte commented 3 months ago

Checks are assessed for LazyFrame objects, but only those that don't require the data to be present in memory are evaluated, most importantly data type checks.

cosmicBboy commented 2 months ago

This is expected behavior @mxblsdl.

> I feel like the documentation should make this more explicit

I believe it already does; see https://pandera.readthedocs.io/en/stable/polars.html#how-it-works, already linked by @kacper-sellforte.

> or a warning should be issued

This is also a good idea. I think a better logging experience here would be helpful. Would you mind opening up a separate issue for this request?

The correct way to support this would be for polars to provide a first-class expression that asserts a column contains no False values, in which case pandera could catch the error lazily when the LazyFrame is evaluated. I opened up an issue in the polars project: https://github.com/pola-rs/polars/issues/16120

cosmicBboy commented 2 months ago

Also see https://pandera.readthedocs.io/en/stable/polars.html#data-level-validation-with-lazyframes. You can set the environment variable with export PANDERA_VALIDATION_DEPTH=SCHEMA_AND_DATA, and pandera will call LazyFrame.collect under the hood and convert the result back into a LazyFrame.

mxblsdl commented 2 months ago

Okay, thank you for taking a look at this. I guess I was just confused about the limits of LazyFrame evaluation. I will experiment with the environment variable mentioned above and close the issue.