mxblsdl closed this issue 2 months ago
https://pandera.readthedocs.io/en/stable/polars.html#how-it-works
I think this behaviour is expected. `pa.Check.in_range(min_value=5, max_value=20)` cannot be performed on a `pl.LazyFrame` object because it requires reading the data.
So are checks never assessed for `LazyFrame` objects?
I feel like the documentation should make this more explicit or a warning should be issued. The top example comes directly from Pandera documentation and having a check that is never assessed creates a false sense of coverage.
Checks are assessed for `LazyFrame` objects, but only those that don't require the data to be present in memory are evaluated, most importantly data types.
This is expected behavior @mxblsdl.
> I feel like the documentation should make this more explicit
I believe it already does, see https://pandera.readthedocs.io/en/stable/polars.html#how-it-works already linked by @kacper-sellforte.
> or a warning should be issued
This is also a good idea. I think a better logging experience here would be helpful. Would you mind opening up a separate issue for this request?
The correct way to support this would be for polars to have a first-class expression that asserts whether a column contains any False values, in which case pandera could catch the error lazily when the LazyFrame is evaluated. I opened an issue in the polars project: https://github.com/pola-rs/polars/issues/16120
Also see https://pandera.readthedocs.io/en/stable/polars.html#data-level-validation-with-lazyframes. You can set the environment variable `export PANDERA_VALIDATION_DEPTH=SCHEMA_AND_DATA` and pandera will do a `LazyFrame.collect` call under the hood and convert the result back into a `LazyFrame`.
okay thank you for taking a look at this. I guess I was just confused on the limits of lazyframe evaluation. I will experiment with the env variable mentioned above and close the issue.
Describe the bug
The column checks on polars LazyFrames are not registering errors when they should. Values outside of a defined range pass validation with no warnings or errors. This is not true for a polars DataFrame, which does register an error.
It looks like this was addressed in a recent PR but I am still seeing the bug in the 0.19.3 release.
Code Sample,
Expected behavior
I would expect a `pandera.errors.SchemaError` to be raised. Note that the `polars.DataFrame` version of this code does raise an error.
Desktop (please complete the following information):