unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

failure_case conversion failed : polars.exceptions.ComputeError - pandera(0.19.0b3) with polars #1607

Closed obiii closed 2 weeks ago

obiii commented 3 weeks ago

Describe the bug

We are trying a simple validation example using polars, and we can't work out what the problem is or where it originates. Validation throws a polars.exceptions.ComputeError whenever any check fails and the data contains a null.

For example, in the code below, the dummy data contains an extract_date column with a None. It runs fine if all the case_id values are int-convertible strings, but throws the exception if any case_id is not int-convertible.

Here is the code:

import pandera.polars as pa
import polars as pl
from datetime import date
import json

class CaseSchema(pa.DataFrameModel):
    case_id: int = pa.Field(nullable=False, unique=True, coerce=True)
    gdwh_portfolio_id: str = pa.Field(nullable=False, unique=True, coerce=True)
    extract_date: date = pa.Field(nullable=True, coerce=True)

    class Config:
        drop_invalid_rows = True

invalid_lf = pl.DataFrame({
    #"case_id": ["1", "2", "3"],
    "case_id": ["1", "2", "abc"],
    "gdwh_portfolio_id": ["d", "e", "f"],
    "extract_date": [date(2024,1,1), date(2024,1,2), None]
})

try:
    CaseSchema.validate(invalid_lf, lazy=True)
except pa.errors.SchemaErrors as e:
    print(json.dumps(e.message, indent=4))

It gives: conversion from `struct[3]` to `str` failed in column 'failure_case' for 1 out of 1 values: [{"abc","f",null}]

If you uncomment "case_id": ["1", "2", "3"] and comment out "case_id": ["1", "2", "abc"], it runs fine.

Not sure why it panics when there are nulls. If there are no nulls in the data it works fine.

The trace we get is:


> Traceback (most recent call last):
>   File "<frozen runpy>", line 198, in _run_module_as_main
>   File "<frozen runpy>", line 88, in _run_code
>   File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/erehoba-acc-payments-req/code/Users/ourrehman/dna-payments-and-accounts/data_validation/test.py", line 22, in <module>
>     CaseSchema.validate(invalid_lf, lazy=True)
>   File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/pandera/api/dataframe/model.py", line 289, in validate
>     cls.to_schema().validate(
>   File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/pandera/api/polars/container.py", line 58, in validate
>     output = self.get_backend(check_obj).validate(
>              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/pandera/backends/polars/container.py", line 65, in validate
>     check_obj = parser(check_obj, *args)
>                 ^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/pandera/backends/polars/container.py", line 398, in coerce_dtype
>     check_obj = self._coerce_dtype_helper(check_obj, schema)
>                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/pandera/backends/polars/container.py", line 486, in _coerce_dtype_helper
>     raise SchemaErrors(
>           ^^^^^^^^^^^^^
>   File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/pandera/errors.py", line 183, in __init__
>     ).failure_cases_metadata(schema.name, schema_errors)
>       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/pandera/backends/polars/base.py", line 173, in failure_cases_metadata
>     ).cast(
>       ^^^^^
>   File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/polars/dataframe/frame.py", line 6624, in cast
>     return self.lazy().cast(dtypes, strict=strict).collect(_eager=True)
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/anaconda/envs/pandera-polars/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 1810, in collect
>     return wrap_df(ldf.collect())
>                    ^^^^^^^^^^^^^
> polars.exceptions.ComputeError: conversion from `struct[3]` to `str` failed in column 'failure_case' for 1 out of 1 values: [{"abc","f",null}]

Expected behavior

Validation should work with columns that contain nulls and are set to nullable=True.

versions

pandera: 0.19.0b3
polars: 0.20.23
python: 3.11

cmdlineluser commented 3 weeks ago

I'm not a pandera user - but this is my understanding of why it is failing:

It seems the failure_case column can be a string or a struct.

In the case of a struct, this fails:

https://github.com/unionai-oss/pandera/blob/dbf18314fc9461b3f7af8ff6c4741a6dff0f99ac/pandera/backends/polars/base.py#L173-L175

import polars as pl

df = pl.DataFrame({
    'failure_case': [{'case_id': 'abc', 'extract_date': None}]
})

df.with_columns(pl.col("failure_case").cast(pl.String))
# ComputeError: conversion from `struct[2]` to `str` failed in column ...

A struct can be "stringified" in Polars via .struct.json_encode()

>>> df.with_columns(pl.col("failure_case").struct.json_encode())
shape: (1, 1)
┌───────────────────────────────────────┐
│ failure_case                          │
│ ---                                   │
│ str                                   │
╞═══════════════════════════════════════╡
│ {"case_id":"abc","extract_date":null} │
└───────────────────────────────────────┘

But I'm not sure if that's what pandera wants to do in this case.

cosmicBboy commented 2 weeks ago

Good catch! #1608 should address this

obiii commented 2 weeks ago

Hi @cosmicBboy

I was previously using 0.19.0b3, which I installed with pip install --pre 'pandera[polars]'

I don't see a new tag with your PR. Can you please let me know how to use/install the updates you have made in this PR?

cosmicBboy commented 2 weeks ago

Just cut a new beta release: https://github.com/unionai-oss/pandera/releases/tag/v0.19.0b4

obiii commented 2 weeks ago

Thanks!