unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

When mixing `drop_invalid_rows` on `DataFrameSchema` and `Column` level, we get non-intuitive behavior #1737

Open jherrmannNetfonds opened 2 months ago

jherrmannNetfonds commented 2 months ago

Describe the bug

When mixing drop_invalid_rows at the DataFrameSchema and Column level, we get non-intuitive behavior.

  1. If you set drop_invalid_rows as a DataFrameSchema parameter and do not set drop_invalid_rows on any column, all rows that fail validation are dropped. This works as expected.
  2. If you set drop_invalid_rows as a column parameter but not as a DataFrameSchema parameter, rows that fail validation are not dropped and no error is raised. See listing [1].
  3. If you set drop_invalid_rows=True on the DataFrameSchema and also set it on a Column, rows that fail in columns with drop_invalid_rows=True are not dropped and no error is raised, while rows that fail in columns with drop_invalid_rows=False are dropped. See listing [2].

If this behavior is intended, we should document it; otherwise, see the expected results below.

Code Sample

Listing [1]

import pandas as pd
from pandera import Check, Column, DataFrameSchema

df = pd.DataFrame(
    {
        "counter": [1, 2, 3, 4],
        "text": ["abc", "def", "ghi", None],
    }
)
schema = DataFrameSchema(
    {
        "counter": Column(
            int,
            checks=[Check(lambda x: x >= 3)],
            drop_invalid_rows=True,
        ),
        "text": Column(
            str,
            nullable=False,
            drop_invalid_rows=True,
        ),
    },
)

schema.validate(df, lazy=True)

counter  text
1        abc
2        def
3        ghi
4        None

Listing [2]

import pandas as pd
from pandera import Check, Column, DataFrameSchema

df = pd.DataFrame(
    {
        "counter": [1, 2, 3, 4],
        "text": ["abc", "def", "ghi", None],
    }
)
schema = DataFrameSchema(
    {
        "counter": Column(
            int,
            checks=[Check(lambda x: x >= 3)],
            drop_invalid_rows=True,
        ),
        "text": Column(
            str,
            nullable=False,
            drop_invalid_rows=False,
        ),
    },
    drop_invalid_rows=True,
)

schema.validate(df, lazy=True)

counter  text
1        abc
2        def
3        ghi

Expected behavior

For listing [1], I would expect rows that fail validation in columns with drop_invalid_rows=True to be dropped, or to get a warning that I have to set drop_invalid_rows=True as a DataFrameSchema parameter.

For listing [2], I would expect rows that fail in columns with drop_invalid_rows=True to be dropped, and failures in the other columns to raise an error.
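To make the expectation concrete, here is a minimal pure-Python sketch of the semantics described above. It is not pandera's implementation: the function name `resolve_failures` and its inputs are hypothetical, and it assumes the column-level flag alone decides whether a failing row is dropped or reported.

```python
def resolve_failures(failures, column_drop_flags):
    """Split failing rows into rows to drop and rows to report as errors.

    failures: dict mapping column name -> set of failing row indices.
    column_drop_flags: dict mapping column name -> drop_invalid_rows flag.
    Both structures are hypothetical and only illustrate the expectation.
    """
    rows_to_drop = set()
    rows_to_report = {}
    for column, failing_rows in failures.items():
        if not failing_rows:
            continue
        if column_drop_flags.get(column, False):
            # Column opted in: its failing rows are dropped silently.
            rows_to_drop |= set(failing_rows)
        else:
            # Column did not opt in: its failures should surface as an error.
            rows_to_report[column] = sorted(failing_rows)
    return sorted(rows_to_drop), rows_to_report


# Failing row positions for the data in the listings:
# counter fails for the values 1 and 2, text fails for None.
failures = {"counter": {0, 1}, "text": {3}}

# Listing [1]: both columns set drop_invalid_rows=True.
listing_1 = resolve_failures(failures, {"counter": True, "text": True})

# Listing [2]: only counter sets drop_invalid_rows=True.
listing_2 = resolve_failures(failures, {"counter": True, "text": False})

print(listing_1)
print(listing_2)
```

Under these assumptions, listing [1] would drop all three failing rows, while listing [2] would drop the two rows failing on counter and report the null in text as an error.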


cosmicBboy commented 2 months ago

Thanks for reporting this @jherrmannNetfonds. This is definitely a bug; the two cases you listed should work as you expect. Will look into this.