unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

`add_missing_columns` sometimes adds same missing column multiple times #1370

Open r-terada opened 1 year ago

r-terada commented 1 year ago

Describe the bug

When `add_missing_columns` is set to `True` in a `DataFrameModel`, the same missing column is sometimes added more than once. I have not yet investigated the exact conditions, but it appears to happen when the dataframe has two or more extra columns that are not in the schema.

Code Sample, a copy-pastable example

```python
import pandas as pd
import pandera as pa
from pandera.typing import Series

class TestAddMissingColumns(pa.DataFrameModel):
    col_a: Series[str]
    col_missing: Series[str] = pa.Field(nullable=True)

    class Config:
        add_missing_columns = True

schema = TestAddMissingColumns.to_schema()

df = pd.DataFrame({
    "col_a": ["a", "b", "c"],
    "col_b": ["d", "e", "f"]
})
print(schema.validate(df))
# -> works well
#  col_a col_missing col_b
# 0     a         NaN     d
# 1     b         NaN     e
# 2     c         NaN     f

df = pd.DataFrame({
    "col_a": ["a", "b", "c"],
    "col_b": ["d", "e", "f"],
    "col_c": ["g", "h", "i"]
})
print(schema.validate(df))
# -> it adds 2 "col_missing"
#  col_a col_missing col_b col_missing col_c
# 0     a         NaN     d         NaN     g
# 1     b         NaN     e         NaN     h
# 2     c         NaN     f         NaN     i
```

Expected behavior

Only one `col_missing` column should be added:

```
#  col_a col_missing col_b col_c
# 0     a         NaN     d     g
# 1     b         NaN     e     h
# 2     c         NaN     f     i
```


derinwalters commented 1 year ago

Nice find! This looks like a bug in the missing-column insertion logic. It occurs when multiple columns that are not in the schema (here "col_b" and "col_c") are positioned after the location where the missing column is inserted in the dataframe being validated. Thank you for the great working example. I'll submit a pull request shortly.
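To make the expected behavior concrete, here is a minimal pandas-only sketch of inserting each missing schema column exactly once, no matter how many extra columns follow the insertion point. It is not pandera's actual implementation: the helper name `add_missing_once`, its parameters, and the NaN fill value are hypothetical, and it assumes the dataframe's column names are unique.

```python
import pandas as pd


def add_missing_once(df, schema_columns, fill_value=float("nan")):
    """Hypothetical helper: insert each missing schema column exactly once.

    Assumes df has unique column names. A missing column is placed right
    after the closest preceding schema column that is already present.
    """
    out = df.copy()
    insert_at = 0  # position where the next missing column would go
    for name in schema_columns:
        if name in out.columns:
            # Column exists: the next missing column goes right after it.
            insert_at = out.columns.get_loc(name) + 1
        else:
            # Column is missing: insert it once and move past it.
            out.insert(insert_at, name, fill_value)
            insert_at += 1
    return out


df = pd.DataFrame({
    "col_a": ["a", "b", "c"],
    "col_b": ["d", "e", "f"],
    "col_c": ["g", "h", "i"],
})
result = add_missing_once(df, ["col_a", "col_missing"])
print(list(result.columns))
# -> ['col_a', 'col_missing', 'col_b', 'col_c']
```

Because the loop iterates over the schema columns rather than over the dataframe's columns, the number of extra columns after the insertion point has no effect on how many times `col_missing` is added.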

r-terada commented 1 year ago

Thank you for the investigation and quick fix! I'm waiting for your pull request to be merged :)

aphorton commented 7 months ago

First, thank you very much for this fantastic package.

The code in OP's example now runs as intended with pandera 0.18.0, but a dataframe with non-unique column names triggers a similar column-duplication problem.

```python
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {
        "col_a": pa.Column(str),
        "col_missing": pa.Column(str, nullable=True)
    },
    add_missing_columns=True
)

df = pd.DataFrame({
    "col_a": ["a", "b", "c"],
    "col_b": ["d", "e", "f"]
})

print(schema.validate(df))
# -> works well
#   col_a col_missing col_b
# 0     a        None     d
# 1     b        None     e
# 2     c        None     f

df.columns = ["col_a", "col_a"]
print(schema.validate(df))
# -> duplicates columns
#   col_a col_a col_missing col_a col_a
# 0     a     d        None     a     d
# 1     b     e        None     b     e
# 2     c     f        None     c     f
```

Expected behavior

Only one `col_missing` column should be added, and the existing columns should not be duplicated:

```
#   col_a col_a col_missing
# 0     a     d        None
# 1     b     e        None
# 2     c     f        None
```
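Until this case is handled, one possible stopgap is to detect duplicated column labels before calling `validate`. The snippet below is only a sketch using plain pandas. The helper name `duplicated_labels` is hypothetical and not part of pandera.

```python
import pandas as pd


def duplicated_labels(df):
    """Hypothetical helper: return column labels that appear more than once."""
    cols = df.columns
    # Index.duplicated() marks every repeat after the first occurrence.
    return cols[cols.duplicated()].unique().tolist()


df = pd.DataFrame({
    "col_a": ["a", "b", "c"],
    "col_b": ["d", "e", "f"],
})
df.columns = ["col_a", "col_a"]

dupes = duplicated_labels(df)
if dupes:
    # Decide here whether to raise, rename, or drop before validating.
    print(f"Duplicated column labels: {dupes}")
```

This only flags the problem before validation. It does not change how `add_missing_columns` behaves.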