`add_missing_columns` sometimes adds the same missing column multiple times

r-terada commented 1 year ago

Describe the bug

When `add_missing_columns` is set to `True` in a `DataFrameModel`, the same missing column is sometimes added more than once. I have not yet investigated the exact conditions, but it seems to occur when the dataframe has two or more extra columns.

[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of pandera.
[x] (optional) I have confirmed this bug exists on the master branch of pandera.

Code Sample, a copy-pastable example

import pandas as pd
import pandera as pa
from pandera.typing import Series

class TestAddMissingColumns(pa.DataFrameModel):
    col_a: Series[str]
    col_missing: Series[str] = pa.Field(nullable=True)

    class Config:
        add_missing_columns = True

schema = TestAddMissingColumns.to_schema()

df = pd.DataFrame({
    "col_a": ["a", "b", "c"],
    "col_b": ["d", "e", "f"]
})
print(schema.validate(df))
# -> works well
#  col_a col_missing col_b
# 0     a         NaN     d
# 1     b         NaN     e
# 2     c         NaN     f

df = pd.DataFrame({
    "col_a": ["a", "b", "c"],
    "col_b": ["d", "e", "f"],
    "col_c": ["g", "h", "i"]
})
print(schema.validate(df))
# -> it adds 2 "col_missing"
#  col_a col_missing col_b col_missing col_c
# 0     a         NaN     d         NaN     g
# 1     b         NaN     e         NaN     h
# 2     c         NaN     f         NaN     i

Expected behavior

Only one col_missing column should be added:

#  col_a col_missing col_b col_c
# 0     a         NaN     d     g
# 1     b         NaN     e     h
# 2     c         NaN     f     i

Environment

OS: macOS Monterey (12.6.2)

python and lib versions

$ python -V
Python 3.10.3
$ pip freeze | grep "pandas\|pandera"
geopandas==0.14.0
pandas==2.0.3
pandas-stubs==2.0.3.230814
-e git+https://github.com/unionai-oss/pandera@ceeae10f0fcca5f34de99d5e1c107ddacff51b73#egg=pandera

derinwalters commented 1 year ago

Nice find! This looks like a bug in the missing-column insertion logic. It occurs when multiple columns that are not in the schema, in this case "col_b" and "col_c", are positioned after the missing column's location in the dataframe being validated. Thank you for the great working example. I'll submit a pull request shortly.
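
For illustration, here is a minimal pandas-only sketch of the intended behavior: each missing schema column is inserted exactly once, directly after the last preceding schema column found in the dataframe. This is not pandera's implementation, and `insert_missing_once` and its parameters are hypothetical names chosen for this example.

```python
import pandas as pd


def insert_missing_once(df, schema_columns):
    """Return a copy of df with each missing schema column added exactly once.

    Hypothetical helper for illustration only; not part of pandera.
    """
    out = df.copy()
    insert_at = 0  # position where the next missing column would go
    for name in schema_columns:
        if name in out.columns:
            # Move the insertion point to just after the last occurrence
            # of this schema column (also handles non-unique names).
            positions = [i for i, col in enumerate(out.columns) if col == name]
            insert_at = positions[-1] + 1
        else:
            # Add the missing column once, filled with nulls.
            nulls = pd.Series([None] * len(out), index=out.index, dtype=object)
            out.insert(insert_at, name, nulls)
            insert_at += 1
    return out


df = pd.DataFrame({
    "col_a": ["a", "b", "c"],
    "col_b": ["d", "e", "f"],
    "col_c": ["g", "h", "i"],
})
result = insert_missing_once(df, ["col_a", "col_missing"])
print(list(result.columns))
# ['col_a', 'col_missing', 'col_b', 'col_c']
```

Because the insertion point is tracked per schema column rather than per dataframe column, the number of extra columns that follow has no effect on how many times a missing column is added.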

r-terada commented 1 year ago

Thank you for the investigation and quick fix! I'm waiting for your pull request to be merged :)

aphorton commented 7 months ago

First, thank you very much for this fantastic package.

The code in OP's example now runs as intended on pandera 0.18.0, but a dataframe with non-unique column names triggers a similar column-duplication problem.

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {
        "col_a": pa.Column(str),
        "col_missing": pa.Column(str, nullable=True)
    },
    add_missing_columns=True
)

df = pd.DataFrame({
    "col_a": ["a", "b", "c"],
    "col_b": ["d", "e", "f"]
})

print(schema.validate(df))
# -> works well
#   col_a col_missing col_b
# 0     a        None     d
# 1     b        None     e
# 2     c        None     f

df.columns = ["col_a", "col_a"]
print(schema.validate(df))
# -> duplicates columns
#   col_a col_a col_missing col_a col_a
# 0     a     d        None     a     d
# 1     b     e        None     b     e
# 2     c     f        None     c     f

Expected behavior

Only one col_missing column should be added, and the existing columns should not be repeated:
#   col_a col_a col_missing
# 0     a     d        None
# 1     b     e        None
# 2     c     f        None
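
A small check along these lines could flag the problem in a test. It is a sketch using only pandas and the standard library: it compares how often each column name appears before and after validation, allowing a newly added column to appear once. `extra_column_copies` is a hypothetical helper name, and the two "after" frames below are built by hand from the outputs shown above rather than produced by pandera.

```python
from collections import Counter

import pandas as pd


def extra_column_copies(before, after):
    """Map column name -> number of unexpected extra copies in `after`.

    Hypothetical helper for illustration only; not part of pandera.
    """
    before_counts = Counter(before.columns)
    after_counts = Counter(after.columns)
    extras = {}
    for name, count in after_counts.items():
        # A pre-existing name may appear as often as it did before;
        # a newly added name may appear exactly once.
        allowed = max(before_counts.get(name, 0), 1)
        if count > allowed:
            extras[name] = count - allowed
    return extras


rows = [["a", "d"], ["b", "e"], ["c", "f"]]
before = pd.DataFrame(rows, columns=["col_a", "col_a"])

# Hand-built stand-ins for the observed and the expected validation output.
observed = pd.DataFrame(
    [[a, d, None, a, d] for a, d in rows],
    columns=["col_a", "col_a", "col_missing", "col_a", "col_a"],
)
expected = pd.DataFrame(
    [[a, d, None] for a, d in rows],
    columns=["col_a", "col_a", "col_missing"],
)

print(extra_column_copies(before, observed))  # {'col_a': 2}
print(extra_column_copies(before, expected))  # {}
```

Counting names instead of dropping duplicated ones keeps the two legitimate "col_a" columns from the input intact while still reporting the repeated copies.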
