Open r-terada opened 1 year ago
Nice find! Looks like a bug in the missing column insertion logic that occurs when multiple columns not in the schema, in this case "col_b" and "col_c", are positioned after the missing column location in the dataframe to be validated. Thank you for the great working example. I'll submit a pull request shortly.
Thank you for the investigation and quick fix! I'm waiting for your pull request to be merged :)
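For readers following along, here is a rough sketch of the kind of insertion logic being discussed. It works on plain lists of column labels and is not pandera's actual implementation. The helper name `insert_missing_columns` and its behavior (placing each missing column right after the last occurrence of the preceding schema column) are assumptions made for illustration.

```python
def insert_missing_columns(df_columns, schema_columns):
    """Return df_columns with absent schema columns inserted once each.

    Hypothetical helper for illustration only; not pandera's implementation.
    """
    result = list(df_columns)
    insert_at = 0  # a missing column with no preceding schema column goes first
    for name in schema_columns:
        if name in result:
            # Anchor on the last occurrence so repeated labels are handled.
            last = len(result) - 1 - result[::-1].index(name)
            insert_at = last + 1
        else:
            result.insert(insert_at, name)
            insert_at += 1  # keep consecutive missing columns in schema order
    return result


schema_cols = ["col_a", "col_missing"]
# Two extra columns positioned after the missing column location.
print(insert_missing_columns(["col_a", "col_b", "col_c"], schema_cols))
# Repeated labels in the dataframe.
print(insert_missing_columns(["col_a", "col_a"], schema_cols))
```

Because the sketch tracks positions rather than selecting by label, extra columns after the insertion point are carried over exactly once.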
First, thank you very much for this fantastic package.
The code in OP's example now runs as intended with pandera 0.18.0, but non-unique column names cause a similar column-duplication problem:
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {
        "col_a": pa.Column(str),
        "col_missing": pa.Column(str, nullable=True)
    },
    add_missing_columns=True
)

df = pd.DataFrame({
    "col_a": ["a", "b", "c"],
    "col_b": ["d", "e", "f"]
})

print(schema.validate(df))
# -> works well
#   col_a col_missing col_b
# 0     a        None     d
# 1     b        None     e
# 2     c        None     f

df.columns = ["col_a", "col_a"]
print(schema.validate(df))
# -> duplicates columns
#   col_a col_a col_missing col_a col_a
# 0     a     d        None     a     d
# 1     b     e        None     b     e
# 2     c     f        None     c     f
# Expected: add only 1 col_missing
#   col_a col_a col_missing
# 0     a     d        None
# 1     b     e        None
# 2     c     f        None
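A small check like the following could help spot this kind of duplication in a test. It is only a sketch on plain label lists; the helper name `duplicated_by_validation` is hypothetical, and in practice one would presumably pass `list(df.columns)` and `list(schema.validate(df).columns)`.

```python
from collections import Counter


def duplicated_by_validation(before, after):
    """Map each input label to the number of extra copies found in `after`.

    Hypothetical helper for illustration; labels that only exist in `after`
    (such as a newly added missing column) are ignored.
    """
    before_counts = Counter(before)
    after_counts = Counter(after)
    return {
        label: after_counts[label] - before_counts[label]
        for label in before_counts
        if after_counts[label] > before_counts[label]
    }


before = ["col_a", "col_a"]
buggy = ["col_a", "col_a", "col_missing", "col_a", "col_a"]
expected = ["col_a", "col_a", "col_missing"]

print(duplicated_by_validation(before, buggy))     # {'col_a': 2}
print(duplicated_by_validation(before, expected))  # {}
```

An empty result means validation added no extra copies of the input columns.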
Describe the bug
When add_missing_columns is set to True in DataFrameModel, the same missing column can sometimes be added multiple times. The specific conditions under which this occurs have not yet been investigated, but it seems to occur when there are two or more extra columns.
Code Sample, a copy-pastable example
Expected behavior
Add only 1 col_missing.
Environment