unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

strict=True is not very strict on the index #1493

Open smarie opened 6 months ago

smarie commented 6 months ago

Hi, first of all, thanks for this great library!

Today I found out that one of my dataframes was not validated as thoroughly as I thought, even though I was using strict=True. The reason is that strict=True does not imply any checks on the index.

Here is an example:

import pandas as pd
import pandera as pa

class FooModel(pa.DataFrameModel):
    a: pa.typing.Series[int]

    class Config:
        strict = True

As a user, when I run FooModel.validate(df) with strict=True, I would expect anything that FooModel gets wrong or fails to describe to raise an exception. Conversely, if no exception is raised, I conclude that my FooModel is correct.

Yet,

df = pd.DataFrame(index=["hello"], data={"a": [1]})
df.index.name = "foo"
FooModel.validate(df)

does not raise any error. In my opinion this somewhat breaks the semantics of strict=True, because it leaves room for flexibility in the dataframe being validated: in this example, the index of df has a non-None name and dtype object, and neither is checked. Do you agree?

I would suggest changing strict=True so that, when the schema contains no specification for the index, it validates that the index is the default pandas index (a RangeIndex with no name).
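A minimal sketch of such a check, using plain pandas. The helper name is_default_index is hypothetical and not part of pandera, and "default" is assumed here to mean an unnamed RangeIndex that starts at 0 with step 1:

```python
import pandas as pd


def is_default_index(df: pd.DataFrame) -> bool:
    """Return True if df carries an unnamed RangeIndex starting at 0 with step 1.

    Hypothetical helper for illustration only; not a pandera API.
    """
    idx = df.index
    return (
        isinstance(idx, pd.RangeIndex)
        and idx.start == 0
        and idx.step == 1
        and idx.name is None
    )


# A dataframe built without an explicit index gets the default RangeIndex.
default_df = pd.DataFrame({"a": [1]})

# The dataframe from the example above: string labels and a named index.
named_df = pd.DataFrame(index=["hello"], data={"a": [1]})
named_df.index.name = "foo"

print(is_default_index(default_df))
print(is_default_index(named_df))
```

Until something like this exists in the library, a check of this kind could be run next to FooModel.validate(df) to catch an unexpected index.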

daniel-ene-heni commented 3 months ago

I noticed this too. Regardless of this flag, Pandera raises a column_in_dataframe check error only if a non-nullable column is missing. However, a column missing altogether is a different issue, separate from nullability, and of a different severity.

cosmicBboy commented 3 months ago

currently strict only operates on columns: https://github.com/unionai-oss/pandera/blob/main/pandera/backends/pandas/container.py#L480C9-L531
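To illustrate what "only operates on columns" means in practice, here is a conceptual sketch. It is not pandera's actual code (see the link above for that), and the function name unexpected_columns is hypothetical. It compares the dataframe's column labels against the schema's and never looks at the index:

```python
from typing import List

import pandas as pd


def unexpected_columns(df: pd.DataFrame, schema_columns: List[str]) -> List[str]:
    """Return the columns of df that the schema does not declare.

    Conceptual illustration only: the index is never consulted.
    """
    return [col for col in df.columns if col not in schema_columns]


# Same shape as the example in this issue, plus one undeclared column "b".
df = pd.DataFrame(index=["hello"], data={"a": [1], "b": [2]})
df.index.name = "foo"

extra = unexpected_columns(df, ["a"])
print(extra)
```

A column-based comparison like this flags the extra column "b" but has no way to notice the index name "foo" or the index dtype.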

> I would suggest to modify strict=True to perform the following: when the schema does not contain any specification about the index, validate that the index is the default pandas index (a rangeindex with no name).

Feel free to open up a PR for this! @smarie @daniel-ene-heni