unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

check_output: option to turn off checks if output is empty #332

Open ClaireGouze opened 3 years ago

ClaireGouze commented 3 years ago

I'm using the check_output function to check the columns & dtypes of a DataFrameSchema. My function's output can sometimes be an empty dataframe, which raises a SchemaError, though I would want no error.

Would it be possible to have an option in the check_output function so that no error is raised if the output is empty? Or in the DataFrameSchema?

Thank you!

cosmicBboy commented 3 years ago

Thanks for submitting this feature request @ClaireGouze!

I think this use case should be supported; here are a few potential solutions:

  1. add an allow_empty property to the DataFrameSchema and SeriesSchema initializers, such that empty dataframes can pass through without raising a SchemaError. This is nice because it would then cover the check_input case as well.
  2. add an optional option to the check_* decorators, resulting in the same behavior.

I'm leaning toward (1), mainly because (2) sort of conflicts with the semantics of Optional[<TYPE>] in the typing module, which implies that the value can be either None or the specified <TYPE>. allow_empty, on the other hand, would hold a pandas-specific meaning, which is conceptually cleaner than overloading the "optional" terminology.
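For illustration, the schema-level behavior in (1) can be approximated in user code today. A minimal sketch; validate_unless_empty is a hypothetical helper, not part of pandera's API:

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({"A": pa.Column(int)})

def validate_unless_empty(df: pd.DataFrame) -> pd.DataFrame:
    # Skip validation entirely when the frame has no rows; the proposed
    # allow_empty option would likely still check column names and dtypes.
    if df.empty:
        return df
    return schema.validate(df)

validate_unless_empty(pd.DataFrame({"A": []}))  # ok, no SchemaError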

Let me know what you think!

jeffzi commented 3 years ago

I'm using the check_output function to check column & datatypes of the DataFrameSchema

If you don't have explicit checks, i.e. just checking column names and types, you could set coerce=True. Obviously, whether that's acceptable depends on your project.

import pandera as pa
import pandas as pd

schema = pa.DataFrameSchema({"A": pa.Column(int)})

@pa.check_output(schema)
def make_empty() -> pd.DataFrame:
    return pd.DataFrame({"A": []})

try:
    make_empty()  # fails
except pa.errors.SchemaError as ex:
    print(ex)
#> error in check_output decorator of function 'make_empty': expected series 'A' to have type int64, got float64

schema_coerced = pa.DataFrameSchema({"A": pa.Column(int)}, coerce=True)

@pa.check_output(schema_coerced)
def make_empty_coerced() -> pd.DataFrame:
    return pd.DataFrame({"A": []})

make_empty_coerced()  # ok
#> Empty DataFrame
#> Columns: [A]
#> Index: []

Created on 2020-11-25 by the reprexpy package

If the DataFrame is empty, we can only validate names and types. I think an allow_empty argument should still validate types. Pandera could offer a helper method DataFrameSchema.coerce_dtypes() to let the user coerce locally when the DataFrame is empty. That way, coerce can be kept False globally if that's desirable.
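Until such a helper exists, that local coercion can be sketched with plain pandas astype. A minimal sketch; coerce_if_empty is a hypothetical name and the dtype mapping from the schema is an assumption:

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({"A": pa.Column(int)})

def coerce_if_empty(df: pd.DataFrame) -> pd.DataFrame:
    # Coerce dtypes locally only when the frame is empty, so the
    # schema itself can keep coerce=False.
    if df.empty:
        return df.astype({name: str(col.dtype) for name, col in schema.columns.items()})
    return df

schema.validate(coerce_if_empty(pd.DataFrame({"A": []})))  # ok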

Regarding solution 2, one problem is that you would allow empty DataFrames locally, but later validations could fail if optional=True was not set further down the pipeline. Moreover, DataFrameSchema.validate() would also need an optional argument if we want to keep a 1:1 mapping with the decorators' functionality.

cosmicBboy commented 3 years ago

I think an argument allow_empty should still validate types.

👍

ClaireGouze commented 3 years ago

Thanks for your reply, I think the solution (1) you mentioned would be suitable.

If you don't have explicit checks, i.e. just checking column names and types, you could set coerce=True. Obviously, whether that's acceptable depends on your project.

This would be a good solution, but if the output is just an empty dataframe with no columns at all, it will still fail.
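A minimal repro of that remaining failure, reusing the coerced schema from the example above; coercion cannot conjure a missing column:

import pandas as pd
import pandera as pa

schema_coerced = pa.DataFrameSchema({"A": pa.Column(int)}, coerce=True)

try:
    schema_coerced.validate(pd.DataFrame())  # no columns at all
except pa.errors.SchemaError as ex:
    print(ex)
#> column 'A' not in dataframe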

jeffzi commented 3 years ago

What you are asking for is actually to completely disable validation.

I propose to introduce both arguments:

  1. Argument allow_empty for DataFrameSchema/SeriesSchema, which would still check names and types on empty DataFrames. Example use cases are dry runs, or reading from a source that can be empty. The semantics: we processed the data successfully, but the output is empty.

  2. Argument optional for all check decorators, which disables validation when a None object is passed. That behavior would align with typing.Optional. The semantics are slightly different from (1): it signals that we could not process the DataFrame, but that's within expectations, so we do not want to raise an error.

SchemaModel, coupled with the check_types decorator, already implements (2):

import pandera as pa
from pandera.typing import Series, DataFrame
import pandas as pd
from typing import Optional

class Schema(pa.SchemaModel):
    A: Series[int]

@pa.check_types()
def make_empty() -> Optional[DataFrame[Schema]]:
    return pd.DataFrame()

try:
    make_empty()  # fails
except pa.errors.SchemaError as ex:
    print(ex)
#> error in check_types decorator of function 'make_empty': column 'A' not in dataframe
#> Empty DataFrame
#> Columns: []
#> Index: []

@pa.check_types()
def maybe_df() -> Optional[DataFrame[Schema]]:
    return None

maybe_df() # ok

Created on 2020-11-26 by the reprexpy package

cosmicBboy commented 3 years ago

I think the allow_empty option at the schema-level and optional option for object-based API check_* decorators makes sense.

For the latter, I'm thinking something like this:

import pandas as pd
import pandera as pa

from typing import Optional

schema = pa.DataFrameSchema({
    "col": pa.Column(int)
})

@pa.check_input(schema, optional=True)
def check_input_transform(df):  # or None
    return df

@pa.check_output(schema, optional=True)
def check_output_transform(df):
    return df  # or None

@pa.check_io(df=schema, out=schema, optional={"df": True, "out": True})
def check_io_transform(df):
    return df  # or None

@pa.check_io(
    df=schema, out=(1, schema), optional={"df": True, "out": {1: True}}
)
def check_tuple_output_transform(df):  # or None
    return "foo", df  # or None

@pa.check_io(
    df=schema, out=("bar", schema), optional={"df": True, "out": {"bar": True}}
)
def check_mapping_output_transform(df):  # or None
    return {
        "foo": 1,
        "bar": df,  # or None
    }

This would be a good solution, but if the output is just an empty dataframe with no columns at all, it will still fail.

@ClaireGouze can you provide example code for your use case? I'm trying to wrap my head around the case where a function returns an empty dataframe with no columns; my intuition is that such a function should return None instead of pd.DataFrame().

cosmicBboy commented 3 years ago

Going to work on this after the 0.6.0 release, which should be out next week.

ndepaola commented 1 year ago

What's the status of this issue? At my work, we have a data manipulation function that returns a dataframe which should follow a schema, and we use check_types to validate the dataframe against the schema. However, the validator fails when the dataframe is empty, even though an empty dataframe is a valid output of the function: a column that's typically typed as float gets the pandas dtype object when the dataframe is empty. We can work around this in the short term by coercing the type on that column, but this will continue to cause issues for us going forward.
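For reference, that short-term workaround can look like this with check_types; the column name value is an assumption:

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series

class Schema(pa.SchemaModel):
    # coerce=True works around the object dtype that empty frames can get.
    value: Series[float] = pa.Field(coerce=True)

@pa.check_types
def transform() -> DataFrame[Schema]:
    # Built with only column labels, "value" defaults to dtype object.
    return pd.DataFrame(columns=["value"])

transform()  # ok: the empty object column is coerced to float64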

einarjohnson commented 10 months ago

+1 on this; I'm also facing this issue when empty dataframes are being used. Is the suggested solution in the current version of pandera to use the required keyword and set it to False for all columns? https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#required
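As a point of reference, required=False only marks a column as allowed to be absent entirely. A minimal sketch of what that buys you:

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({"A": pa.Column(int, required=False)})

# A frame without the column (including a fully empty frame) passes,
# because "A" is allowed to be missing.
schema.validate(pd.DataFrame())  # ok

Note that it does not relax the dtype check when the column is present but empty.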

arkban commented 3 months ago

I ran into this as well.

It's an extra-confusing error because the error message implies that the table is not empty; at least that's the error I get:

expected series 'xxx' to have type datetime64[ns], got object
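A minimal reproduction of how an empty frame can trigger this, assuming a datetime column named ts:

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({"ts": pa.Column("datetime64[ns]")})

# A frame built with only column labels defaults every column to object,
# so the dtype check fails even though there are zero rows to validate.
try:
    schema.validate(pd.DataFrame(columns=["ts"]))
except pa.errors.SchemaError as ex:
    print(ex)
#> expected series 'ts' to have type datetime64[ns], got object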