**ClaireGouze** opened this issue 3 years ago
Thanks for submitting this feature request @ClaireGouze!
I think this use case should be supported. Here are a couple of potential solutions:

1. Add an `allow_empty` property to the `DataFrameSchema` and `SeriesSchema` initializers, such that empty dataframes can pass through without raising a `SchemaError`. This is nice because it would then cover the `check_input` case as well.
2. Add an `optional` option to the `check_*` decorators, resulting in the same behavior.

I'm leaning toward (1), mainly because (2) sort of conflicts with the semantics of `Optional[<TYPE>]`: in the `typing` module it implies that the value can be either `None` or the `<TYPE>` specified. `allow_empty`, on the other hand, would hold a pandas-specific meaning, which is conceptually cleaner than overloading the "optional" terminology.
Let me know what you think!
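In the meantime, option (1) can be approximated outside the library. A minimal sketch with plain pandas, assuming any validation callable — the `allow_empty_output` decorator and `require_int_column_a` validator below are hypothetical stand-ins, not pandera API:

```python
import functools
import pandas as pd

def allow_empty_output(validate):
    """Hypothetical sketch of option (1): validate a function's DataFrame
    output only when it is non-empty; empty frames pass through untouched."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            out = fn(*args, **kwargs)
            if isinstance(out, pd.DataFrame) and out.empty:
                return out  # skip validation entirely for empty outputs
            validate(out)  # raise on schema violations otherwise
            return out
        return wrapper
    return decorator

# Toy validator standing in for DataFrameSchema.validate
def require_int_column_a(df):
    if df["A"].dtype != "int64":
        raise TypeError("expected int64 column 'A'")

@allow_empty_output(require_int_column_a)
def make_empty():
    # dtype of an empty column would fail the check, but emptiness short-circuits it
    return pd.DataFrame({"A": []})

make_empty()  # ok: validation skipped because the frame is empty
```

The built-in version would of course live inside the schema rather than a wrapper, but the short-circuit-on-empty logic is the same idea.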
> I'm using the check_output function to check column & datatypes of the DataFrameSchema
If you don't have explicit checks, i.e. just checking column names and types, you could set `coerce=True`. Obviously, whether that's acceptable depends on your project.
```python
import pandera as pa
import pandas as pd

schema = pa.DataFrameSchema({"A": pa.Column(int)})

@pa.check_output(schema)
def make_empty() -> pd.DataFrame:
    return pd.DataFrame({"A": []})

try:
    make_empty()  # fails
except pa.errors.SchemaError as ex:
    print(ex)
#> error in check_output decorator of function 'make_empty': expected series 'A' to have type int64, got float64

schema_coerced = pa.DataFrameSchema({"A": pa.Column(int)}, coerce=True)

@pa.check_output(schema_coerced)
def make_empty_coerced() -> pd.DataFrame:
    return pd.DataFrame({"A": []})

make_empty_coerced()  # ok
#> Empty DataFrame
#> Columns: [A]
#> Index: []
```
Created on 2020-11-25 by the reprexpy package
If the DataFrame is empty, we can only validate names and types. I think an `allow_empty` argument should still validate types. Pandera could offer a helper method `DataFrameSchema.coerce_dtypes()` to let the user coerce locally when the DataFrame is empty. That way `coerce` can be kept `False` globally if that's desirable.

Regarding solution 2., one problem is that you would allow empty DataFrames locally, but later validations could fail if `optional=True` was not set further down the pipeline. Moreover, `DataFrameSchema.validate()` would also need an `optional` argument if we want to keep a 1:1 mapping with the decorator functionality.
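For context on why empty frames fail type checks at all, and what such a local coercion helper would amount to, here is a plain-pandas sketch (no pandera required) — `coerce_dtypes()` is only proposed above, so a plain `astype` stands in for it:

```python
import pandas as pd

# An empty DataFrame built from a column list gets dtype "object" for every
# column, which is why a schema expecting e.g. int64 fails even with no rows.
empty = pd.DataFrame(columns=["A"])
print(empty["A"].dtype)  # object

# Local coercion -- roughly what the proposed DataFrameSchema.coerce_dtypes()
# helper would do. Casting an empty column is always lossless, so dtypes can
# be fixed up locally while keeping coerce=False globally.
coerced = empty.astype({"A": "int64"})
print(coerced["A"].dtype)  # int64
```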
> I think an argument allow_empty should still validate types.

👍
Thanks for your reply. I think solution #1 you mentioned would be suitable.
> If you don't have explicit checks, i.e. just checking column names and types, you could set `coerce=True`. Obviously, whether that's acceptable depends on your project.
This would be a good solution, but if the output is just an empty dataframe with no column names, it will still fail.
What you are asking for is actually to completely disable validation.
I propose to introduce both arguments:

1. Argument `allow_empty` for `DataFrameSchema`/`SeriesSchema`, which still checks names and types on empty DataFrames. Example use cases are dry runs or reading from a source that can be empty. The semantics are that we processed the data successfully but the output is empty.
2. Argument `optional` for all check decorators, which disables validation when a `None` object is passed. That behavior would be aligned with `typing.Optional`. The semantics are slightly different from 1.: it would signal that we could not process the DataFrame, but that's within expectations, therefore we do not want to raise an error.
`SchemaModel` coupled with the `check_types` decorator already implements 2.:
```python
import pandera as pa
from pandera.typing import Series, DataFrame
import pandas as pd
from typing import Optional

class Schema(pa.SchemaModel):
    A: Series[int]

@pa.check_types()
def make_empty() -> Optional[DataFrame[Schema]]:
    return pd.DataFrame()

try:
    make_empty()  # fails
except pa.errors.SchemaError as ex:
    print(ex)
#> error in check_types decorator of function 'make_empty': column 'A' not in dataframe
#> Empty DataFrame
#> Columns: []
#> Index: []

@pa.check_types()
def maybe_df() -> Optional[DataFrame[Schema]]:
    return None

maybe_df()  # ok
```
Created on 2020-11-26 by the reprexpy package
I think the `allow_empty` option at the schema level and the `optional` option for the object-based API `check_*` decorators make sense.
For the latter, I'm thinking something like this:
```python
import pandas as pd
import pandera as pa
from typing import Optional

schema = pa.DataFrameSchema({
    "col": pa.Column(int)
})

@pa.check_input(schema, optional=True)
def check_input_transform(df):  # or None
    return df

@pa.check_output(schema, optional=True)
def check_output_transform(df):
    return df  # or None

@pa.check_io(df=schema, out=schema, optional={"df": True, "out": True})
def check_io_transform(df):
    return df  # or None

@pa.check_io(
    df=schema, out=(1, schema), optional={"df": True, "out": {1: True}}
)
def check_tuple_output_transform(df):  # or None
    return "foo", df  # or None

@pa.check_io(
    df=schema, out=("bar", schema), optional={"df": True, "out": {"bar": True}}
)
def check_mapping_output_transform(df):  # or None
    return {
        "foo": 1,
        "bar": df,  # or None
    }
```
> This would be a good solution but if the output is just an empty dataframe with no column name, it will still fail.
@ClaireGouze can you provide example code for your use case? I'm trying to wrap my head around the case where a function returns an empty dataframe with no columns; in that case my intuition is that the function should return `None` instead of `pd.DataFrame()`.
Going to work on this after the `0.6.0` release, which should be out next week.
What's the status of this issue? At my work, we have a data manipulation function that returns a dataframe that should follow a schema, and we use `check_types` to validate the dataframe against the schema. However, the validator fails when the dataframe is empty (an empty dataframe is a valid output from the function). A column that's typically typed as float gets the pandas dtype `object` when the dataframe is empty. We can work around this in the short term by coercing the type on that column, but this will continue to cause issues for us going forward.
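A plain-pandas workaround for that short-term case is to construct the empty result with explicit dtypes, so columns never fall back to `object` in the first place. A sketch (the function and column names here are made up for illustration):

```python
import pandas as pd

def make_empty_typed() -> pd.DataFrame:
    # Building each column from an explicitly typed empty Series keeps the
    # declared dtypes, so a dtype-based schema check can still pass even
    # though the frame has zero rows.
    return pd.DataFrame({
        "value": pd.Series(dtype="float64"),
        "when": pd.Series(dtype="datetime64[ns]"),
    })

df = make_empty_typed()
print(df["value"].dtype)  # float64
print(df["when"].dtype)   # datetime64[ns]
```

This only helps when you control how the empty frame is built; frames that become empty through filtering usually keep their dtypes already.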
+1 on this; also facing this issue when empty dataframes are being used. Is the suggested solution in the current version of pandera to use the `required` keyword and set it to false for all columns? https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#required
I ran into this as well. It's an extra-confusing error because the error message implies that the table is not empty; at least the error I get is:

```
expected series 'xxx' to have type datetime64[ns], got object
```
I'm using the `check_output` function to check the columns & datatypes of the `DataFrameSchema`. My function output can sometimes be an empty dataframe and thus raises a `SchemaError`, though I would want no error.

Would it be possible to have an option in the `check_output` function so that no error is raised if the output is empty? Or in the `DataFrameSchema`?

Thank you!