Closed: Veganveins closed this issue 3 years ago
I suppose you could accomplish this with something like `check_min_obs = pa.Check(lambda s: len(s) > 24 * 30, element_wise=False)`. Might there be a cleaner way to do it, too?
You can create dataframe-wide checks with the `checks` argument on `DataFrameSchema.__init__`, or with the `dataframe_check` decorator for the class-based API. It's true that the documentation should have an example of dataframe-wide checks for the "regular" API.
```python
import pandas as pd
import pandera as pa
from pandera.typing import Series

schema = pa.DataFrameSchema(
    {"A": pa.Column(pa.Int)},
    strict=True,
    coerce=True,
    checks=pa.Check(lambda df: df.shape[0] > 5, name="size > 5"),
)

df = pd.DataFrame({"A": [1, 2]})

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    print(err.failure_cases)
#>     schema_context column     check  check_number failure_case index
#> 0  DataFrameSchema   None  size > 5             0        False  None
```
```python
## class-based API
class Schema(pa.SchemaModel):
    A: Series[float]

    @pa.dataframe_check
    def not_empty(cls, df):
        return not df.empty

empty_df = pd.DataFrame({"A": []})

try:
    Schema.validate(empty_df, lazy=True)
except pa.errors.SchemaErrors as err:
    print(err.failure_cases)
#>     schema_context column      check  check_number failure_case index
#> 0  DataFrameSchema   None  not_empty             0        False  None
```
Created on 2021-01-12 by the reprexpy package
thank you! this is perfect
hi @Veganveins, glad you got your Q addressed! Yeah the documentation has this sort of tucked away in the Checks section: https://pandera.readthedocs.io/en/stable/checks.html#wide-checks
I'm wondering whether row-count checks (`n_rows`, `min_rows`, `max_rows`) should be:

1. built-in `Check`s, or
2. arguments to `DataFrameSchema.__init__`

(2) would probably be better, since it makes more sense to specify this at the dataframe level vs. at the column level, and it would be better supported in concert with data synthesis strategies. It'd also be relatively easy to extend to `SchemaModel`s:
```python
pa.DataFrameSchema(..., n_rows=<int>, min_rows=<int>, max_rows=<int>)

class Schema(pa.SchemaModel):
    class Config:
        n_rows = <int>
        min_rows = <int>
        max_rows = <int>
```
ahh yeah (2) would be pretty slick! are you thinking that someone could open a PR to contribute what you're describing above as a new feature?
One issue with (2) is that it could bloat the API, especially `SchemaModel`. Afterwards we could imagine other checks that could be done dataframe-wide; where do we draw the line? Requiring a non-empty DataFrame (`min_rows = 1`) is fairly common, but what about `n_rows` and `max_rows`?
We suggest introducing built-in checks for DataFrames, similar to the built-in column checks:

- `pa.Check.MinShape(cls, shape: Union[Tuple[int], int])`: `shape` can be a tuple for DataFrames or an int for Series (same as the numpy API). That check would also work on `SeriesSchema`s.
- `pa.Check.MaxShape`: same idea as above.
- `pa.Check.NotEmpty`: also works on DataFrames and Series.

```python
pa.DataFrameSchema(..., checks=pa.Check.NotEmpty())
```
```python
class Schema(pa.SchemaModel):
    class Config:
        checks = [pa.Check.NotEmpty()]  # new
```

^ Maybe a bit verbose; we could perhaps have a `not_empty` shortcut in `DataFrameSchema.__init__` if this use case seems very frequent?
Note: regarding `SchemaModel`, pydantic has another mechanism that wouldn't work for built-in checks (pydantic doesn't have any).
Cool, I made https://github.com/pandera-dev/pandera/issues/383 to discuss this further and will close this ticket. @Veganveins feel free to chime in there if you have any thoughts!
Question about pandera
If I create a `DataFrameSchema`, is there a way to create a check that will tell me whether any input data frame I want to validate has at least a specified number of rows? I tried checking the documentation but I couldn't see anything explicit. If anyone has a hint for how to specify a schema such that it expects valid inputs to have, say, at least 1,000 rows, that would be a big help!
In a similar vein, is there a way to add a check for whether the input dataframe is not empty?
Thanks in advance for your help :)