Check dataframe size? - Githubissues

Veganveins commented 3 years ago

Question about pandera

If I create a DataFrameSchema, is there a way to create a "check" that will tell me whether any input data frame I want to validate has at least a specified number of rows? I tried checking the documentation but I couldn't see anything explicit. If anyone has a hint for how to specify a schema, such that it expects valid inputs to have say at least 1,000 rows, that would be a big help!

In a similar vein, is there a way to add a check for whether the input dataframe is not "empty" ?

Thanks in advance for your help :)

Veganveins commented 3 years ago

I suppose you could accomplish this with something like:

check_min_obs = pa.Check(lambda s: len(s) > 24 * 30, element_wise=False)

might there be a cleaner way to do it, too?

jeffzi commented 3 years ago

You can create dataframe-wide checks with the checks argument of DataFrameSchema.__init__, or with the dataframe_check decorator for the class-based API.

It's true that the documentation should have an example of dataframe-wide checks for the "regular" api.

import pandas as pd
import pandera as pa
from pandera.typing import Series

schema = pa.DataFrameSchema(
    {"A": pa.Column(pa.Int)},
    strict=True,
    coerce=True,
    checks=pa.Check(lambda df: df.shape[0] > 5, name="size > 5"),
)
df = pd.DataFrame({"A": [1, 2]})
try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    print(err.failure_cases)
#>     schema_context column     check  check_number  failure_case index
#> 0  DataFrameSchema   None  size > 5             0         False  None

## class-based API
class Schema(pa.SchemaModel):
    A: Series[float]

    @pa.dataframe_check
    def not_empty(cls, df):
        return not df.empty

empty_df = pd.DataFrame({"A": []})
try:
    Schema.validate(empty_df, lazy=True)
except pa.errors.SchemaErrors as err:
    print(err.failure_cases)
#>     schema_context column      check  check_number  failure_case index
#> 0  DataFrameSchema   None  not_empty             0         False  None

Created on 2021-01-12 by the reprexpy package

Veganveins commented 3 years ago

thank you! this is perfect

cosmicBboy commented 3 years ago

hi @Veganveins, glad you got your Q addressed! Yeah the documentation has this sort of tucked away in the Checks section: https://pandera.readthedocs.io/en/stable/checks.html#wide-checks

I'm wondering whether row-count checks (n_rows, min_rows, max_rows) should be:

1. built-in Checks
2. keyword arguments in DataFrameSchema.__init__

(2) would probably be better, since it makes more sense to specify this at the dataframe-level vs. at the column level, and would be better-supported in concert with data synthesis strategies. It'd also be relatively easy to extend to SchemaModels

pa.DataFrameSchema(..., n_rows = <int>, min_rows = <int>, max_rows = <int>)

class Schema(pa.SchemaModel):
    class Config:
        n_rows = <int>
        min_rows = <int>
        max_rows = <int>

Veganveins commented 3 years ago

ahh yeah (2) would be pretty slick! are you thinking that someone could open a PR to contribute what you're describing above as a new feature?

jeffzi commented 3 years ago

One issue with (2) is that it could bloat the API, especially SchemaModel. We could later imagine other checks that apply dataframe-wide, so where do we draw the line?

Requiring a non-empty DataFrame (min_rows = 1) is fairly common but what about n_rows and max_rows?

We suggest introducing built-in checks for dataframes, similar to the built-in column checks:

- pa.Check.MinShape(cls, shape: Union[Tuple[int], int]): shape can be a tuple for DataFrames or an int for Series (same as the numpy API). That check would also work on SeriesSchemas.
- pa.Check.MaxShape: same idea as above
- pa.Check.NotEmpty: also works on DataFrames and Series

pa.DataFrameSchema(..., checks = pa.Check.NotEmpty())

class Schema(pa.SchemaModel):
    class Config:
        checks = [pa.Check.NotEmpty()] # new

^ Maybe a bit verbose, we could perhaps have a not_empty shortcut in DataFrameSchema.__init__ if this use-case seems very frequent?

Note: Regarding SchemaModel, pydantic has another mechanism that wouldn't work for built-in checks (pydantic doesn't have any).

cosmicBboy commented 3 years ago

Cool, I made https://github.com/pandera-dev/pandera/issues/383 to discuss this further and will close this ticket. @Veganveins feel free to chime in there if you have any thoughts!

unionai-oss / pandera

Check dataframe size? #382
