multimeric / PandasSchema

A validation library for Pandas data frames using user-friendly schemas
https://multimeric.github.io/PandasSchema/
GNU General Public License v3.0

isDistinctValidation(_SeriesValidation) for a combination of columns i.e. as composite keys #2

Closed diegoquintanav closed 3 years ago

diegoquintanav commented 6 years ago

Hi there.

Consider the following schema with fake column names

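from pandas_schema import Column, Schema
from pandas_schema.validation import InRangeValidation, IsDistinctValidation

schema = Schema([
    Column('year', [InRangeValidation(1900, 3000), IsDistinctValidation()]),
    Column('id', [IsDistinctValidation()])
])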

IsDistinctValidation works on top of the pandas Series.duplicated method:

    def validate(self, series: pd.Series) -> pd.Series:
        return ~series.duplicated(keep='first')

Since pandas also provides a duplicated method for DataFrames, would it be possible to define composite columns so that IsDistinctValidation() also checks combinations of columns? I am thinking of an additional parameter **columns, given as a list of columns defined inside the same schema and passed to IsDistinctValidation().
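To illustrate the idea, here is a minimal sketch using plain pandas only. The helper name composite_is_distinct and the sample data are hypothetical and not part of PandasSchema; the sketch only shows that DataFrame.duplicated with a subset gives the composite-key check directly.

```python
import pandas as pd


def composite_is_distinct(df: pd.DataFrame, columns: list) -> pd.Series:
    """Hypothetical helper: True for rows whose combination of `columns`
    has not appeared in an earlier row, False for later repeats."""
    return ~df.duplicated(subset=columns, keep='first')


# Made-up sample data: the last row repeats the (year, id) pair of the first.
df = pd.DataFrame({
    'year': [2000, 2000, 2001, 2000],
    'id': [1, 2, 1, 1],
})

valid = composite_is_distinct(df, ['year', 'id'])
print(valid.tolist())  # [True, True, True, False]
```

Neither column is distinct on its own here, yet only the last row fails, because only the pair (2000, 1) is repeated.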

My current workaround is to insert a temporary column that holds a tuple of the values I want to check, i.e.

df.insert(loc=0, column='composite__year__id', value=list(zip(df.year, df.id)), allow_duplicates=False)

and then in the schema add the column as

Column('composite__year__id',[IsDistinctValidation()])

BTW nice job and thanks!

markusbaden commented 6 years ago

I'd be interested in something like this as well. In general, it would be nice to have validation across columns. I'm not sure, though, what the best way is to generalize the current schema, which is centered on independent columns.

@TMiguelT have you got any suggestions?

multimeric commented 6 years ago

Hmm. This seems like a useful validation to have. I'll have to think about how to handle DataFrame-level validations in terms of the interface.

markusbaden commented 6 years ago

Another one we are using is something like "if col a has value x, then col b needs to have a value in list c", so you would need some sort of constraint that works on the data frame itself. Something like SeriesValidation, but which accepts a DataFrame in validate.
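A rough sketch of that kind of rule in plain pandas follows. The function name conditional_in_list, its parameters, and the sample data are all hypothetical; this is not an existing PandasSchema validation, only an illustration of a check that needs the whole data frame.

```python
import pandas as pd


def conditional_in_list(df: pd.DataFrame, if_col: str, if_value,
                        then_col: str, allowed: list) -> pd.Series:
    """Hypothetical helper: a row passes when the condition does not apply
    to it, or when it applies and `then_col` holds an allowed value."""
    applies = df[if_col] == if_value
    return ~applies | df[then_col].isin(allowed)


# Made-up sample data: the rule is "if a == 'x' then b must be in [1, 2, 3]".
df = pd.DataFrame({'a': ['x', 'x', 'y'], 'b': [1, 5, 5]})

ok = conditional_in_list(df, 'a', 'x', 'b', [1, 2, 3])
print(ok.tolist())  # [True, False, True]
```

The third row passes even though b is 5, because the rule only constrains rows where a equals 'x'.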

multimeric commented 6 years ago

Good point. There's probably a need for a generalised DataFrame-level validation

diegoquintanav commented 6 years ago

(off-topic) @TMiguelT are you expecting contributions? Perhaps a gitter chat?

multimeric commented 6 years ago

I'm happy to have contributions for this or any other feature requests. I've commented on your other PR

quipa commented 6 years ago

I am interested in this enhancement too. In my case I would be using it to check if a total count column is in fact equal to the total of several category count columns. Thanks!
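As a sketch of that check in plain pandas (the helper name total_matches_parts, the column names, and the data are made up for illustration and are not part of PandasSchema):

```python
import pandas as pd


def total_matches_parts(df: pd.DataFrame, total_col: str,
                        part_cols: list) -> pd.Series:
    """Hypothetical helper: True where the row-wise sum of `part_cols`
    equals the value in `total_col`."""
    return df[part_cols].sum(axis=1) == df[total_col]


# Made-up sample data: the second row's total is off by one.
df = pd.DataFrame({
    'cats': [1, 2],
    'dogs': [3, 4],
    'total': [4, 7],
})

ok = total_matches_parts(df, 'total', ['cats', 'dogs'])
print(ok.tolist())  # [True, False]
```

With integer counts an exact comparison is enough; floating-point totals would probably need a tolerance instead.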

multimeric commented 3 years ago

Closing in favour of the more general #57 that I just opened.