multimeric / PandasSchema

A validation library for Pandas data frames using user-friendly schemas
https://multimeric.github.io/PandasSchema/
GNU General Public License v3.0

Data Frame validation megaissue #57

Open multimeric opened 3 years ago

multimeric commented 3 years ago

There is a general need for validations with a scope wider than just a single series. This includes DataFrame level validations, as well as multiple dependent Series validations, such as "ensure each row is distinct, using two columns".

I am aware of this need, and am (very slowly) working on this feature in this branch: https://github.com/TMiguelT/PandasSchema/tree/bitwise. However, progress has been slow, as I don't have a lot of time to devote to this project.

I have made this issue so that I can close the duplicate issues with slightly different requests that ultimately come down to this.

praveentiru commented 3 years ago

I needed composite key validation: I had to validate that all rows are unique when two columns are combined. I created a custom validation to address this. Its constructor looks like this: CompositeDistinctValidation(sibling=source['Sales Order Line Number'])

Here, I am passing the other column's series as input. If this signature is OK, I can share the code. Otherwise, let me know if you have any other thoughts. I can work on a PR for this.
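For readers who need this check before anything lands in the library, here is a minimal sketch of the composite-key idea in plain pandas. It is not the CompositeDistinctValidation code mentioned above, which was not posted; the helper name composite_key_duplicates and the "Sales Order Number" column are hypothetical, chosen only for illustration.

```python
import pandas as pd


def composite_key_duplicates(df, key_columns):
    """Return the rows whose combination of key_columns occurs more than once.

    Hypothetical helper, not part of PandasSchema.
    """
    # keep=False marks every member of a duplicate group, not just the repeats
    mask = df.duplicated(subset=key_columns, keep=False)
    return df[mask]


# Illustrative data; "Sales Order Number" is an assumed column name
orders = pd.DataFrame({
    "Sales Order Number": ["SO1", "SO1", "SO2", "SO1"],
    "Sales Order Line Number": [1, 2, 1, 1],
})

duplicates = composite_key_duplicates(
    orders, ["Sales Order Number", "Sales Order Line Number"]
)
print(duplicates.index.tolist())  # rows 0 and 3 share the key (SO1, 1)
```

An empty result means every row is distinct on the chosen columns. Passing the column names rather than a sibling series keeps the check tied to a single data frame, which may or may not suit the signature proposed above.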

vovavili commented 2 years ago

Well, the absence of this feature makes me sad. While it is unavailable, does anyone have an idea how to validate a value in a pandas dataframe based on the value of another field in the same row? Is biting the bullet and using the painfully slow df.iterrows the only way to do this?

Maybe we can set up some sort of collective bounty system to get this megaissue going? I'd be willing to shell out 10 or 15 euro personally.
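On the question of validating one field against another in the same row: outside of PandasSchema, this can usually be done with boolean masks instead of df.iterrows. The sketch below assumes a rule of the form "if the condition holds for a row, the requirement must hold too"; the helper rows_violating and the status / tracking_id columns are made up for illustration.

```python
import pandas as pd


def rows_violating(df, condition, requirement):
    """Return index labels of rows where condition is True but requirement is False.

    condition and requirement are boolean Series aligned with df.
    Hypothetical helper, not part of PandasSchema.
    """
    bad = condition & ~requirement
    return df.index[bad].tolist()


# Illustrative data: a shipped order must have a tracking id
df = pd.DataFrame({
    "status": ["shipped", "pending", "shipped"],
    "tracking_id": ["T1", None, None],
})

bad_rows = rows_violating(
    df,
    condition=df["status"] == "shipped",
    requirement=df["tracking_id"].notna(),
)
print(bad_rows)  # only the last row is shipped without a tracking id
```

Both masks are computed column-wise, so the check avoids a Python-level loop over rows.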

vovavili commented 2 years ago

> I needed composite key validation: I had to validate that all rows are unique when two columns are combined. I created a custom validation to address this. Its constructor looks like this: CompositeDistinctValidation(sibling=source['Sales Order Line Number'])
>
> Here, I am passing the other column's series as input. If this signature is OK, I can share the code. Otherwise, let me know if you have any other thoughts. I can work on a PR for this.

I am a bit confused, sorry. What does your source dataframe look like? What is the output of this custom validator? How would you use this code to, say, resolve the issue outlined in the example from #55?