multimeric / PandasSchema

A validation library for Pandas data frames using user-friendly schemas
https://multimeric.github.io/PandasSchema/
GNU General Public License v3.0
189 stars 35 forks source link

Contingent validation & optional columns #31

Open kotarf opened 4 years ago

kotarf commented 4 years ago

There is a use case for validation where we want to contingently validate a column. We might not require such a column, but will want to validate it if it happens to be provided. This would come up when PandasSchema is used in an 'ETL' pipeline where data is submitted by users or 3rd parties.

I don't think _allowempty has this behavior so I introduced a new parameter. It's close but it has subtly different behavior.

This might go against the design of the library in that all provided columns are considered required in the Schema. In which case, I guess this feature doesn't make sense.

Otherwise, I don't think this PR is ready to go (eg. more tests and the logic might be off) but I'm happy to continue it so let me know what you think. Thx

multimeric commented 4 years ago

Hi @kotarf, thanks for the PR. I can appreciate your use-case and would like to support it. However, considering that I'm in the middle of a pseudo-rewrite that will better support this need, I'm hesitant to work through a PR too much. That said, if you need this functionality soon, it doesn't look like a huge PR and I'm happy to review it.

kotarf commented 4 years ago

Hi @kotarf, thanks for the PR. I can appreciate your use-case and would like to support it. However, considering that I'm in the middle of a pseudo-rewrite that will better support this need, I'm hesitant to work through a PR too much. That said, if you need this functionality soon, it doesn't look like a huge PR and I'm happy to review it.

Nope, not really urgent. I also had to add a fix; the logic should be correct now.

I'd like to work against this re-write. Is it being worked on in #29?

multimeric commented 4 years ago

It is being worked on there, but it's all in so much flux I wouldn't recommend merging that branch in right now. Broadly speaking, the PR decouples Validations from Series, so your problem of optional Series this is naturally something that I'll have to solve anyway.

If it's not urgent, I'll check in here after I've finished that branch.

kotarf commented 4 years ago

It is being worked on there, but it's all in so much flux I wouldn't recommend merging that branch in right now. Broadly speaking, the PR decouples Validations from Series, so your problem of optional Series this is naturally something that I'll have to solve anyway.

If it's not urgent, I'll check in here after I've finished that branch.

That's fine, I actually have another potential feature that's slightly more involved. I'll try to get up to date with the rework and code against that. Let me know if this PR becomes irrelevant and I'll close it out.

whege commented 3 years ago

Hi, wanted to check on the status of the PR since I see it has been a while since any discussion and I currently have a use case for this feature in a project. If it is still being worked on, I am happy to help contribute if needed to help complete the PR.

multimeric commented 3 years ago

Unfortunately it's in the same place as before, because I've been quite busy. If you or anyone else wants to expedite this, you could try to implement this by writing a PR against my 1.0 branch: https://github.com/TMiguelT/PandasSchema/tree/bitwise.

Unfortunately I haven't written a comprehensive changes document, that will come later. However the docstrings are generally up-to-date, so that should help you in writing this. Basically the main changes in that branch that impacts this feature are:

So to implement the behaviour mentioned in the OP, you would have your ordinary validation set, and-ed together, then you would or it with a validator that passes when the column doesn't exist, or-ed with a validator that passes when all of the column is empty. e.g. (CanConvertValidation(int) & InRangeValidation(min=1, max=8)) | IsAbsentValidation() | AllEmptyValidation().

Then you have to implement the IsAbsentValidation() such that it returns True if the index doesn't exist in the DataFrame (yes really a scalar True, not a Series, we use numpy broadcasting logic to handle this), and also the AllEmptyValidation() which returns True if the entire Series is empty. Note, this differs from the existing IsEmptyValidation, which returns a boolean Series of True and False depending on whether each cell in that Series is empty.

Let me know if you have any issues or need any clarifications.