multimeric / PandasSchema

A validation library for Pandas data frames using user-friendly schemas
https://multimeric.github.io/PandasSchema/
GNU General Public License v3.0
189 stars 35 forks source link

checking numbers in range and text at the same time #28

Open cnarvaa opened 4 years ago

cnarvaa commented 4 years ago

Hey, I'm very thankful for this library, it makes lots of things easier for us 😄 . Well, at this moment we have to validate that a cell contains "a, -, 4.5". And, if the cell has a number, that number must be between 0 and 10.

The problem is that, since we have some chars there, we can not use the range InRangeValidation. The error says Unable to parse string "-" at position 4. I was able to fix this using

def validate_score_number(number):
    valid_number = False
    try:
        valid_number = 0<=float(number)<=MAX_SCORE_VALUE
    except Exception as e:
        valid_number = False
    return valid_number
CustomElementValidation(validate_score_number, "")

It seems like a very straigth forward answer, but it works. Some advice? should I implement this and publish a PR?

multimeric commented 4 years ago

Hi @cnarvaa. This is mostly a by-product of the fact that I use vectorized operations wherever possible in PandasSchema, meaning that I can't look at each element individually to check if it's a number or not in the same operation. It also means I can't use your exact code because it's not vectorized.

The ideal way I'd want this to work is using an OrValidation, specifically:

InListValidation(['a', '-']) | InRangeValidation(0, MAX_SCORE)

However, it is a problem that the InRangeValidation throws an error here, because that prevents you using an OrValidation. What I (or you, if you have time) need to do is get the InRangeValidation to not fail if it encounters a non-numeric. I think it should use pd.to_numeric(errors='coerce') so that non-numerics are treated as nan when checking if it's in the range, which will always return false.

cnarvaa commented 4 years ago

I would also like to receive the exact reason why a value doesn't accomplish a rule when I use the boolean operators. The error I'm getting is this: (is not in the list of legal options (a, -)) <built-in function or_> (Score value should be between 0 and 10000)) I would like to get, in this case, that the value is not in the legal options.

I was checking the code but I don't get were to change that.

multimeric commented 4 years ago

Under the current version, you can probably do this:

v = InListValidation(['a', '-']) | InRangeValidation(0, MAX_SCORE)
v._custom_message = 'Your custom message'

However that's a bit ugly, and there should be a way of setting a custom message as part of the constructor. What I might aim to do is improve the _CombinedValidation constructor (which is the class that's implicitly created when you use the |) so that it takes a message, and you can do this:

_CombinedValidation(InListValidation(['a', '-']),  InRangeValidation(0, MAX_SCORE), operator.or_, message='Your custom message')

Then I'll have to actually add _CombinedValidation to the documentation, though it's currently hidden. I'll look into this once I've finished my large rewrite, but you if desperately need this fix soon then I would accept a PR.

cnarvaa commented 4 years ago

Well, the main idea is to know the exact reason (condition) why Validation is false. Using a generic message is not correct for us in this moment.

Maybe something in between the moment we get the first false? Let me check for that constructor.

multimeric commented 4 years ago

Yes, I understand you want a message that explains why the validation failed. But you can specify anything in the custom message:

v._custom_message = 'was not in the range (0, 100) and also was not one of the allowed characters "a" or "-"'

Maybe something in between the moment we get the first false?

Can you explain what you mean by this?

cnarvaa commented 4 years ago

Hey Miguel, So, imagine we are using 10 rules for a single column but even if only one rule is broke, all the rules are displayed.

We are getting this message as validation error (is not in the list of legal options (a, -)) <built-in function or_> (Score value should be between 0 and 10000)) Suppose we have a score = 11000. We want the message to say Score value should be between 0 and 10000, Without including all the possible validations.

Using a custom message, like the example you show me, is not pretty clear for the users. It's more clear if they know exactly what they did wrong. We are doing some tricks on our side but I think this could be a good feature.

I will check the _CombinedValidation function and then I can share an Idea.

Edit

I just checked the logic in _CombinedValidation.validate, it will be pretty hard to get the exact values that result in an error. The operator result is a Set, we can get the error indexes with something like ~operator_results and then get the v_a and v_b results for those indexes. Finally, we can save the necessary indexes and keep going on with new and complicated logic 😢.

Perhaps, even if we can set something like self.detailed_message to show what had failed, It will be very difficult in a case like (A | B) & (C & D) because for each comparison we will need to store some states or messages.

I don't know if you have any idea about this. By now, I'm letting it go!

multimeric commented 4 years ago

Ah okay, I see your point. You want the message to only mention validations that actually failed, not the others that passed. I agree with that, for what it's worth. Since it's a bit tricky to do, it's something I'm endeavouring to fix in the next major version, which is currently in the form of this PR: https://github.com/TMiguelT/PandasSchema/pull/29/files.

multimeric commented 4 years ago

Update: this is half fixed by #30, but it still returns both validation messages if only one validation fails, so this is still an ongoing issue.