multimeric / PandasSchema

A validation library for Pandas data frames using user-friendly schemas
https://multimeric.github.io/PandasSchema/
GNU General Public License v3.0
189 stars 35 forks source link

Added validation method IsTypeValidation #56

Open chrispijo opened 3 years ago

chrispijo commented 3 years ago

Proposed new method to solve the issue with the validation method IsDTypeValidation. This method also indicates which rows are not valid.

chrispijo commented 3 years ago

I do not yet know how to link this pull request to the existing issue. But it concerns issue #39.

multimeric commented 3 years ago

Hmm, I'm not sure I like this approach. Pandas series are inherently all the same type (unless dtype=object), so it's wasteful to check each element when we can instead just look at the dtype of the series. Even if you were looking at each element I would advise against using a loop, and opt for a vectorised operation.

chrispijo commented 3 years ago

I was indeed looking at every element such that it returns validation messages for the specific cells that fail. If a series is object, it is clear that there are one or more inconsistencies, but it would help if we'd know the specific cells. Why wouldn't you want to include that?

If adding this functionality to the package is not preferred, that is okay. We can define the class separately and add it to the Column()-object.

I know see that using bool_series = series.apply(type) == int is faster than a for-loop (4.6x in my test). Not sure how to compare series.apply(type) to a list (e.g. [int, float]) however.

multimeric commented 3 years ago

The justification for the feature is fine, but you would need to make it clear in the docstring that this validation only makes sense for an object series, you would need to check that the Series is an object series, and also as I said I would much prefer vectorised operations. Look at pandas.Series.isin for your use-case.

chrispijo commented 3 years ago

I looked at your vectorization remark but I do not know how to streamline it more. It is now encorporated as series.apply(type).isin(self.allowed_types). It is unclear how to replace the apply-loop with vectorization. The isin is probably already vectorization? The additional isin improved speed to 5.63x faster.

Besides that, if the series is of non-object dtype, it now 'redirects' to IsDTypeValidation. But I am not satisfied with the code. See also DISLIKE-remarks in-code.

chrispijo commented 3 years ago

A new version is pushed. I made IsTypeValidation how I think it is best (to my knowledge). IsDtypeValidation is changed to allow for multiple dtypes.

The test-file test_validation.py returns errors. Test-files are new for me. I've got to look into that after the weekend.

qotho commented 3 years ago

I tried using this fix and I replaced my use of IsDtypeValidation with IsTypeValidation. I left the argument as a string with the type name, as in IsTypeValidation('int64'). I got "TypeError: data type 'n' not understood" because IsTypeValidation is expecting a list of types, but I passed a string. The code iterated the string characters and errored when it hit 'n' in 'int64'. Maybe IsTypeValidation should check the argument type and if it is a string, wrap it in an array?

qotho commented 3 years ago

Also, this still seems to give me the same error I got before with the following example:

series = pd.Series([1,2,3,4]).astype('Int64')
v = IsDtypeValidation('Int64')
v.get_errors(series)

TypeError: Cannot interpret 'Int64Dtype' as a data type

Isn't this the original error this change was supposed to address (except for StringDtype), or did I misunderstand?

chrispijo commented 3 years ago

I tried using this fix and I replaced my use of IsDtypeValidation with IsTypeValidation. I left the argument as a string with the type name, as in IsTypeValidation('int64'). I got "TypeError: data type 'n' not understood" because IsTypeValidation is expecting a list of types, but I passed a string. The code iterated the string characters and errored when it hit 'n' in 'int64'. Maybe IsTypeValidation should check the argument type and if it is a string, wrap it in an array?

The validation method is meant to allow the normal Python built-in types (like str, float, int, bool), and thus be used as IsTypeValidation([int, float]) for instance.

I am not sure how to correctly check if provided list items in the argument is of the correct types. I will include the following:

if type(allowed_types) != list:
    raise PanSchArgumentError('The argument "allowed_types" passed to IsTypeValidation is not of type list. Provide a '
                              'list containing one or more of the Python built-in types "str", "int", "float" or '
                              '"bool".')

for allowed_type in allowed_types:
    if allowed_type not in [str, int, float, bool]:
        raise PanSchArgumentError('The item "{}" provided in the argument "allowed_types" as passed to '
                                  'IsTypeValidation is not of the correct type. Provide one of Python built-in types '
                                  '"str", "int", "float" or "bool".'.format(allowed_type))

The downside however is that these four are probably not all possible types in a dataframe. The latter could be replaced with if type(allowed_type) != type, but then list (as in IsTypeValidation([list])) would be a valid argument, as list is also an build-in Python type.

chrispijo commented 3 years ago

Also, this still seems to give me the same error I got before with the following example:

series = pd.Series([1,2,3,4]).astype('Int64')
v = IsDtypeValidation('Int64')
v.get_errors(series)

TypeError: Cannot interpret 'Int64Dtype' as a data type

Isn't this the original error this change was supposed to address (except for StringDtype), or did I misunderstand?

Yes you are right. It derailed somewhat to an alternative validation method.