multimeric / PandasSchema

A validation library for Pandas data frames using user-friendly schemas
https://multimeric.github.io/PandasSchema/
GNU General Public License v3.0
189 stars 35 forks source link

support category dtypes #24

Closed caddac closed 5 years ago

caddac commented 5 years ago

Resolves #22

multimeric commented 5 years ago

Hi David, thanks for identifying and fixing this issue. I have two suggestions on how I'd like this implemented.

Firstly, it seems that the more elegant way to do this broad type checking is with the pandas.api.types functions, as explained here: https://stackoverflow.com/a/38185759/2148718. These seem to have existed since 0.19 (2016), so I'm happy to bump the pandas version requirement and use them. This would, for example, give us access to pandas.api.types.is_categorical_dtype. This would also stop pandas throwing an error if you try to check a categorical column, and instead it would just return false, which is ideal.

Secondly, I think we could improve your actual skip-empty behaviour for categoricals:

validated = (series.astype(object).str.len() > 0) & simple_validation

Currently you're converting the column to a string and checking for empty strings, but I don't believe this is the idiomatic way to treat categoricals. In theory the user might actually have the empty string as one of their categories, and so we don't want to ignore a column in this case. On the other hand, the documentation says "All values of categorical data are either in categories or np.nan", which makes me think that, like with numericals, we should treat only np.nan as an empty categorical column.

caddac commented 5 years ago

Cool, will implement these changes.

For the dtype check, are you thinking we should use the pandas type checks for all checks? i.e.

        if is_categorical_dtype(series):
            validated = (series.astype(object).str.len() > 0) & simple_validation
        elif is_numeric_dtype(series):
            validated = ~series.isna() & simple_validation
        else:
            validated = (series.str.len() > 0) & simple_validation

or just for the categorical check?

multimeric commented 5 years ago

Might as well do it in both cases. But also make sure to bump the pandas version in the setup.py also.

caddac commented 5 years ago

I'm not seeing a pandas version in the setup.py. Seems it just installing the latest?

multimeric commented 5 years ago

Yes, exactly. Which is why it needs a constraint

caddac commented 5 years ago

Allright, I think I've got this all wrapped up now. Ended up having to use pandas>=0.21 for the Series.isna() call which failed on pandas 0.19.

multimeric commented 5 years ago

Hmm, unless I use that method elsewhere, you might as well change it to isnull() (https://pandas.pydata.org/pandas-docs/version/0.19/generated/pandas.Series.isnull.html#pandas.Series.isnull), just so we can support the most versions of pandas possible.

caddac commented 5 years ago

OK, working with pandas 0.19!

caddac commented 5 years ago

For some reason the comments on this PR are completely out of order, I've submitted a ticket to Github. Very confusing...

multimeric commented 5 years ago

Excellent, looks good! I tweaked some of your docstrings, but nothing major.

For some reason the comments on this PR are completely out of order, I've submitted a ticket to Github. Very confusing...

Yep I also noticed that. Never seen it before...

multimeric commented 5 years ago

This is now available as a GitHub release and on PyPi as 0.3.4