multimeric / PandasSchema

A validation library for Pandas data frames using user-friendly schemas
https://multimeric.github.io/PandasSchema/
GNU General Public License v3.0

IsDtypeValidation-issue for pandas StringDtype #39

Open chrispijo opened 3 years ago

chrispijo commented 3 years ago

Hi. I am trying to confirm that all values in a Pandas column are of type string. Doing this with IsDtypeValidation raises the error TypeError: Cannot interpret 'StringDtype' as a data type. I posted a question on StackOverflow, and based on the comments I suspect that this might actually be an error in the IsDtypeValidation class.

Is this an error? Or am I misusing the class/package?

import numpy as np
import pandas as pd
from pandas_schema.validation import IsDtypeValidation

series = pd.Series(["a", "b", "c"])

# Works as expected:
#   Returns a validation warning as the series is of dtype 'object' and not 'string'.
print(f"dtype = {series.dtypes}")  # Returns: dtype = object
idv = IsDtypeValidation(dtype=np.dtype(str))  # np.str was a deprecated alias of str, removed in NumPy 1.24
validation_warnings = idv.get_errors(series=series)
print(validation_warnings[0])  # Returns: The column  has a dtype of object which is not a subclass of the required type <U0

# But we know that the series only contains string-values. Thus convert_dtypes() below.
# Does not work as expected:
#   Returns an error and traceback with 'TypeError: Cannot interpret 'StringDtype' as a data type'.
#   Expected output should be no error or validation warning.
series = series.convert_dtypes()
print(f"dtype = {series.dtypes}")  # Returns: dtype = string
idv = IsDtypeValidation(dtype=np.dtype(str))
validation_warnings = idv.get_errors(series=series)  # Error occurs in this line: 'TypeError: Cannot interpret 'StringDtype' as a data type'

Besides that, awesome work! Really handy package.

multimeric commented 3 years ago

Hmm, so this comes down to the fact that:

>>> import pandas as pd
>>> import numpy as np
>>> np.dtype(str)
dtype('<U')
>>> pd.StringDtype()
StringDtype
>>> np.issubdtype(np.dtype(str), pd.StringDtype())
TypeError: Cannot interpret 'StringDtype' as a data type

However, I'm not actually sure why this is the case. I would have thought an official Pandas Dtype extension would be compatible with the numpy API. I will look into it, but I'm happy to hear your input on how this should be implemented.
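For illustration only, here is a minimal sketch of one way a dtype check could tolerate pandas extension dtypes. It is not PandasSchema's implementation: `dtype_matches` is a hypothetical helper, and it assumes that extension dtypes such as `StringDtype` should be compared by equality, while plain NumPy dtypes keep the `np.issubdtype` check.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionDtype
from pandas.api.types import pandas_dtype


def dtype_matches(series: pd.Series, required) -> bool:
    """Hypothetical helper: check a series' dtype without handing
    pandas extension dtypes to np.issubdtype."""
    # Normalise the requirement, e.g. "string" -> StringDtype, "int64" -> np.dtype
    required = pandas_dtype(required)
    actual = series.dtype

    # Extension dtypes are not NumPy dtypes, so compare them by equality
    # and let the extension dtype's own __eq__ do the comparison.
    if isinstance(actual, ExtensionDtype):
        return bool(actual == required)
    if isinstance(required, ExtensionDtype):
        return bool(required == actual)

    # Both sides are plain NumPy dtypes here, so issubdtype is safe.
    return bool(np.issubdtype(actual, required))


strings = pd.Series(["a", "b", "c"], dtype="string")
print(dtype_matches(strings, "string"))
print(dtype_matches(strings, "int64"))
```

The trade-off in this sketch is that extension dtypes lose the "is a subclass of" semantics: a `string` column only matches a `string` requirement exactly, which may or may not be the behaviour the library wants.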

chrispijo commented 3 years ago

Ok. Good to know. Thanks for the quick response.