unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Not compatible with Pandas 2.0 #1101

Closed Filimoa closed 1 year ago

Filimoa commented 1 year ago

Describe the bug

Pandera is not currently compatible with the upcoming pandas 2.0 release if the user has a version of Dask below 2023.2.1 installed.

Code Sample

Running

!pip install pandas==2.0.0rc0
!pip install pandera

from pandera import SchemaModel

Fails with the following error

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[1], line 23
     20 sys.path.append("../../..")
     21 warnings.filterwarnings("ignore")
---> 23 from src.transform import load, consts
     24 from src.transform.schemas.output import Base
     25 from src.risk_score.core import hexagons

File ~/Coding/business/three-sigma-etl/src/risk_score/notebooks/../../../src/transform/load.py:9
      7 import pandas as pd
      8 from loguru import logger
----> 9 from pandera import SchemaModel
     11 from src.transform import consts
     12 from src.transform import schemas

File ~/.pyenv/versions/3.10.8/envs/crash-data-extraction-3.10/lib/python3.10/site-packages/pandera/__init__.py:50
     33 from pandera.engines.numpy_engine import Object
     34 from pandera.engines.pandas_engine import (
     35     BOOL,
     36     INT8,
   (...)
     47     pandas_version,
     48 )
---> 50 from . import errors, pandas_accessor, typing
     51 from .checks import Check
     52 from .decorators import check_input, check_io, check_output, check_types

File ~/.pyenv/versions/3.10.8/envs/crash-data-extraction-3.10/lib/python3.10/site-packages/pandera/typing/__init__.py:9
      1 """Typing module.
      2 
      3 For backwards compatibility, pandas types are exposed to the top-level scope of
      4 the typing module.
      5 """
      7 from typing import Set, Type
----> 9 from . import dask, fastapi, geopandas, modin, pyspark
     10 from .common import (
     11     BOOL,
     12     INT8,
   (...)
     42     UInt64,
     43 )
     44 from .pandas import DataFrame, Index, Series

File ~/.pyenv/versions/3.10.8/envs/crash-data-extraction-3.10/lib/python3.10/site-packages/pandera/typing/dask.py:9
      6 from .pandas import GenericDtype, Schema
      8 try:
----> 9     import dask.dataframe as dd
     11     DASK_INSTALLED = True
     12 except ImportError:

File ~/.pyenv/versions/3.10.8/envs/crash-data-extraction-3.10/lib/python3.10/site-packages/dask/dataframe/__init__.py:4
      2 import dask.dataframe._pyarrow_compat
      3 from dask.base import compute
----> 4 from dask.dataframe import backends, dispatch, rolling
      5 from dask.dataframe.core import (
      6     DataFrame,
      7     Index,
   (...)
     13     to_timedelta,
     14 )
     15 from dask.dataframe.groupby import Aggregation

File ~/.pyenv/versions/3.10.8/envs/crash-data-extraction-3.10/lib/python3.10/site-packages/dask/dataframe/backends.py:21
     19 from dask.array.percentile import _percentile
     20 from dask.backends import CreationDispatch, DaskBackendEntrypoint
---> 21 from dask.dataframe.core import DataFrame, Index, Scalar, Series, _Frame
     22 from dask.dataframe.dispatch import (
     23     categorical_dtype_dispatch,
     24     concat,
   (...)
     36     union_categoricals_dispatch,
     37 )
     38 from dask.dataframe.extensions import make_array_nonempty, make_scalar

File ~/.pyenv/versions/3.10.8/envs/crash-data-extraction-3.10/lib/python3.10/site-packages/dask/dataframe/core.py:35
     33 from dask.blockwise import Blockwise, BlockwiseDep, BlockwiseDepDict, blockwise
     34 from dask.context import globalmethod
---> 35 from dask.dataframe import methods
     36 from dask.dataframe._compat import (
     37     PANDAS_GT_140,
     38     PANDAS_GT_150,
     39     check_numeric_only_deprecation,
     40 )
     41 from dask.dataframe.accessor import CachedAccessor, DatetimeAccessor, StringAccessor

File ~/.pyenv/versions/3.10.8/envs/crash-data-extraction-3.10/lib/python3.10/site-packages/dask/dataframe/methods.py:22
     10 #  preserve compatibility while moving dispatch objects
     11 from dask.dataframe.dispatch import (  # noqa: F401
     12     concat,
     13     concat_dispatch,
   (...)
     20     union_categoricals,
     21 )
---> 22 from dask.dataframe.utils import is_dataframe_like, is_index_like, is_series_like
     24 # cuDF may try to import old dispatch functions
     25 hash_df = hash_object_dispatch

File ~/.pyenv/versions/3.10.8/envs/crash-data-extraction-3.10/lib/python3.10/site-packages/dask/dataframe/utils.py:19
     17 from dask.base import get_scheduler, is_dask_collection
     18 from dask.core import get_deps
---> 19 from dask.dataframe import (  # noqa: F401 register pandas extension types
     20     _dtypes,
     21     methods,
     22 )
     23 from dask.dataframe._compat import PANDAS_GT_110, PANDAS_GT_120, tm  # noqa: F401
     24 from dask.dataframe.dispatch import (  # noqa : F401
     25     make_meta,
     26     make_meta_obj,
     27     meta_nonempty,
     28 )

File ~/.pyenv/versions/3.10.8/envs/crash-data-extraction-3.10/lib/python3.10/site-packages/dask/dataframe/_dtypes.py:3
      1 import pandas as pd
----> 3 from dask.dataframe.extensions import make_array_nonempty, make_scalar
      6 @make_array_nonempty.register(pd.DatetimeTZDtype)
      7 def _(dtype):
      8     return pd.array([pd.Timestamp(1), pd.NaT], dtype=dtype)

File ~/.pyenv/versions/3.10.8/envs/crash-data-extraction-3.10/lib/python3.10/site-packages/dask/dataframe/extensions.py:6
      1 """
      2 Support for pandas ExtensionArray in dask.dataframe.
      3 
      4 See :ref:`extensionarrays` for more.
      5 """
----> 6 from dask.dataframe.accessor import (
      7     register_dataframe_accessor,
      8     register_index_accessor,
      9     register_series_accessor,
     10 )
     11 from dask.utils import Dispatch
     13 make_array_nonempty = Dispatch("make_array_nonempty")

File ~/.pyenv/versions/3.10.8/envs/crash-data-extraction-3.10/lib/python3.10/site-packages/dask/dataframe/accessor.py:190
    129     _accessor_methods = (
    130         "asfreq",
    131         "ceil",
   (...)
    145         "tz_localize",
    146     )
    148     _accessor_properties = (
    149         "components",
    150         "date",
   (...)
    186         "year",
    187     )
--> 190 class StringAccessor(Accessor):
    191     """Accessor object for string properties of the Series values.
    192 
    193     Examples
   (...)
    196     >>> s.str.lower()  # doctest: +SKIP
    197     """
    199     _accessor_name = "str"

File ~/.pyenv/versions/3.10.8/envs/crash-data-extraction-3.10/lib/python3.10/site-packages/dask/dataframe/accessor.py:276, in StringAccessor()
    272         meta = (self._series.name, object)
    273     return self._function_map(method, pat=pat, n=n, expand=expand, meta=meta)
    275 @derived_from(
--> 276     pd.core.strings.StringMethods,
    277     inconsistencies="``expand=True`` with unknown ``n`` will raise a ``NotImplementedError``",
    278 )
    279 def split(self, pat=None, n=-1, expand=False):
    280     """Known inconsistencies: ``expand=True`` with unknown ``n`` will raise a ``NotImplementedError``."""
    281     return self._split("split", pat=pat, n=n, expand=expand)

AttributeError: module 'pandas.core.strings' has no attribute 'StringMethods'
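
The traceback ends in an attribute lookup that dask evaluates while defining its `StringAccessor` class, which is why the failure surfaces at import time rather than when string methods are used. Below is a minimal sketch of probing for a name that may have moved between modules. `resolve_first` is a hypothetical helper, and the second candidate path (`pandas.core.strings.accessor`) is an assumption about where pandas defines the class; it is not stated in this report.

```python
import importlib
from typing import Any, Iterable, Optional


def resolve_first(candidates: Iterable[str]) -> Optional[Any]:
    """Return the first 'package.module:attribute' entry that resolves, else None."""
    for path in candidates:
        module_name, _, attr = path.partition(":")
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Module (or the whole package) is missing: try the next candidate.
            continue
        if hasattr(module, attr):
            return getattr(module, attr)
    return None


if __name__ == "__main__":
    # Prints None if pandas is not installed or neither path resolves.
    print(
        resolve_first(
            [
                "pandas.core.strings:StringMethods",  # the path dask used in the traceback
                "pandas.core.strings.accessor:StringMethods",  # assumed alternative location
            ]
        )
    )
```

Probing like this only diagnoses where the name lives; it does not fix an already-installed dask that hard-codes the old path.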

After some testing, it looks like Dask 2023.2.1 or later does not have this issue.

!pip install pandas==2.0.0rc0
!pip install pandera
!pip install dask==2023.2.1

from pandera import SchemaModel
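
Before applying the workaround, it can help to confirm which versions are actually installed, since dask may be present without having been installed on purpose. A small sketch using only the standard library follows; `installed_versions` is a hypothetical helper name, and it reads package metadata without importing the packages, so it works even when the import itself fails.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Reads metadata only, so a broken dask/pandas pairing does not raise here.
    print(installed_versions(["pandas", "dask", "pandera"]))
```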

cosmicBboy commented 1 year ago

hi @Filimoa thanks for the bug report! Will use this issue to track progress for supporting pandas 2.0 in CI. This should also include handling the dask dependency in some way.

Did you already have dask installed in your Python environment?

smc77 commented 1 year ago

Just to note: I had the same issue right now, and I already had dask installed in my environment (independently of pandera).

Filimoa commented 1 year ago

I tested this on my own machine and on Google Colab, and yes, in both cases dask was already installed.

cosmicBboy commented 1 year ago

Cool, gonna add some version checking for pandas and dask and raise an error on import
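
For illustration, an import-time guard along these lines might look like the sketch below. This is not pandera's actual implementation: `release_tuple`, `check_pandas_dask`, and `MIN_DASK_FOR_PANDAS_2` are hypothetical names, and the threshold is simply the 2023.2.1 version from the report above. Pre-release suffixes such as `rc0` are ignored, so `2.0.0rc0` is treated as pandas 2.

```python
import re
from typing import Optional, Tuple

# Assumption: threshold taken from the bug report, not from any official compatibility table.
MIN_DASK_FOR_PANDAS_2 = (2023, 2, 1)


def release_tuple(version: str) -> Tuple[int, ...]:
    """Parse the leading numeric components, dropping suffixes such as 'rc0'."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def check_pandas_dask(pandas_version: str, dask_version: Optional[str]) -> None:
    """Raise ImportError for the pandas 2 + old dask combination from the report."""
    if dask_version is None:
        return  # dask is not installed, so nothing to check
    if release_tuple(pandas_version)[0] < 2:
        return  # the report only concerns pandas 2 and later
    if release_tuple(dask_version) < MIN_DASK_FOR_PANDAS_2:
        minimum = ".".join(str(part) for part in MIN_DASK_FOR_PANDAS_2)
        raise ImportError(
            f"pandas {pandas_version} with dask {dask_version} is known to fail on import; "
            f"upgrade dask to {minimum} or later"
        )
```

Taking version strings as arguments keeps the check testable without either library installed; a caller would pass in whatever versions it reads from the environment.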

cosmicBboy commented 1 year ago

pandera 0.15.0 now supports pandas 2+. Support for dask, modin, and pyspark is not tested yet; we still need to work on that and add a CI matrix covering these additional frameworks. We may also pin or better document compatibility with them.