pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: `KeyError: DictionaryType` when converting to/from PyArrow backend #57395

Open ajfriend opened 9 months ago

ajfriend commented 9 months ago

Pandas version checks

Reproducible Example

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import seaborn as sns

df = sns.load_dataset('diamonds')  # has category columns (cut, color, clarity)
table = pa.Table.from_pandas(df)   # category columns become Arrow dictionary type
df2 = table.to_pandas(types_mapper=pd.ArrowDtype)
df2.convert_dtypes(dtype_backend='numpy_nullable')  # raises KeyError

Issue Description

I get an error KeyError: DictionaryType(dictionary<values=string, indices=int8, ordered=0>) with a traceback like the following:

```
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[27], line 9
      7 table = pa.Table.from_pandas(df)
      8 df2 = table.to_pandas(types_mapper=pd.ArrowDtype)
----> 9 df2.convert_dtypes(dtype_backend='numpy_nullable')

File ~/work/keepdb/env/lib/python3.11/site-packages/pandas/core/generic.py:7025, in NDFrame.convert_dtypes(self, infer_objects, convert_string, convert_integer, convert_boolean, convert_floating, dtype_backend)
   6896 """
   6897 Convert columns to the best possible dtypes using dtypes supporting ``pd.NA``.
   6898 (...)
   7022 dtype: string
   7023 """
   7024 check_dtype_backend(dtype_backend)
-> 7025 new_mgr = self._mgr.convert_dtypes(  # type: ignore[union-attr]
   7026     infer_objects=infer_objects,
   7027     convert_string=convert_string,
   7028     convert_integer=convert_integer,
   7029     convert_boolean=convert_boolean,
   7030     convert_floating=convert_floating,
   7031     dtype_backend=dtype_backend,
   7032 )
   7033 res = self._constructor_from_mgr(new_mgr, axes=new_mgr.axes)
   7034 return res.__finalize__(self, method="convert_dtypes")

File ~/work/keepdb/env/lib/python3.11/site-packages/pandas/core/internals/managers.py:456, in BaseBlockManager.convert_dtypes(self, **kwargs)
    453 else:
    454     copy = True
--> 456 return self.apply(
    457     "convert_dtypes", copy=copy, using_cow=using_copy_on_write(), **kwargs
    458 )

File ~/work/keepdb/env/lib/python3.11/site-packages/pandas/core/internals/managers.py:364, in BaseBlockManager.apply(self, f, align_keys, **kwargs)
    362     applied = b.apply(f, **kwargs)
    363 else:
--> 364     applied = getattr(b, f)(**kwargs)
    365 result_blocks = extend_blocks(applied, result_blocks)
    367 out = type(self).from_blocks(result_blocks, self.axes)

File ~/work/keepdb/env/lib/python3.11/site-packages/pandas/core/internals/blocks.py:694, in Block.convert_dtypes(self, copy, using_cow, infer_objects, convert_string, convert_integer, convert_boolean, convert_floating, dtype_backend)
    691 for blk in blks:
    692     # Determine dtype column by column
    693     sub_blks = [blk] if blk.ndim == 1 or self.shape[0] == 1 else blk._split()
--> 694     dtypes = [
    695         convert_dtypes(
    696             b.values,
    697             convert_string,
    698             convert_integer,
    699             convert_boolean,
    700             convert_floating,
    701             infer_objects,
    702             dtype_backend,
    703         )
    704         for b in sub_blks
    705     ]
    706     if all(dtype == self.dtype for dtype in dtypes):
    707         # Avoid block splitting if no dtype changes
    708         rbs.append(blk.copy(deep=copy))

File ~/work/keepdb/env/lib/python3.11/site-packages/pandas/core/dtypes/cast.py:1150, in convert_dtypes(input_array, convert_string, convert_integer, convert_boolean, convert_floating, infer_objects, dtype_backend)
   1147     inferred_dtype = ArrowDtype(pa_type)
   1148 elif dtype_backend == "numpy_nullable" and isinstance(inferred_dtype, ArrowDtype):
   1149     # GH 53648
-> 1150     inferred_dtype = _arrow_dtype_mapping()[inferred_dtype.pyarrow_dtype]
   1152 # error: Incompatible return value type (got "Union[str, Union[dtype[Any],
   1153 # ExtensionDtype]]", expected "Union[dtype[Any], ExtensionDtype]")
   1154 return inferred_dtype

KeyError: DictionaryType(dictionary<values=string, indices=int8, ordered=0>)
```

Expected Behavior

I would expect df2.convert_dtypes() to run without error and return a DataFrame.

Installed Versions

INSTALLED VERSIONS
------------------
commit                : fd3f57170aa1af588ba877e8e28c158a20a4886d
python                : 3.11.2.final.0
python-bits           : 64
OS                    : Darwin
OS-release            : 23.2.0
Version               : Darwin Kernel Version 23.2.0: Wed Nov 15 21:55:06 PST 2023; root:xnu-10002.61.3~2/RELEASE_ARM64_T6020
machine               : arm64
processor             : arm
byteorder             : little
LC_ALL                : None
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8

pandas                : 2.2.0
numpy                 : 1.26.4
pytz                  : 2024.1
dateutil              : 2.8.2
setuptools            : 69.1.0
pip                   : 24.0
Cython                : None
pytest                : 8.0.0
hypothesis            : None
sphinx                : None
blosc                 : None
feather               : None
xlsxwriter            : None
lxml.etree            : None
html5lib              : None
pymysql               : None
psycopg2              : None
jinja2                : 3.1.3
IPython               : 8.21.0
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.12.3
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
gcsfs                 : None
matplotlib            : 3.8.2
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
pyarrow               : 15.0.0
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
zstandard             : None
tzdata                : 2024.1
qtpy                  : None
pyqt5                 : None
rhshadrach commented 9 months ago

Thanks for the report. I'm not certain this shouldn't raise. Besides raising, it seems to me the only reasonable results are to either not convert or to convert to NumPy object dtype. But neither of those is a NumPy-nullable dtype. That suggests to me the operation cannot be performed, i.e. it should raise.

cc @jorisvandenbossche

ajfriend commented 9 months ago

I see. So this might just be user error on my part due to lack of understanding.

For context, my broader goal is to define a "canonical" backend and some conversion functions to/from PyArrow/Parquet such that I can do a round trip from DataFrame --> Pyarrow/Parquet --> DataFrame and get the same types back that I started with, and then test the dataframes for equality.

It wasn't immediately clear to me which combination of types_mapper=pd.ArrowDtype, dtype_backend='pyarrow', and dtype_backend='numpy_nullable' I should be using.

mroeschke commented 9 months ago

Generally when using Arrow types, it's important to realize there are Arrow types (like dictionary) that don't map to Numpy types. See the table in https://pandas.pydata.org/docs/reference/arrays.html#pyarrow

If your goal is to validate a round trip, it's best to round-trip within Arrow types or within NumPy types. As seen in the table linked above, the only strong parallel between the two sets of types is the numeric types.
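
For example, a minimal sketch of staying within Arrow types for the round trip (illustrative column name; assumes pyarrow is installed):

```
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'a': [1, 2, None]}).convert_dtypes(dtype_backend='pyarrow')

# pandas (ArrowDtype) -> Arrow table -> pandas (ArrowDtype): no NumPy types involved
table = pa.Table.from_pandas(df)
df_roundtrip = table.to_pandas(types_mapper=pd.ArrowDtype)

pd.testing.assert_frame_equal(df, df_roundtrip)
```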

jorisvandenbossche commented 9 months ago

That suggests to me the operation cannot be performed, i.e. should raise.

Or, ignore the column? (and leave it as is)

I agree there is no operation that can be done (in the "nullable dtypes" realm, we don't yet have a categorical / dictionary type), but there are other cases where this also occurs, e.g. for our default categorical, datetime64, or interval dtypes. Those dtypes are also not considered "numpy_nullable", but such columns are left intact.
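
A small sketch of that existing behaviour (illustrative data, pandas 2.x assumed):

```
import pandas as pd

s = pd.Series(['a', 'b', None], dtype='category')

# There is no numpy_nullable categorical, so the column is left intact
print(s.convert_dtypes(dtype_backend='numpy_nullable').dtype)  # category
```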

jorisvandenbossche commented 9 months ago

such that I can do a round trip from DataFrame --> Pyarrow/Parquet --> DataFrame and get the same types back that I started with, and then test the dataframes for equality.

@ajfriend In general, not specifying any specific keyword (like types_mapper or dtype_backend) should work. When converting pandas to pyarrow, some metadata about the original dtypes is stored, which is used to try to restore the original dataframe as best as possible when converting back to pandas.
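
A minimal sketch of that default round trip (illustrative columns; assumes pyarrow is installed):

```
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'colA': [1, 2, 3]})
df['colB'] = pd.Categorical(['x', 'y', 'x'])

# No types_mapper / dtype_backend: pyarrow stores pandas metadata on the table
# and uses it to restore the original dtypes when converting back.
restored = pa.Table.from_pandas(df).to_pandas()

print(restored.dtypes)  # colA: int64, colB: category
pd.testing.assert_frame_equal(df, restored)
```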

ajfriend commented 9 months ago

Thanks. This background helps.

For my current project, I am able to pick what format I want the "original" dataframe to be in (the baseline that I would use for comparison), so I'm trying to research which types/backend would be most stable, knowing that I'll be storing them as Parquet files (but open to alternatives) and with an eye towards what the future of Pandas/Arrow will look like.

At a high level, I think I'm just trying to wrap my head around when I should be using Arrow types vs "classic pandas" types, and how they all fit together. Is it fair to say that these details are still being worked out as of Pandas 2.0, and there should be more obvious mappings between pandas types and arrow/parquet types in Pandas 3.0? Perhaps the answer is that I should just be looking forward to 3.0.

Right now, I don't have a great mental model to help me expect what the output of a conversion will be, or what the correct default type should be, if such a thing even exists.

For example, if I have data like

df = pd.DataFrame({
    'colA': [1, 2, 3, None],
    'colB': ['A', 'B', 'C', None],
})

and I use various combinations of transforms like .astype('category'), pa.Table.from_pandas(df).to_pandas(), and convert_dtypes(), with combinations of the dtype_backend and types_mapper options, I can end up with types like

object, string, string[pyarrow], category, and dictionary<values=string, indices=int8, ordered=0>[pyarrow]

and

float64, Float64, Int64, int64[pyarrow], and double[pyarrow]

That's a lot of options. I understand the differences between these types and their use cases, but I'm still not sure whether I should be using, for example, category, dictionary<values=string, indices=int8, ordered=0>[pyarrow], string, or string[pyarrow]. Or when I should expect a float vs. a nullable int. Or when to expect an error like the one I posted above (I also like the idea of ignoring the column to avoid the error).
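
For concreteness, a rough sketch of a few of those combinations and the dtypes I'd expect (based on pandas 2.2 / pyarrow 15; the comments are expectations, not guarantees):

```
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({
    'colA': [1, 2, 3, None],
    'colB': ['A', 'B', 'C', None],
})

print(df.dtypes)                                          # colA: float64, colB: object
print(df.convert_dtypes().dtypes)                         # colA: Int64, colB: string
print(df.convert_dtypes(dtype_backend='pyarrow').dtypes)  # colA: int64[pyarrow], colB: string[pyarrow]

table = pa.Table.from_pandas(df)
print(table.to_pandas(types_mapper=pd.ArrowDtype).dtypes) # colA: double[pyarrow], colB: string[pyarrow]
```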

Maybe the answer for now is that I just need to pick one and be careful about the conversions, but I'm hoping for a future where a lot of that is taken care of for me with reasonable defaults.

But I'm sure it's a very tough problem for you all to figure out defaults that work well for most use cases, so I appreciate all the work you're putting in!

ajfriend commented 9 months ago

To phrase my problem another way, is there a standard way to "canonicalize" a Pandas dataframe to use PyArrow types? I'm looking for something that is idempotent and robust to round trips.

Here are two attempts:

def canon1(df): # works, but doesn't use pyarrow categorical
    df = df.convert_dtypes(dtype_backend='pyarrow')
    table = pa.Table.from_pandas(df)
    df = table.to_pandas()
    df = df.convert_dtypes(dtype_backend='pyarrow')
    return df

def canon2(df): # not idempotent; gives error
    df = df.convert_dtypes(dtype_backend='pyarrow')
    table = pa.Table.from_pandas(df)
    df = table.to_pandas(types_mapper=pd.ArrowDtype)
    return df

With data like

df = pd.DataFrame({
    'colA': [1, 2, 3, None],
    'colB': ['A', 'B', 'C', None],
    'colC': ['aa', 'bb', 'cc', None],
})
df['colC'] = df['colC'].astype('category')

I end up with types

>>> canon1(df).dtypes
colA     int64[pyarrow]
colB    string[pyarrow]
colC           category

>>> canon2(df).dtypes
colA                                       int64[pyarrow]
colB                                      string[pyarrow]
colC    dictionary<values=string, indices=int8, ordere...
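
One way I could sanity-check whether a given canonicalization is stable would be to apply it twice and compare, e.g. with canon1 from above:

```
once = canon1(df)
twice = canon1(once)

# If canon1 is idempotent on df, a second application changes nothing
pd.testing.assert_frame_equal(once, twice)
print(once.dtypes.equals(twice.dtypes))  # True when the dtypes are stable
```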

A few notes/questions:

Is there a better way to do this? Does this idea of "PyArrow canonicalization" even make sense?

(Also, I may have strayed from the original question, but I'm happy to split these out into separate issues if that seems worthwhile.)