ajfriend opened 9 months ago
Thanks for the report. I'm not certain this shouldn't raise. Besides raising, it seems to me the only reasonable results are to either not convert or convert to NumPy object dtype. But neither of these are NumPy-nullable dtypes. That suggests to me the operation cannot be performed, i.e. should raise.
cc @jorisvandenbossche
I see. So this might just be user error on my part due to lack of understanding.
For context, my broader goal is to define a "canonical" backend and some conversion functions to/from PyArrow/Parquet such that I can do a round trip from DataFrame --> Pyarrow/Parquet --> DataFrame and get the same types back that I started with, and then test the dataframes for equality.
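Roughly, I want something like the following sketch to pass (the file name and column contents here are just placeholders):

```python
import pandas as pd
import pandas.testing as tm

df = pd.DataFrame({'colA': pd.array([1, 2, 3, None], dtype='int64[pyarrow]')})

# DataFrame --> Parquet --> DataFrame; the open question is which
# dtype options make this a lossless round trip.
df.to_parquet('roundtrip.parquet')
df2 = pd.read_parquet('roundtrip.parquet', dtype_backend='pyarrow')

tm.assert_frame_equal(df, df2)  # should pass when the types survive
```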
It wasn't immediately clear to me which combination of `types_mapper=pd.ArrowDtype` and `dtype_backend='numpy_nullable'` I should be using.
Generally when using Arrow types, it's important to realize there are Arrow types (like dictionary) that don't map to NumPy types. See the table in https://pandas.pydata.org/docs/reference/arrays.html#pyarrow

If your goal is to validate a round trip, it's best to round trip within Arrow types or within NumPy types. As seen in the table linked above, the only strong parallel between the two type systems is the numeric types.
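For example, staying Arrow-backed on both ends round-trips cleanly (a sketch, assuming pandas 2.x):

```python
import pandas as pd
import pyarrow as pa

# Arrow-backed dtype on the pandas side...
df = pd.DataFrame({'x': pd.array([1.5, None], dtype='double[pyarrow]')})

# ...and Arrow types on the pyarrow side round-trip without loss.
table = pa.Table.from_pandas(df)
df2 = table.to_pandas(types_mapper=pd.ArrowDtype)

print(df2.dtypes)  # x    double[pyarrow]
```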
> That suggests to me the operation cannot be performed, i.e. should raise.

Or, ignore the column? (and leave it as is)
I agree there is no operation that can be done (in the "nullable dtypes" realm, we don't yet have a categorical / dictionary type), but there are other cases where this also occurs, e.g. for our default categorical, or datetime64 or interval dtypes. Those dtypes are also not considered "numpy_nullable", but such columns are left intact.
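A quick illustration of that existing behavior (a sketch; output per pandas 2.x):

```python
import pandas as pd

df = pd.DataFrame({
    'cat': pd.Categorical(['a', 'b', 'a']),
    'num': [1.0, 2.0, None],
})

# The categorical column has no numpy_nullable equivalent and is
# left as-is; the numeric column is converted to nullable Int64.
print(df.convert_dtypes(dtype_backend='numpy_nullable').dtypes)
# cat    category
# num       Int64
```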
> such that I can do a round trip from DataFrame --> PyArrow/Parquet --> DataFrame and get the same types back that I started with, and then test the dataframes for equality.
@ajfriend In general, not specifying any keyword (like `types_mapper` or `dtype_backend`) should work. When converting pandas to pyarrow, some metadata about the original dtypes is stored, and it is used to restore the original dataframe as closely as possible when converting back to pandas.
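You can inspect that metadata directly; a minimal sketch:

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'c': pd.Categorical(['a', 'b', 'a'])})
table = pa.Table.from_pandas(df)

# from_pandas() stores the original pandas dtypes as JSON under the
# b'pandas' key of the schema metadata...
print(table.schema.metadata[b'pandas'][:60])

# ...and a plain to_pandas() consults it, restoring `category` here.
print(table.to_pandas().dtypes)  # c    category
```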
Thanks. This background helps.
For my current project, I am able to pick what format I want the "original" dataframe to be in (the baseline that I would use for comparison), so I'm trying to research which types/backend would be most stable, knowing that I'll be storing them as parquet files (but I'm open to alternatives) and with an eye towards what the future of Pandas/Arrow will look like.
At a high level, I think I'm just trying to wrap my head around when I should be using Arrow types vs "classic pandas" types, and how they all fit together. Is it fair to say that these details are still being worked out as of Pandas 2.0, and there should be more obvious mappings between pandas types and arrow/parquet types in Pandas 3.0? Perhaps the answer is that I should just be looking forward to 3.0.
Right now, I don't have a great mental model to help me expect what the output of a conversion will be, or what the correct default type should be, if such a thing even exists.
For example, if I have data like

```python
df = pd.DataFrame({
    'colA': [1, 2, 3, None],
    'colB': ['A', 'B', 'C', None],
})
```
and I use various combinations of transforms like `.astype('category')`, `pa.Table.from_pandas(df).to_pandas()`, and `convert_dtypes()`, together with the options `dtype_backend` and `types_mapper`, I can end up with types like

- `object`
- `string`
- `string[pyarrow]`
- `category`
- `dictionary<values=string, indices=int8, ordered=0>[pyarrow]`

and

- `float64`
- `Float64`
- `Int64`
- `int64[pyarrow]`
- `double[pyarrow]`
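For instance, a single string column can already reach several of these depending on the path taken (a sketch of what I'm seeing; the exact dtype classes behind these reprs can vary by version):

```python
import pandas as pd
import pyarrow as pa

s = pd.Series(['A', 'B', 'C', None], name='colB')

print(s.dtype)                                          # object
print(s.convert_dtypes().dtype)                         # string
print(s.convert_dtypes(dtype_backend='pyarrow').dtype)  # string[pyarrow]
print(s.astype('category').dtype)                       # category
print(pa.Array.from_pandas(s.astype('category')).type)
# dictionary<values=string, indices=int8, ordered=0>
```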
That's a lot of options. I understand the differences between these types and their use cases, but I'm still not sure if I should be using, for example, `category`, `dictionary<values=string, indices=int8, ordered=0>[pyarrow]`, `string`, or `string[pyarrow]`. Or when I should expect a float vs. a nullable int. Or when to expect an error like the one I posted above. (I also like the idea of ignoring the column to avoid the error.)
Maybe the answer for now is that I just need to pick one and be careful about the conversions, but I'm hoping for a future where a lot of that is taken care of for me with reasonable defaults.
But I'm sure it's a very tough problem for you all to figure out defaults that work well for most use cases, so I appreciate all the work you're putting in!
To phrase my problem another way: is there a standard way to "canonicalize" a pandas DataFrame to use PyArrow types? I'm looking for something that is idempotent and robust to round trips.
Here are two attempts:
```python
import pandas as pd
import pyarrow as pa

def canon1(df):  # works, but doesn't use the pyarrow categorical type
    df = df.convert_dtypes(dtype_backend='pyarrow')
    table = pa.Table.from_pandas(df)
    df = table.to_pandas()
    df = df.convert_dtypes(dtype_backend='pyarrow')
    return df

def canon2(df):  # not idempotent; gives an error
    df = df.convert_dtypes(dtype_backend='pyarrow')
    table = pa.Table.from_pandas(df)
    df = table.to_pandas(types_mapper=pd.ArrowDtype)
    return df
```
With data like

```python
df = pd.DataFrame({
    'colA': [1, 2, 3, None],
    'colB': ['A', 'B', 'C', None],
    'colC': ['aa', 'bb', 'cc', None],
})
df['colC'] = df['colC'].astype('category')
```
I end up with types
>>> canon1(df).dtypes
colA int64[pyarrow]
colB string[pyarrow]
colC category
>>> canon2(df).dtypes
colA int64[pyarrow]
colB string[pyarrow]
colC dictionary<values=string, indices=int8, ordere...
A few notes/questions:

- Depending on whether I include the final `df = df.convert_dtypes(dtype_backend='pyarrow')` in `canon1()`, I get slightly different outputs: either `pandas.core.dtypes.dtypes.ArrowDtype` or `pandas.core.arrays.string_.StringDtype`, and it isn't immediately clear if there's a significant difference here. The string representation of the `StringDtype` type is also different under `df.info()` and `df.dtypes`: `string` vs. `string[pyarrow]`.
- `canon1()` seems to work OK and seems to be idempotent, but doesn't use pyarrow for the categorical type :(
- `canon2()` isn't idempotent; applying it a second time gives `ValueError: format number 1 of "dictionary<values=string, indices=int8, ordered=0>[pyarrow]" is not recognized`.
- The same thing happens with `pa.Table.from_pandas(canon2(df)).to_pandas()`: I get the same `ValueError: format number 1 of "dictionary<values=string, indices=int8, ordered=0>[pyarrow]" is not recognized`.
Is there a better way to do this? Does this idea of "PyArrow canonicalization" even make sense?
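One fallback I can imagine is to give up the categorical encoding entirely, e.g. a hypothetical `canon3()` (a sketch; the name and approach are mine, not a pandas API):

```python
import pandas as pd
import pyarrow as pa

def canon3(df):  # hypothetical: trades categoricals for their value type
    df = df.copy()
    for col in df.columns:
        if isinstance(df[col].dtype, pd.CategoricalDtype):
            # Decode the categorical so every column has a pyarrow
            # dtype that survives convert_dtypes() and round trips.
            df[col] = df[col].astype(df[col].cat.categories.dtype)
    df = df.convert_dtypes(dtype_backend='pyarrow')
    table = pa.Table.from_pandas(df)
    return table.to_pandas(types_mapper=pd.ArrowDtype)
```

This should be idempotent, but it loses the dictionary encoding, so it's not a real answer to the question.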
(Also, I may have strayed from the original question, but I'm happy to split these out into separate issues if that seems worthwhile.)
Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
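A minimal example, consistent with the traceback below (the exact DataFrame construction is assumed; any string categorical column triggers the error):

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'colB': ['A', 'B', 'C', None]})
df['colB'] = df['colB'].astype('category')

table = pa.Table.from_pandas(df)
df2 = table.to_pandas(types_mapper=pd.ArrowDtype)
df2.convert_dtypes(dtype_backend='numpy_nullable')  # raises KeyError
```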
Issue Description
I get an error

```
KeyError: DictionaryType(dictionary<values=string, indices=int8, ordered=0>)
```

with a traceback like the following:
```
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[27], line 9
      7 table = pa.Table.from_pandas(df)
      8 df2 = table.to_pandas(types_mapper=pd.ArrowDtype)
----> 9 df2.convert_dtypes(dtype_backend='numpy_nullable')

File ~/work/keepdb/env/lib/python3.11/site-packages/pandas/core/generic.py:7025, in NDFrame.convert_dtypes(self, infer_objects, convert_string, convert_integer, convert_boolean, convert_floating, dtype_backend)
   6896 """
   6897 Convert columns to the best possible dtypes using dtypes supporting ``pd.NA``.
   6898 (...)
   7022 dtype: string
   7023 """
   7024 check_dtype_backend(dtype_backend)
-> 7025 new_mgr = self._mgr.convert_dtypes(  # type: ignore[union-attr]
   7026     infer_objects=infer_objects,
   7027     convert_string=convert_string,
   7028     convert_integer=convert_integer,
   7029     convert_boolean=convert_boolean,
   7030     convert_floating=convert_floating,
   7031     dtype_backend=dtype_backend,
   7032 )
   7033 res = self._constructor_from_mgr(new_mgr, axes=new_mgr.axes)
   7034 return res.__finalize__(self, method="convert_dtypes")

File ~/work/keepdb/env/lib/python3.11/site-packages/pandas/core/internals/managers.py:456, in BaseBlockManager.convert_dtypes(self, **kwargs)
    453 else:
    454     copy = True
--> 456 return self.apply(
    457     "convert_dtypes", copy=copy, using_cow=using_copy_on_write(), **kwargs
    458 )

File ~/work/keepdb/env/lib/python3.11/site-packages/pandas/core/internals/managers.py:364, in BaseBlockManager.apply(self, f, align_keys, **kwargs)
    362     applied = b.apply(f, **kwargs)
    363 else:
--> 364     applied = getattr(b, f)(**kwargs)
    365 result_blocks = extend_blocks(applied, result_blocks)
    367 out = type(self).from_blocks(result_blocks, self.axes)

File ~/work/keepdb/env/lib/python3.11/site-packages/pandas/core/internals/blocks.py:694, in Block.convert_dtypes(self, copy, using_cow, infer_objects, convert_string, convert_integer, convert_boolean, convert_floating, dtype_backend)
    691 for blk in blks:
    692     # Determine dtype column by column
    693     sub_blks = [blk] if blk.ndim == 1 or self.shape[0] == 1 else blk._split()
--> 694     dtypes = [
    695         convert_dtypes(
    696             b.values,
    697             convert_string,
    698             convert_integer,
    699             convert_boolean,
    700             convert_floating,
    701             infer_objects,
    702             dtype_backend,
    703         )
    704         for b in sub_blks
    705     ]
    706     if all(dtype == self.dtype for dtype in dtypes):
    707         # Avoid block splitting if no dtype changes
    708         rbs.append(blk.copy(deep=copy))

File ~/work/keepdb/env/lib/python3.11/site-packages/pandas/core/internals/blocks.py:695, in
...
```

Expected Behavior
I would expect `df2.convert_dtypes()` to run without error and return a DataFrame.

Installed Versions