Open randolf-scholz opened 1 year ago
Thanks for the report. It looks like we should probably avoid factorizing in pivot for dictionary pyarrow types. For reference, this is the traceback:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/thomasli/pandas/pandas/core/frame.py", line 8792, in pivot
    return pivot(self, index=index, columns=columns, values=values)
  File "/Users/thomasli/pandas/pandas/core/reshape/pivot.py", line 557, in pivot
    multiindex = MultiIndex.from_arrays(index_list)
  File "/Users/thomasli/pandas/pandas/core/indexes/multi.py", line 523, in from_arrays
    codes, levels = factorize_from_iterables(arrays)
  File "/Users/thomasli/pandas/pandas/core/arrays/categorical.py", line 2762, in factorize_from_iterables
    codes, categories = zip(*(factorize_from_iterable(it) for it in iterables))
  File "/Users/thomasli/pandas/pandas/core/arrays/categorical.py", line 2762, in <genexpr>
    codes, categories = zip(*(factorize_from_iterable(it) for it in iterables))
  File "/Users/thomasli/pandas/pandas/core/arrays/categorical.py", line 2735, in factorize_from_iterable
    cat = Categorical(values, ordered=False)
  File "/Users/thomasli/pandas/pandas/core/arrays/categorical.py", line 443, in __init__
    codes, categories = factorize(values, sort=True)
  File "/Users/thomasli/pandas/pandas/core/algorithms.py", line 743, in factorize
    return values.factorize(sort=sort, use_na_sentinel=use_na_sentinel)
  File "/Users/thomasli/pandas/pandas/core/base.py", line 1051, in factorize
    codes, uniques = algorithms.factorize(
  File "/Users/thomasli/pandas/pandas/core/algorithms.py", line 759, in factorize
    codes, uniques = values.factorize(use_na_sentinel=use_na_sentinel)
  File "/Users/thomasli/pandas/pandas/core/arrays/arrow/array.py", line 891, in factorize
    encoded = data.dictionary_encode(null_encoding=null_encoding)
  File "pyarrow/table.pxi", line 586, in pyarrow.lib.ChunkedArray.dictionary_encode
  File "pyarrow/_compute.pyx", line 560, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 355, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Function 'dictionary_encode' has no kernel matching input types (dictionary<values=string, indices=int32, ordered=0>)
take
For information, here is the same type of issue, with an example. On a 16 GB PC, with a DataFrame loaded via read_parquet with the pyarrow engine, pivoting on 4 categorical columns with one int column summed leads to a memory allocation failure.
<class 'pandas.core.frame.DataFrame'>
Index: 3711683 entries, 14 to 9070
Data columns (total 5 columns):
 #   Column       Dtype
 0   ntusername   category
 1   user         category
 2   techno       category
 3   server       category
 4   calls_count  int64
memory usage: 276.1+ MB
test = pd.pivot_table(df, values=['calls_count'], index=['textdata_tables', 'user', 'ntusername', 'date'], aggfunc=np.sum).reset_index()
=> MemoryError: Unable to allocate 10.8 GiB for an array with shape (8510, 682784) and data type int16
Saving to CSV, reloading (so the categorical columns come back as object dtype), and running the same pivot returns a valid result in about a second.
Fixed as of #53232 I think.
With current libraries (arrow=16.1, pandas=2.2.2, numpy=2.0.0), the example in my OP fails with:
AttributeError: 'Series' object has no attribute '_pa_array'
@lithomas1 The bug seems to be this line:
The issue is that values can be one of (ABCIndex, ABCSeries, ExtensionArray), but not all of these have the _pa_array attribute.
I wrote a naive patch here: #59099
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
This bug is caused by https://github.com/apache/arrow/issues/34890: currently, pyarrow's dictionary_encode is not idempotent, i.e. it fails with ArrowNotImplementedError instead of returning the data as-is if the array is already of dictionary type. So either one waits until it is fixed upstream in pyarrow, or an additional check needs to be added to test whether the series is already of dictionary data type.
Expected Behavior
Pivot should work with categorical data when using the pyarrow backend.
Installed Versions