intake / akimbo

For when your data won't fit in your dataframe
https://akimbo.readthedocs.io
BSD 3-Clause "New" or "Revised" License

BUG: iterrows() on an awkward pandas column with equal-length rows results in a ValueError #55

Open Girmii opened 1 month ago

Girmii commented 1 month ago

Reproducible Example

import awkward as ak
import awkward_pandas as akpd
import pandas as pd

# numbers = [[1, 2, 3], [4, 5], [6]]  # no error
numbers = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # error
letters = ["A", "B", "C"]

numbers_ak = ak.from_iter(numbers)
numbers_akpd = akpd.from_awkward(numbers_ak)

df = pd.DataFrame({"letters": letters, "numbers": numbers_akpd})

for idx, row in df.iterrows():
    print(f"{idx} - {row['letters']}, {row['numbers']}")

File .venv/lib/python3.11/site-packages/pandas/core/internals/blocks.py:2253, in EABackedBlock.get_values(self, dtype)
   2251     values = values.astype(object)
   2252 # TODO(EA2D): reshape not needed with 2D EAs
-> 2253 return np.asarray(values).reshape(self.shape)

ValueError: cannot reshape array of size 9 into shape

Issue Description

I reported this issue at the Pandas repository, but they referred me here first to verify that it is not an error in awkward_pandas (see Pandas issue 58927).


When calling iterrows() on a DataFrame which contains an awkward array as a column, a ValueError occurs (see stacktrace example). This error only occurs when all rows of the awkward array are of equal length. In this case the calls to values.astype(object) and/or np.asarray(values) in the get_values function in the pandas/core/internals/blocks.py module result in a 2D array, instead of a 1D array with nested lists. When the awkward array is actually jagged, the call results in the correct format of the array (see commented line in code example) and iterrows() works as intended.
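The shape difference described above can be reproduced with NumPy alone. The following is a minimal sketch using plain Python lists rather than awkward arrays; whether awkward_pandas's astype(object) goes through exactly this path is an assumption, and block_shape is a stand-in for the (1, 3) shape a single-column block of three rows would have.

```python
import numpy as np

regular = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # equal-length rows
ragged = [[1, 2, 3], [4, 5], [6]]            # jagged rows

# Even with an explicit object dtype, NumPy infers dimensionality from nesting:
# equal-length rows become a 2D array, jagged rows a 1D array of lists.
regular_arr = np.array(regular, dtype=object)
ragged_arr = np.array(ragged, dtype=object)
print(regular_arr.shape, ragged_arr.shape)

# Stand-in for self.shape of a block holding one column and three rows.
block_shape = (1, 3)

ragged_arr.reshape(block_shape)  # three elements fit into (1, 3)

try:
    regular_arr.reshape(block_shape)  # nine elements do not
    reshape_error = None
except ValueError as exc:
    reshape_error = str(exc)
print(reshape_error)

# Filling a pre-allocated 1D object array element by element keeps one
# list per row, regardless of whether the rows have equal length.
flat = np.empty(len(regular), dtype=object)
for i, row in enumerate(regular):
    flat[i] = row
print(flat.reshape(block_shape).shape)
```

If this is indeed what happens, a conversion that always produces a 1D object array (as in the last step) would avoid the reshape failure for equal-length rows.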

Expected Behavior

I would expect iterrows() to iterate over the DataFrame rows without raising an error, yielding for each row a Series in which the awkward array value for that row is set correctly.
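For comparison, here is a sketch of the same loop with the numbers stored as a plain object-dtype column of Python lists instead of the awkward extension dtype. It only illustrates the row-wise result expected here; it is not a fix, and it gives up awkward's vectorised operations on the column.

```python
import pandas as pd

numbers = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # equal-length rows
letters = ["A", "B", "C"]

# Plain object-dtype column: each cell holds one Python list.
df = pd.DataFrame(
    {"letters": letters, "numbers": pd.Series(numbers, dtype=object)}
)

rows = []
for idx, row in df.iterrows():
    rows.append((idx, row["letters"], row["numbers"]))
    print(f"{idx} - {row['letters']}, {row['numbers']}")
```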

Installed Versions

awkward         2.6.5
awkward_pandas  2023.8.0
numpy           1.26.3
pandas          2.2.0
martindurant commented 3 weeks ago

On main, we have changed this library quite substantially, to make it simpler while supporting more dataframe libraries. As a result, the pandas "awkward" dtype will disappear, and only the .ak accessor (on series and dataframes) will remain as the way to get awkward's vectorised nested/ragged operations. The data columns themselves will tend to be stored in arrow layout, which is becoming the pandas standard.

That's a rather long way of saying that iterrows() will "just work", as it does for any other data type that pandas already knows about.

Exactly how to get your data to be stored as arrow is another matter, and one that pandas seems a bit confused about (see here). With https://github.com/intake/awkward-pandas/pull/56, which I just posted, you could do:

df["numbers"] = df.numbers.ak.to_output()

(note that you don't need your data to be in arrow storage before using .ak)