intake / akimbo

For when your data won't fit in your dataframe
https://akimbo.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Unusually slow `pandas.concat()` on awkward series #48

Closed chuyuanliu closed 4 months ago

chuyuanliu commented 4 months ago

Hi,

I am trying to concatenate multiple dataframes using pandas.concat(). When there are columns of awkward series, this process seems to be extremely slow.

The packages I am using are

pandas 2.2.0
awkward 2.6.1
awkward_pandas 2023.8.0

I dug a little into the source code and found these lines: https://github.com/intake/awkward-pandas/blob/d0f789388a9a0517c4c7c722bd7f3656910b5260/src/awkward_pandas/array.py#L130-L132 It looks like this code actually calls ak.concatenate on pandas.Series objects instead of on the raw arrays. A fix that works for me is something like:

@classmethod
def _concat_same_type(cls, to_concat):
    return cls(ak.concatenate([a._data for a in to_concat]))

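Tested on the following sample:

import awkward as ak
import awkward_pandas as akpd
import numpy as np
import pandas as pd

# Build a jagged array of 1,000,000 lists, each holding 1 to 4 floats.
np.random.seed(0)
size = 1_000_000
shape = np.random.choice(range(1, 5), size)
data = np.ones(np.sum(shape), dtype="float64")
array = ak.unflatten(data, shape)

# Wrap it in a dataframe and concatenate ten copies.
df = pd.DataFrame({"test": akpd.from_awkward(array)})
concated = pd.concat([df] * 10, ignore_index=True, copy=False)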

The last line takes about 500 s without the fix and about 0.2 s with it.

jpivarski commented 4 months ago

Thanks! With a change in speed like that, you've probably found a case in which the Awkward Array was converted into Python objects with to_list and then converted back with from_iter. That's not supposed to ever happen, but a 2500× speedup strongly suggests that it did happen here.
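To illustrate why a round trip through Python objects would be so much slower than concatenating the underlying buffers, here is a toy model using only the standard library. It is a sketch under stated assumptions, not Awkward's or awkward-pandas' actual implementation: ToyJagged, concat_columnar and concat_via_objects are hypothetical names, and the offsets-plus-content layout is a simplified stand-in for a jagged array.

```python
from array import array
from itertools import accumulate


class ToyJagged:
    """Toy jagged array (hypothetical): flat values plus list boundaries."""

    def __init__(self, offsets, content):
        self.offsets = offsets  # array('q'): n_lists + 1 boundaries, starting at 0
        self.content = content  # array('d'): all values stored end to end

    @classmethod
    def from_lists(cls, lists):
        # Slow direction: visits every Python list and every Python float.
        offsets = array("q", accumulate((len(x) for x in lists), initial=0))
        content = array("d", (v for x in lists for v in x))
        return cls(offsets, content)

    def to_lists(self):
        # Slow direction: materializes one Python list per row.
        bounds = zip(self.offsets[:-1], self.offsets[1:])
        return [list(self.content[start:stop]) for start, stop in bounds]


def concat_columnar(parts):
    """Concatenate by appending buffers; values are never turned into objects."""
    offsets = array("q", [0])
    content = array("d")
    for part in parts:
        shift = offsets[-1]
        # Boundaries of later parts move by the number of values already stored.
        offsets.extend(o + shift for o in part.offsets[1:])
        content.extend(part.content)
    return ToyJagged(offsets, content)


def concat_via_objects(parts):
    """Concatenate by converting to Python lists and back (the suspected slow path)."""
    rows = []
    for part in parts:
        rows.extend(part.to_lists())
    return ToyJagged.from_lists(rows)


a = ToyJagged.from_lists([[1.0, 2.0], [], [3.0]])
b = ToyJagged.from_lists([[4.0]])
fast = concat_columnar([a, b])
slow = concat_via_objects([a, b])
```

Both functions produce the same jagged result, but concat_via_objects allocates a Python object for every list and every value, while concat_columnar only appends to two flat buffers. In this toy model, unwrapping to the raw buffers before concatenating plays the role that `a._data` plays in the proposed fix.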

Since you've also solved the issue, this should be a PR, and I got one started for you in #49.