lincc-frameworks / nested-pandas

Efficient Pandas representation for nested associated datasets.
https://nested-pandas.readthedocs.io
MIT License

Make it easier to run aggregations over nested elements in nf.eval, nf.query and nf.nested.nest #155

Open hombit opened 1 week ago

hombit commented 1 week ago

Feature request

Today, we have these ways to aggregate the values of a single nested column:

It would be nice to develop an easier way of doing such aggregations. Options I see:

  1. Currently, nf.eval("lc.mag.mean()") and nf["lc.mag"].mean() both work, but they aggregate over all the flat values, which is not intuitive, especially in the first case. We could redefine this behavior.
  2. Add a special interface for nested aggregations through a .nest accessor, e.g. nf.lc.nest.mean() would return nf.shape[0] mean values.
  3. Add special methods that work only in the eval/query environment, e.g. nf.eval("lc.mag.nest_mean()").
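To make the distinction in option 1 concrete, here is a minimal sketch in plain pandas. It assumes the flat view of a nested column is a Series whose index repeats the parent row label once per nested element; the toy data and the name `flat` are made up for illustration and do not use the nested-pandas API.

```python
import pandas as pd

# Hypothetical flat view of "lc.mag": two parent rows (index 0 and 1),
# with three and two nested values respectively.
flat = pd.Series([1.0, 2.0, 3.0, 10.0, 20.0], index=[0, 0, 0, 1, 1], name="lc.mag")

# Aggregation over all flat values: a single scalar.
overall_mean = flat.mean()

# Per-row aggregation: one mean per parent row, which is what
# options 2 and 3 would return.
per_row_mean = flat.groupby(level=0).mean()
```

Grouping by the index level is easy to write, but it is a general-purpose group-by rather than something that exploits the contiguous layout of nested data.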

However, I'm not sure how we'd make all of these performant: pyarrow seems to provide almost no tooling for this. Maybe we can use things like numpy.ufunc.reduceat and scipy.ndimage.mean.
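As a rough sketch of the NumPy route, the functions below compute per-row means from a flat value array plus list offsets (length n_rows + 1). The function names are hypothetical, and returning NaN for empty rows is an assumption, not a decided behavior. The second variant builds per-element row labels and sums with np.bincount; the same labels could presumably be passed to scipy.ndimage.mean instead.

```python
import numpy as np


def segment_mean_reduceat(values, offsets):
    """Per-row mean using np.add.reduceat; empty rows give NaN (assumed convention)."""
    offsets = np.asarray(offsets)
    # Drop any trailing values that no row refers to.
    values = np.asarray(values, dtype=float)[: offsets[-1]]
    counts = np.diff(offsets)
    means = np.full(counts.shape, np.nan)
    nonempty = counts > 0
    if nonempty.any():
        # Start indices of non-empty rows are strictly increasing, so each
        # reduceat segment covers exactly one row. Passing empty rows to
        # reduceat would return a single element instead of an empty sum.
        sums = np.add.reduceat(values, offsets[:-1][nonempty])
        means[nonempty] = sums / counts[nonempty]
    return means


def segment_mean_bincount(values, offsets):
    """Per-row mean using row labels and np.bincount; empty rows give NaN."""
    offsets = np.asarray(offsets)
    values = np.asarray(values, dtype=float)[offsets[0] : offsets[-1]]
    counts = np.diff(offsets)
    # One integer label per flat element, naming the row it belongs to.
    labels = np.repeat(np.arange(counts.size), counts)
    sums = np.bincount(labels, weights=values, minlength=counts.size)
    with np.errstate(invalid="ignore", divide="ignore"):
        return sums / counts


example_values = [1.0, 2.0, 3.0, 10.0, 20.0]
example_offsets = [0, 3, 3, 5]  # rows: [1, 2, 3], [], [10, 20]
means_reduceat = segment_mean_reduceat(example_values, example_offsets)
means_bincount = segment_mean_bincount(example_values, example_offsets)
```

The empty-row handling matters because np.add.reduceat does not return 0 for a zero-length segment, and it raises if a start index equals the length of the value array.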


hombit commented 6 days ago

Some motivating benchmarks

import numpy as np
import pyarrow as pa
from nested_pandas.datasets import generate_data

nf = generate_data(10_000, 1000)

%timeit nf.reduce(np.mean, 'nested.flux')
# 43.3 ms ± 224 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

flux = pa.array(nf['nested']).field('flux')  # this is fast, ~ 5μs
%timeit np.add.reduceat(flux.values, flux.offsets[:-1]) / np.diff(flux.offsets)
# 1.92 ms ± 34.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)