hombit opened 1 week ago
## Some motivating benchmarks

```python
import numpy as np
import pyarrow as pa

from nested_pandas.datasets import generate_data

nf = generate_data(10_000, 1000)

%timeit nf.reduce(np.mean, 'nested.flux')
# 43.3 ms ± 224 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

flux = pa.array(nf['nested']).field('flux')  # this is fast, ~5 µs

%timeit np.add.reduceat(flux.values, flux.offsets[:-1]) / np.diff(flux.offsets)
# 1.92 ms ± 34.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
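For reference, the `reduceat` trick used in the fast path above can be shown on plain NumPy data (a minimal sketch; `values` and `offsets` here are stand-ins for the flat values and offsets buffers of a pyarrow list array):

```python
import numpy as np

# Stand-ins for a pyarrow ListArray's flat values and offsets buffers:
# three "rows" with 2, 3, and 1 elements respectively.
values = np.array([1.0, 3.0, 2.0, 4.0, 6.0, 10.0])
offsets = np.array([0, 2, 5, 6])

# Sum each segment with a single ufunc call, then divide by segment
# lengths to get per-row means without any Python-level loop.
sums = np.add.reduceat(values, offsets[:-1])
means = sums / np.diff(offsets)
print(means)  # [ 2.  4. 10.]
```

One caveat: `reduceat` does not handle empty segments the way one might hope (for an index `i` with `offsets[i] == offsets[i + 1]` it returns `values[offsets[i]]` rather than an identity element), so empty nested rows would need special-casing.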
## Feature request

Today, we have these ways to aggregate the values of a single nested column:

- `nf.reduce(np.mean, "lc.mag")` - good, but not cheap, and it requires joining the output back to the frame
- `nf.eval("lc.mag.groupby(by=lc.mag.index).mean()")` - expensive and not intuitive

It would be nice if we could develop an easier way of doing such aggregations. Options I see:

- `nf.eval("lc.mag.mean()")` / `nf["lc.mag"].mean()`, but these currently output the aggregation over all the flat values, which is not intuitive, especially in the first case. We could redefine this behavior.
- A `.nest` accessor, e.g. `nf.lc.nest.mean()`, which would return `nf.shape[0]` mean values.
- An `eval`/`query` environment only, e.g. `nf.eval("lc.mag.nest_mean()")`.
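As a rough illustration of the accessor option: pandas already supports registering custom accessors, so a per-row aggregation could hang off one. A hypothetical sketch (the accessor name and internals here are assumptions for illustration, not the actual nested-pandas API):

```python
import numpy as np
import pandas as pd


@pd.api.extensions.register_series_accessor("nest_demo")
class NestDemoAccessor:
    """Hypothetical per-row aggregation accessor for a series of arrays."""

    def __init__(self, series: pd.Series):
        self._series = series

    def mean(self) -> pd.Series:
        # One mean per row, preserving the outer index.
        return self._series.apply(np.mean)


s = pd.Series([np.array([1.0, 3.0]), np.array([2.0, 4.0, 6.0])])
print(s.nest_demo.mean())  # one mean per row: 2.0 and 4.0
```

A real implementation would of course aggregate over the underlying pyarrow buffers rather than calling `apply`, but the accessor surface would look like this.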
However, I'm not sure how we'd make all of these performant; it looks like `pyarrow` provides almost zero tooling for that. Maybe we can use things like `numpy.ufunc.reduceat` and `scipy.ndimage.mean`.
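To sketch why `numpy.ufunc.reduceat` is attractive here: any binary ufunc can be reduced per offsets-defined segment in a single vectorized call, which covers sums, minima, and maxima directly, and a mean then only needs a division by segment lengths (`scipy.ndimage.mean`, which aggregates by a labels array instead of offsets, is a possible alternative not shown here):

```python
import numpy as np

values = np.array([5.0, 1.0, 7.0, 2.0, 9.0, 3.0])
offsets = np.array([0, 2, 5, 6])  # segments: [0:2), [2:5), [5:6)
starts = offsets[:-1]

# Each of these is one vectorized call over all segments at once.
seg_max = np.maximum.reduceat(values, starts)                  # [5. 9. 3.]
seg_min = np.minimum.reduceat(values, starts)                  # [1. 2. 3.]
seg_mean = np.add.reduceat(values, starts) / np.diff(offsets)  # [3. 6. 3.]
```

This maps naturally onto list-array offsets, so a `.nest`-style aggregation could dispatch common reductions to `reduceat` and fall back to a slower path for everything else.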