henryiii opened this issue 4 years ago
This is a bit tricky to implement; I've started it, but pybind11 doesn't provide runtime utilities for array access, and I don't want to generate 32 copies of this, so it will likely miss the 1.0 target. I think that's fine, since no one has been too worried about missing this so far. Easy buffer access via `.view()` and similar makes it somewhat less important.
Hi @henryiii @HDembinski,
I assume the following is related; if not, please correct me and I will open a fresh issue...
We noticed in the scope of our analysis that `__getitem__` is a performance hurdle for high-dimensional histograms (imagine: a dataset axis with O(1000) datasets, a category axis with O(100) categories, and a systematic axis with O(100) shifts).
Here is a snippet that makes the performance difference visible:

```python
import boost_histogram as bh

h = bh.Histogram(
    bh.axis.StrCategory([str(i) for i in range(100)]),  # e.g. datasets
    bh.axis.StrCategory([str(i) for i in range(100)]),  # e.g. categories
    bh.axis.StrCategory([str(i) for i in range(100)]),  # e.g. systematics
    bh.axis.Regular(100, 0, 500),
)

# let's fill a dummy value
h[...] = 1.0

# now the __getitem__ performance:
%timeit h[bh.loc("42"), bh.loc("42"), bh.loc("42"), :].view()
# 4.08 s ± 61.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit h.view()[h.axes[0].index("42"), h.axes[1].index("42"), h.axes[2].index("42"), :]
# 20.3 µs ± 669 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
Currently we use the second option, since at a larger analysis scale, with several of these huge histograms, the difference is O(hours) versus O(seconds) for histogram manipulation such as grouping datasets into physics processes. However, the first option is (obviously) a lot more convenient to use.
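The fast workaround (resolve each label to an integer bin once, then index the raw view) can be sketched without boost_histogram at all. The following is a minimal NumPy-only illustration; the dict-based `make_category_index` helper is a hypothetical stand-in for `axis.index`, not part of any library:

```python
import numpy as np

def make_category_index(labels):
    """Hypothetical stand-in for a StrCategory axis' label -> bin lookup."""
    return {label: i for i, label in enumerate(labels)}

labels = [str(i) for i in range(10)]
cat = make_category_index(labels)

# Small dummy 4-D counts array standing in for h.view()
counts = np.ones((10, 10, 10, 10))

# Fast path: resolve each label to an integer once, then slice the view
# directly, avoiding any per-call overhead of a generic __getitem__.
sel = counts[cat["4"], cat["4"], cat["4"], :]
print(sel.shape)  # prints (10,)
```

The point is that the label lookup is a cheap O(1) dict access, so all remaining cost is plain NumPy slicing.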
I think this would be a major improvement, especially for the usability of `hist` and `boost_histogram` in large-scale analysis.
Best, Peter
- `_at`
- `_at_set`
- `__getitem__`, `__setitem__` (uses the above functions internally)
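For context, here is a toy sketch of how `__getitem__`/`__setitem__` could delegate to per-cell `_at`/`_at_set` helpers. The names are taken from the list above; the class and its internals are purely illustrative, not boost-histogram's actual implementation:

```python
import numpy as np

class TinyHist:
    """Illustrative toy histogram; only integer indexing, no axes or flow bins."""

    def __init__(self, shape):
        self._view = np.zeros(shape)

    def _at(self, idx):
        # single-cell read on the underlying view
        return self._view[idx]

    def _at_set(self, idx, value):
        # single-cell write on the underlying view
        self._view[idx] = value

    def __getitem__(self, idx):
        return self._at(idx)

    def __setitem__(self, idx, value):
        self._at_set(idx, value)

h = TinyHist((3, 3))
h[1, 2] = 5.0
print(h[1, 2])  # prints 5.0
```

Keeping the per-cell access in dedicated helpers lets the (more complex) public `__getitem__`/`__setitem__` handle slices and locators while the hot single-cell path stays cheap.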