Open jpivarski opened 2 years ago
Hi Jim, thank you for this note, but I think that adding support for awkward is adding complexity for us without a real performance benefit.
If the computational cost of generating the event and doing computation with the event is large compared to the cost of the Python loop over events, then you do not gain from using awkward. This is most certainly the case here. You can just use ordinary numpy arrays, which is simpler than working with awkward arrays.
As far as I understand, impy + pyhepmc is already faster at generating a HepMC3 output file than the pure C++ code CRMC, although impy loops over events in Python and the generator output is converted into numpy arrays, which are converted back into HepMC3 data structures before they are written to disk. This seems to indicate that the current approach is already very performant.
If the particle multiplicity per event were small, then awkward should accelerate things, but the multiplicity at the LHC is fairly large.
I learned about this project from @HDembinski at PyHEP 2022.
I agree that providing a truly Pythonic interface to not just one but a lot of generators is a great idea!
Providing an iterator that yields NumPy arrays, like numpythia and pyjet, partially accelerates analysis by allowing vectorization over particles, but not over events. Each event has a different number of particles, so an array that spans many events can't be a rectangular NumPy array, but it can be an Awkward Array.
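To make the jagged layout concrete, here is a minimal NumPy-only sketch of the idea behind such an array: one flat content buffer plus an offsets buffer marking where each event starts and stops. The event values are made up for illustration, and this is not how impy or Awkward Array is implemented internally.

```python
import numpy as np

# Hypothetical per-event outputs (e.g. particle momenta); lengths differ per event.
events = [np.array([1.0, 2.0, 3.0]), np.array([]), np.array([4.0, 5.0])]

# Jagged layout: a flat content buffer plus offsets marking event boundaries.
counts = np.array([len(e) for e in events], dtype=np.int64)
offsets = np.concatenate([[0], np.cumsum(counts)])
content = np.concatenate(events)

# Per-event sums for all events at once, with no Python loop over events.
# A running sum is used so that empty events correctly give 0.
running = np.concatenate([[0.0], np.cumsum(content)])
event_sums = running[offsets[1:]] - running[offsets[:-1]]

print(offsets)
print(event_sums)
```

Event i occupies content[offsets[i]:offsets[i + 1]], so operations over particles and over events both become array operations.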
@aryan26roy and I implemented such an interface in https://github.com/scikit-hep/fastjet and wrote about it in arXiv:2202.03911. The key idea is to minimize the number of Python objects and steps through the Python interpreter.
The fastjet library was written with a minimal coupling between Python and C++ (having learned how not to do Python-C++ couplings), passing only simple data types—numbers and borrowed arrays—between the languages, and it lets Python/NumPy own the array buffers, so that they're scoped as Python users expect. However, the fastjet interface also builds the Awkward types manually, and there is now a better way to do it, which requires less maintenance.
This summer, @ManasviGoyal implemented an Awkward Array-builder in header-only C++ that automates this process without a loss of performance (which the ak.ArrayBuilder interface has, if you're familiar with it). Her new awkward::LayoutBuilder specializes an Awkward data structure using C++ templates, which can then be filled and converted to a Python Awkward Array through ak.from_buffers.
The LayoutBuilder documentation is still being integrated into the docs:
https://github.com/scikit-hep/awkward/blob/manasvi/layout-builder-user-guide/docs-sphinx/user-guide/how-to-use-header-only-layoutbuilder.md
but I can give a short walk-through in this issue.
Getting the code
The header-only LayoutBuilder is in this directory (4 files):
https://github.com/scikit-hep/awkward/tree/main/src/awkward/_v2/cpp-headers/awkward
which could be included in the impy project as a git submodule (how I generally deal with C++ header-only dependencies), and it is also shipped in the awkward Python package; you can get -I compiler flags like this:
LayoutBuilder for records and variable-length lists
An array of events with one-value-per-event attributes and different-length lists of particles can be modeled using only records (structs) and variable-length lists. Here's an example of building that with LayoutBuilder:
This builder can then be filled with
though you would likely do that in a loop (like ArrayBuilder). The above data are equivalent to:
Getting these data into Python
The LayoutBuilder code doesn't have helper methods to pass the data through pybind11 so that different projects can use different binding generators. (It's currently being used in one Cling project, Awkward ↔ RDataFrame, and one Cython project, ctapipe.) So here's an explanation of how to use it with pybind11.
The data come out of LayoutBuilder as a set of named buffers and a Form (JSON) that tells Awkward Array how to put it all together. Here's an example of making an Awkward Array like that in Python (not from LayoutBuilder):
So we just need LayoutBuilder to give us these pieces.
Moreover, we want NumPy to own the array buffers, so that they get deleted when the Awkward Array goes out of Python scope, not when the LayoutBuilder goes out of C++ scope. The hand-off therefore needs a few steps.
Allocate memory for the buffers in Python with np.empty(nbytes, dtype=np.uint8) and get void* pointers to these buffers by casting the output of numpy_array.ctypes.data (pointer as integer).
Now you can pass everything over the border from C++ to Python using pybind11's py::buffer_protocol for the buffers, as well as an integer for the length and a string for the Form.
Unlike the fastjet interface, if you ever need to make a change to the format of the records (add, remove, rename, or change the type of a field), you don't need to change anything in the Python-C++ interface. All of that is contained in the specialization of the C++ template and the filling procedure, which are both in your C++ code.
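The allocation and pointer hand-off described above can be sketched in Python alone. Here ctypes.memmove stands in for the C++ code that would fill the buffers through the void* pointers; the buffer names and byte sizes are hypothetical, not output from a real LayoutBuilder.

```python
import ctypes
import numpy as np

# Hypothetical buffer names and sizes in bytes, as the C++ side might report them.
sizes = {"node0-offsets": 4 * 8, "node1-data": 5 * 8}

# Python allocates, so NumPy owns the memory and frees it with the arrays.
buffers = {name: np.empty(nbytes, dtype=np.uint8) for name, nbytes in sizes.items()}

# Raw addresses as plain integers; C++ would cast these to void*.
pointers = {name: buf.ctypes.data for name, buf in buffers.items()}

# Stand-in for the C++ fill step: copy bytes to the given addresses.
offsets_src = np.array([0, 3, 3, 5], dtype=np.int64)
data_src = np.array([1.1, 2.2, 3.3, 4.4, 5.5], dtype=np.float64)
ctypes.memmove(pointers["node0-offsets"], offsets_src.ctypes.data, offsets_src.nbytes)
ctypes.memmove(pointers["node1-data"], data_src.ctypes.data, data_src.nbytes)

# Reinterpret the raw bytes with their real types; no copy is made.
offsets = buffers["node0-offsets"].view(np.int64)
data = buffers["node1-data"].view(np.float64)

print(offsets, data)
```

Because the uint8 arrays are created in Python, their lifetime follows the Python objects that reference them, regardless of when the C++ builder is destroyed.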
Conclusion
So, I gave you a lot of information without even knowing if you're interested in providing this kind of interface, but it's so that you can assess whether you want to undertake this step, with a realistic sense of what it would involve. Getting batches of events per Python iteration rather than single events can have ergonomic and performance benefits, especially in the limit of large numbers of small events.
Thank you for your consideration!