scikit-hep / aghast

Aghast: aggregated, histogram-like statistics, sharable as Flatbuffers.

aghast in C++? #44

Open · HDembinski opened this issue 2 years ago

HDembinski commented 2 years ago

A commonly requested feature for Boost.Histogram in C++ is to convert from and to ROOT histograms. In Python, we can already do that with aghast, but not from C++. Calling into a Python library from C++ is both ghastly and awkward, so I wonder whether we could port Aghast to C++ to support this.
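
For reference, the existing Python-side route looks roughly like this (a sketch along the lines of the aghast README; treat the exact signatures as illustrative):

>>> import numpy as np
>>> import aghast
>>> numpy_hist = np.histogram(np.random.normal(0, 1, 10000))
>>> ghastly = aghast.from_numpy(numpy_hist)   # NumPy -> aghast's Flatbuffers model
>>> root_hist = aghast.to_root(ghastly, "h")  # aghast -> ROOT histogram (needs PyROOT)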

I also like the Flatbuffers serialization very much; it could be the "native" format for Boost.Histogram (C++) and boost-histogram (Python) to talk to each other. Right now, the C++ code only has a generic serialization protocol in place that works with Boost.Serialization and compatible libraries like Cereal. The Python library uses this infrastructure to break a Boost.Histogram object into Python primitives, which are then serialized in the usual way. This produces a Python-specific format that is clunky to read from another language. Another commonly requested feature is to be able to dump a histogram to disk in C++ and read it again from Python. This is currently not supported, since the two libraries do not have a common serialization format. Flatbuffers could fill that role.
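
To make the current situation concrete: boost-histogram objects round-trip within Python via pickle, but the byte stream is Python-specific (a minimal sketch; nothing here is readable from C++):

>>> import pickle
>>> import boost_histogram as bh
>>> h = bh.Histogram(bh.axis.Regular(10, 0.0, 1.0))
>>> _ = h.fill([0.1, 0.2, 0.2, 0.9])
>>> blob = pickle.dumps(h)   # the histogram broken into Python primitives, then pickled
>>> pickle.loads(blob) == h  # round-trips, but only within Python
True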

jpivarski commented 2 years ago

I learned from this project that while Flatbuffers may be a very fast protocol in C++, the Python implementation exists purely for functionality, not speed. All the indirection that's supposed to stay in-cache in C++ becomes very slow unpacking in Python.

Fundamentally, what should the serialization format for histograms be? This Flatbuffers schema was an attempt from a time when I thought that many histogram libraries would proliferate. Now it's settling down to just ROOT and Boost.Histogram. The Flatbuffers schema here is trying to be ultra-general, like allowing the underflow and overflow bins to be anywhere, but as you pointed out, there's only one convention now: ROOT and Boost.Histogram both put the underflow before and the overflow after the bins. Flatbuffers has three strong selling points:

  1. the C++ serialization/deserialization is very fast (I'm trusting the Flatbuffers documentation on this; haven't tried it myself)
  2. it's the only rowwise/record-oriented format I know of that can selectively deserialize parts of a record, for any schema
  3. this indirection also makes it very capable of schema evolution: old code doesn't have to know it's skipping over new fields, as it's just following pointers.

So the question is, how important are these selling points for the specific problem of serializing and deserializing Boost.Histogram? That's a different question than the one we were asking before, which was "What's important for serializing/deserializing any histogram in any histogram library?" Surely, (1) is still important. Is (2) important? That was to accommodate a union of features from all histogram libraries, with the idea that any particular histogram library won't be using all of them. Now, if you're deserializing every field every time, there might not be as much value in that. Similarly, the flexibility in point (3) mattered when we were taking on the problem of all current and future histogramming libraries, but focusing on Boost.Histogram, the data structures may be stable enough that you don't need such an extreme of flexibility. This can be adjusted in the Flatbuffers schema itself by making never-changing substructures into a struct, rather than a table, which removes a level of indirection. (In the Flatbuffers examples, struct is used for things like 3D points, which will always have exactly three fields, and table is used for the video game Monster type, which may get new attributes in the future.)

Flatbuffers's three selling points don't address some needs that are important to physics users (continuing the numbering):

  4. being able to store histograms and event data in the same file
  5. interoperability with ROOT
  6. efficiently storing thousands of histograms with mostly the same metadata, ideally by writing one copy of the metadata.

In principle, (4) could be addressed by implementing the histogram serialization in Parquet, since Parquet can also hold event data. The Flatbuffers schema in Aghast isolates the bin contents to keep them columnar; in Parquet, they'd automatically be columnar. Everything else (the bulk of the Aghast Flatbuffers schema) can be Parquet metadata, which is any bytestring associated with a node of an array.

Awkward 2.0 will be using this metadata, so here's a demonstration:

>>> import awkward as ak
>>> import numpy as np
>>> import pyarrow.parquet as pq
>>> contents, edges = np.histogram(np.random.normal(0, 1, 100000))
>>> histogram_array = ak._v2.highlevel.Array(
...     ak._v2.contents.NumpyArray(
...         contents,
...         parameters={"edges": edges.tolist()},
...     )
... )
>>> histogram_array
<Array [83, 846, 5408, 18452, ..., 12456, 2860, 349, 18] type='10 * int64[pa...'>
>>> histogram_array.type
ArrayType(NumpyType('int64', parameters={'edges': [-4.0401437884795435, -3.1990114867101216, -2.3578791849407, -1.5167468831712787, -0.6756145814018568, 0.16551772036756507, 1.006650022136986, 1.847782323906408, 2.68891462567583, 3.5300469274452517, 4.371179229214674]}), 10)
>>> pq.write_table(
...     ak._v2.operations.convert.to_arrow_table(histogram_array),
...     "tmp.parquet",
... )
>>> from_disk = ak._v2.operations.convert.from_arrow(pq.read_table("tmp.parquet"))
>>> from_disk
<Array [83, 846, 5408, 18452, ..., 12456, 2860, 349, 18] type='10 * int64[pa...'>
>>> from_disk.type
ArrayType(NumpyType('int64', parameters={'edges': [-4.0401437884795435, -3.1990114867101216, -2.3578791849407, -1.5167468831712787, -0.6756145814018568, 0.16551772036756507, 1.006650022136986, 1.847782323906408, 2.68891462567583, 3.5300469274452517, 4.371179229214674]}), 10)

Awkward Array sends its parameters through JSON (assuming that they're small), but a path that goes directly from Boost.Histogram to Arrow/Parquet could encode them as a binary array of doubles. Or Awkward can be modified to store some parameters in a more efficient format if this becomes a use case.
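
For instance, such a direct path could attach the edges to the Arrow field as raw bytes instead of JSON (a hypothetical sketch, reusing contents and edges from above; the b"edges" key and bare-float64 encoding are my own conventions, not an established schema):

>>> import pyarrow as pa
>>> field = pa.field("counts", pa.int64(), metadata={b"edges": edges.tobytes()})
>>> table = pa.Table.from_arrays(
...     [pa.array(np.asarray(contents, np.int64))], schema=pa.schema([field])
... )
>>> pq.write_table(table, "tmp2.parquet")
>>> stored = pq.read_table("tmp2.parquet").schema.field("counts").metadata[b"edges"]
>>> np.array_equal(np.frombuffer(stored), edges)   # float64 is np.frombuffer's default
True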

It's also possible to write metadata to HDF5, and HDF5 would bring a lot of options for tuning sparse data. However, jagged event data can't be stored in HDF5 efficiently, so if the goal is motivation (4), putting event data and histograms in the same file, HDF5 wouldn't satisfy it.
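
For completeness, the HDF5 version of the metadata trick would look like this (a sketch with h5py; the dataset and attribute names are arbitrary):

>>> import h5py
>>> with h5py.File("tmp.h5", "w") as f:
...     dset = f.create_dataset("histogram", data=contents)
...     dset.attrs["edges"] = edges
...
>>> with h5py.File("tmp.h5", "r") as f:
...     print(np.array_equal(f["histogram"].attrs["edges"], edges))
...
True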

On the other hand, maybe the edges could be part of the data, rather than metadata:

>>> histogram_array2 = ak._v2.highlevel.Array(
...     ak._v2.contents.RecordArray(
...         [
...             ak._v2.contents.NumpyArray(edges[:-1]),
...             ak._v2.contents.NumpyArray(edges[1:]),
...             ak._v2.contents.NumpyArray(contents),
...         ],
...         ["low", "high", "value"],
...         parameters={"__array__": "histogram"},
...     )
... )
>>> histogram_array2
<Array [{low: -4.34, high: -3.5, value: 23}, ..., {...}] type='10 * struct[[...'>
>>> histogram_array2.show()
[{low: -4.34, high: -3.5, value: 23},
 {low: -3.5, high: -2.66, value: 396},
 {low: -2.66, high: -1.82, value: 3160},
 {low: -1.82, high: -0.98, value: 13006},
 {low: -0.98, high: -0.141, value: 28114},
 {low: -0.141, high: 0.698, value: 31102},
 {low: 0.698, high: 1.54, value: 17920},
 {low: 1.54, high: 2.38, value: 5436},
 {low: 2.38, high: 3.21, value: 769},
 {low: 3.21, high: 4.05, value: 74}]

With the above, a large number of histograms could be stored in one array, satisfying (6), and it's easy to see how it could be generalized to Boost.Histogram's many storage types. If the edges are repeated frequently, they could be dictionary-encoded in Parquet.
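
As a sketch of that (reusing the ak._v2 API from above; the outer list-of-records layout and field names are illustrative choices, not a fixed schema):

>>> many = ak._v2.highlevel.Array(
...     [
...         [
...             {"low": lo, "high": hi, "value": int(v)}
...             for lo, hi, v in zip(edges[:-1], edges[1:], contents)
...         ]
...         for contents, edges in (
...             np.histogram(np.random.normal(0, 1, 10000)) for _ in range(1000)
...         )
...     ]
... )
>>> # one array of 1000 histograms, type 1000 * var * {low, high, value}
>>> pq.write_table(ak._v2.operations.convert.to_arrow_table(many), "many.parquet")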

Furthermore, maybe an Arrow/Parquet format that has a natural and meaningful deserialization into Pandas would also be a good idea. Arrow is supposed to be a fast backend for Pandas, and "interoperability with Pandas" may be an important criterion (7) in the future.
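
That round trip already works mechanically through Arrow's built-in converters (a sketch, reusing edges and contents from above; the column names are my own):

>>> import pandas as pd
>>> df = pd.DataFrame({"low": edges[:-1], "high": edges[1:], "value": contents})
>>> pa.Table.from_pandas(df).to_pandas().equals(df)   # Pandas -> Arrow -> Pandas
True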

Then there's criterion (5), interoperability with ROOT. As we've seen, some users are already serializing Boost.Histogram with ROOT streamers, somehow, in a non-standardized way. Standardizing that by adding ROOT/Cling dictionary-generating macros (with dummy macros, behind #ifndef guards, for when Boost.Histogram is used outside of ROOT) would at least ensure that a million formats don't proliferate.

But what about interoperability with ROOT's v7 histograms? Do v7 histograms have all the features of Boost.Histogram, in the sense that Boost.Histograms can be losslessly converted to and from ROOT v7 histograms? If so, then maybe that could be the serialization format. It would satisfy criteria (1), (3), (4), (5), and maybe (6), by joining histograms with like metadata into a single histogram with an extra axis.

I'm looking forward to the prospect of there being a standardized serialization format for Boost.Histogram. Flatbuffers was chosen for a different reason, interoperability with all histogramming libraries, which is less important now that Boost.Histogram is dominant among non-ROOT users. Flatbuffers is still an okay choice, but there are some things it doesn't do. Alternatives include Arrow/Parquet, Awkward/Arrow/Parquet, Pandas/Arrow/Parquet, HDF5, ROOT as a new class, and ROOT as v7 histograms.

What do you think?

bendavid commented 2 years ago

I looked into the case of ROOT persistence of boost histograms a bit.

This doesn't work with the default storage type (I think for the reasons @HDembinski outlined in https://root-forum.cern.ch/t/boost-1-70-released-new-library-boost-histogram/33595/12).

On the other hand, histograms from make_weighted_histogram seem to work just fine, albeit with some (possibly spurious) warnings.

I've opened an issue for ROOT about this (https://github.com/root-project/root/issues/9371)

I don't think that pre-generating dictionaries is necessarily a complete solution, since one may want to serialize a template instance that is only instantiated on the fly with PyROOT/Cling (but maybe for a subset of commonly used template instances this could make sense).