scikit-hep / histbook

Versatile, high-performance histogram toolkit for Numpy.
BSD 3-Clause "New" or "Revised" License

Consider minimizing Hist pickle size #11

Closed (jpivarski closed this 6 years ago)

jpivarski commented 6 years ago

@imandr: this is inspired by our conversation about remotely filling Hist objects. A fairly minimal 10-bin histogram

h = Hist(bin("x", 10, -5, 5), fill=numpy.random.normal(0, 1, 1000000))

has about 85% of its size in its metadata.

>>> len(pickle.dumps(h._content, protocol=2))
255
>>> len(pickle.dumps(h, protocol=2))
1494

(Pickle protocol 2 is binary. The ratios are approximately the same for the default protocol 0 and the Python 3 protocols.)
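A quick sketch for checking that claim across protocols, using the same h as above (exact byte counts will vary with the Numpy and Python versions):

import pickle

for proto in range(pickle.HIGHEST_PROTOCOL + 1):
    whole = len(pickle.dumps(h, protocol=proto))
    content = len(pickle.dumps(h._content, protocol=proto))
    print(proto, whole, content, float(whole - content) / whole)   # metadata fraction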

I'll post updates in this PR's comment stream.

jpivarski commented 6 years ago

>>> len(pickle.dumps(h._content, protocol=2))
255
>>> len(pickle.dumps(h, protocol=2))
527

jpivarski commented 6 years ago

>>> len(pickle.dumps(h._content, protocol=2))
255
>>> len(pickle.dumps(h, protocol=2))
326
jpivarski commented 6 years ago

@imandr: Is that acceptable? A 10-bin histogram is now 20% metadata, 80% bin contents.

Most histograms have more than 10 bins: a 100-bin histogram is 5% metadata, 95% bin contents.
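That scaling can be measured directly; a sketch using the same setup as above (the exact fractions are approximate):

import pickle
import numpy
from histbook import Hist, bin

data = numpy.random.normal(0, 1, 1000000)
for numbins in (10, 100, 1000):
    h = Hist(bin("x", numbins, -5, 5), fill=data)
    whole = len(pickle.dumps(h, protocol=2))
    content = len(pickle.dumps(h._content, protocol=2))
    print(numbins, float(whole - content) / whole)   # metadata fraction shrinks as bins grow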

jpivarski commented 6 years ago

After reverting the one dangerous optimization but keeping the rest, the 10-bin histogram is 25% metadata, 75% bin contents.

>>> len(pickle.dumps(h._content, protocol=2))
255
>>> len(pickle.dumps(h, protocol=2))
345

A 100-bin histogram is 8% metadata, 92% bin contents. I'm going to claim this is okay.

The "dangerous" optimization replaced axis class names with integers, but if someone uses pickle to save a histogram and reconstitute it years later when there are new axis classes, the integer might not line up (like C++ enums). After reverting it, it's no more dangerous than normal pickling (which is not perfectly archival, either).

jpivarski commented 6 years ago

Actually, what I've been calling "bin contents" is the pickled Numpy array, which has a little metadata of its own.

After all the optimizations, the total pickled size of the object is consistently 245 bytes larger than sizeof(float) * numbins, which is what you'd get if you only serialized the numbers with no context whatsoever. That's 90 bytes from the Hist metadata and 155 bytes from Numpy's metadata, independent of the number of bins (in a 1-d histogram).
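That breakdown can be checked directly; a sketch, assuming h._content is the filled Numpy array (its nbytes attribute gives the raw size of the numbers alone; the array may also carry under/overflow bins, so treat the counts as approximate):

import pickle
import numpy
from histbook import Hist, bin

h = Hist(bin("x", 100, -5, 5), fill=numpy.random.normal(0, 1, 1000000))
raw = h._content.nbytes                            # the numbers only, no context
content = len(pickle.dumps(h._content, protocol=2))
whole = len(pickle.dumps(h, protocol=2))
print(content - raw, whole - content)              # roughly 155 and 90 bytes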

By sending and receiving only h._content, you transmit the bin contents plus 155 bytes per histogram; sending the whole histogram h instead now adds only another 90 bytes per histogram. That ought to be small enough. Before this optimization effort, the Hist overhead was 1235 bytes rather than 90 bytes, so it is now a factor of ~13 smaller.
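For the remote-filling use case, then, either choice stays cheap. A sketch of the round trip (a hypothetical worker/collector split, assuming every worker uses identical binning; histbook may offer its own combination methods):

import pickle
import numpy
from histbook import Hist, bin

def worker(data):
    # fill locally, ship the whole pickled Hist (bin contents + ~245 bytes of overhead)
    h = Hist(bin("x", 100, -5, 5), fill=data)
    return pickle.dumps(h, protocol=2)

# collector: unpickle the partial histograms and sum their bin contents
parts = [pickle.loads(worker(numpy.random.normal(0, 1, 100000))) for i in range(4)]
total = parts[0]
for p in parts[1:]:
    total._content += p._content    # valid only because the binning is identical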