Open iguinn opened 2 weeks ago
Personally, I would use it purely to write and read the bins and data/weights from and to boost-hist, whatever is needed to recreate the boost-hist object without having to re-fill it with data. i.e. if you use LGDO to read a histogram from an .lh5 file, it should return a boost-hist, and if you use LGDO to write a boost-hist, it writes it to file in some predetermined way. That way, the histogram can be saved and communicated easily between different people and scripts and does not require adding any supporting methods to LGDO. It is a small price to convenience if the user cannot directly apply some methods to the LGDO object, but I would think it makes implementation much easier and cleaner. So this falls to your option 1.
As Louis said, we should try to keep LGDOs as pure I/O classes as possible. Third-party formats are much better than us at data manipulation and we want to keep our code as lightweight as possible. This is the overall philosophy behind this package.
As a side note: I personally think that the approach followed by the LH5IO.jl package (i.e. directly read data into Julia native data structure) is better than ours, but it would be harder to implement in Python, unfortunately.
I see one problem with option 1, especially with large histograms. The conversion into a hist.Hist
or boost-histogram
is always copying the whole data buffer. So if you want to update a histogram in an LGDO file, this way would always use twice the actually required memory.
For small histograms this would be negligible and not worth the code overhead here.
For optical maps this could easily be some GBs of extra memory allocation.
(I am not saying that I am at the moment thinking about using this fill function for reboost(-optical)
, but it might come in handy for similar cases).
I think part of the problem with trying to make them pure I/O, is that sometimes I/O is best done with somewhat specialized containers (see https://github.com/legend-exp/legend-pydataobj/pull/109). When dealing with large amounts of data, iterating becomes important, and so having a way to fill in steps becomes useful (or in the case of other objects append). In this case, the hist
library is quite good at filling in this way, so in this case we can probably get away more with pure I/O. However, in other cases, filling directly into the LGDO object is quite normal; also, the numpy arrays required a lot of re-allocating/copying to do any change in size, so I'm not sure we want to make pure I/O our goal?
The documentation makes it sound like hist.Hist.view()
is no-copy (although I haven't tested it or anything...).
I'll also add that if the best way to fill histograms is through boost histogram, we should include filling in the tutorial.
The documentation makes it sound like hist.Hist.view() is no-copy (although I haven't tested it or anything...).
That's right. But there is no way to create a hist.Hist
from a numpy buffer, without copying data (the data=
constructor argument also copies in all cases...).
This is a different situation than for - just an example - pandas tables that can be initialized and manipulated from a loaded LGDO without copying all data again.
I recently added a
.fill
function to the lgdo histogram. This turned out to be controversial, under the argument that any non-trivial histogram operations should be done by thehist
library. So we should figure out the proper scope of this class. @lvarriano @ManuelHu @gipert @jasondetOption 1: It's purely an I/O class. If we want to do any manipulations at all on it we should look towards other implementations Option 2: It's a data container class. This means it should have both I/O functions and functions for setting the contents (this was the logic I was following when adding fill). Under this logic the only other functions from boost-hist (https://boost-histogram.readthedocs.io/en/latest/user-guide/histogram.html) that I could imagine wanting, are
reset
, and maybeadd
. A counter argument to this is the risk of scope creep. Option 3 (which nobody wants but I include for completeness): A full fledged hist class that does all the things in https://boost-histogram.readthedocs.io/en/latest/user-guide/histogram.html. This seems clearly beyond the scope of what we want, and usingview_as
to get these kinds of manipulations seems clearly preferable.