legend-exp / legend-pydataobj

LEGEND Python Data Objects
https://legend-pydataobj.readthedocs.io
GNU General Public License v3.0
1 stars 9 forks source link

Histogram Scope #114

Open iguinn opened 2 weeks ago

iguinn commented 2 weeks ago

I recently added a .fill function to the lgdo histogram. This turned out to be controversial, under the argument that any non-trivial histogram operations should be done by the hist library. So we should figure out the proper scope of this class. @lvarriano @ManuelHu @gipert @jasondet

Option 1: It's purely an I/O class. If we want to do any manipulations at all on it we should look towards other implementations Option 2: It's a data container class. This means it should have both I/O functions and functions for setting the contents (this was the logic I was following when adding fill). Under this logic the only other functions from boost-hist (https://boost-histogram.readthedocs.io/en/latest/user-guide/histogram.html) that I could imagine wanting, are reset, and maybe add. A counter argument to this is the risk of scope creep. Option 3 (which nobody wants but I include for completeness): A full fledged hist class that does all the things in https://boost-histogram.readthedocs.io/en/latest/user-guide/histogram.html. This seems clearly beyond the scope of what we want, and using view_as to get these kinds of manipulations seems clearly preferable.

lvarriano commented 2 weeks ago

Personally, I would use it purely to write and read the bins and data/weights from and to boost-hist, whatever is needed to recreate the boost-hist object without having to re-fill it with data. i.e. if you use LGDO to read a histogram from an .lh5 file, it should return a boost-hist, and if you use LGDO to write a boost-hist, it writes it to file in some predetermined way. That way, the histogram can be saved and communicated easily between different people and scripts and does not require adding any supporting methods to LGDO. It is a small price to convenience if the user cannot directly apply some methods to the LGDO object, but I would think it makes implementation much easier and cleaner. So this falls to your option 1.

gipert commented 2 weeks ago

As Louis said, we should try to keep LGDOs as pure I/O classes as possible. Third-party formats are much better than us at data manipulation and we want to keep our code as lightweight as possible. This is the overall philosophy behind this package.

As a side note: I personally think that the approach followed by the LH5IO.jl package (i.e. directly read data into Julia native data structure) is better than ours, but it would be harder to implement in Python, unfortunately.

ManuelHu commented 2 weeks ago

I see one problem with option 1, especially with large histograms. The conversion into a hist.Hist or boost-histogram is always copying the whole data buffer. So if you want to update a histogram in an LGDO file, this way would always use twice the actually required memory.

For small histograms this would be negligible and not worth the code overhead here. For optical maps this could easily be some GBs of extra memory allocation. (I am not saying that I am at the moment thinking about using this fill function for reboost(-optical), but it might come in handy for similar cases).

iguinn commented 1 week ago

I think part of the problem with trying to make them pure I/O, is that sometimes I/O is best done with somewhat specialized containers (see https://github.com/legend-exp/legend-pydataobj/pull/109). When dealing with large amounts of data, iterating becomes important, and so having a way to fill in steps becomes useful (or in the case of other objects append). In this case, the hist library is quite good at filling in this way, so in this case we can probably get away more with pure I/O. However, in other cases, filling directly into the LGDO object is quite normal; also, the numpy arrays required a lot of re-allocating/copying to do any change in size, so I'm not sure we want to make pure I/O our goal?

The documentation makes it sound like hist.Hist.view() is no-copy (although I haven't tested it or anything...).

I'll also add that if the best way to fill histograms is through boost histogram, we should include filling in the tutorial.

ManuelHu commented 1 week ago

The documentation makes it sound like hist.Hist.view() is no-copy (although I haven't tested it or anything...).

That's right. But there is no way to create a hist.Hist from a numpy buffer, without copying data (the data= constructor argument also copies in all cases...). This is a different situation than for - just an example - pandas tables that can be initialized and manipulated from a loaded LGDO without copying all data again.