scikit-hep / histbook

Versatile, high-performance histogram toolkit for Numpy.
BSD 3-Clause "New" or "Revised" License

Implementation of export to TH2 #47

Status: Closed (clelange closed this issue 5 years ago)

clelange commented 6 years ago

Hi, I've been playing around with uproot and histbook, and I find them nice for data exploration. However, I would then like to profit from RooFit in my analysis workflow, and I usually export the histogram to ROOT for this purpose (and then import it as a RooDataHist). When trying to export a 2-dimensional Hist, defined e.g. as

hp_dwcReferenceType_ntracks = Hist(bin("dwcReferenceType", 16, -.5, 15.5), profile("ntracks"))

or

h2_dwcReferenceType_ntracks = Hist(bin("dwcReferenceType", 16, -.5, 15.5), bin("ntracks", 2, -.5, 1.5))

I get a NotImplementedError: TH2 (from https://github.com/scikit-hep/histbook/blob/master/histbook/export.py#L290-L292).

Will exporting to TH2 be available soon and/or do you have a timescale for that?

jpivarski commented 6 years ago

Without this request, the timescale would be a few months because of backlog.

Would you be interested in implementing it yourself and submitting a pull request? It may be easier than you think. Do you see the NotImplementedError in export.py, as well as the code for handling one-dimensional histograms and profiles above it? The only tricky thing is underflow/overflow handling: ROOT treats the 0th index as underflow, while histbook allows the underflow/overflow/nanflow bins to not exist, which means that the meaning of the bin indexes shifts by one depending on whether there's an underflow or not. (ROOT has no concept of nanflow, so ignore it.) However, the one-dimensional implementation illustrates this pattern.
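To make the index shift concrete, here is a minimal NumPy-only sketch, not the code in export.py. The helper name `to_root_layout` and its parameters are hypothetical, and it assumes the per-axis values arrive ordered as optional underflow, then the regular bins, then optional overflow, then optional nanflow. ROOT's side of the convention is fixed: index 0 is underflow, indexes 1 to N are the regular bins, and index N+1 is overflow.

```python
import numpy as np

def to_root_layout(values, numbins, underflow=True, overflow=True, nanflow=True):
    """Hypothetical helper: copy one axis of bin values into ROOT's layout.

    Assumed input order: [underflow?] + regular bins + [overflow?] + [nanflow?].
    Output order (ROOT): [underflow, bin 1 .. bin N, overflow], length N + 2.
    """
    values = np.asarray(values, dtype=float)
    expected = numbins + int(underflow) + int(overflow) + int(nanflow)
    if values.shape[0] != expected:
        raise ValueError("expected %d entries, got %d" % (expected, values.shape[0]))

    out = np.zeros(numbins + 2)
    # Without an underflow bin, the first input value is regular bin 1, not index 0.
    start = 0 if underflow else 1
    # Exclusive end: stop before the overflow slot unless an overflow value exists.
    stop = numbins + 1 + int(overflow)
    # Any trailing nanflow entry falls outside this slice and is dropped.
    out[start:stop] = values[:stop - start]
    return out

# Three regular bins with no flow bins: contents land at ROOT indexes 1..3.
no_flows = to_root_layout([1, 2, 3], 3, underflow=False, overflow=False, nanflow=False)

# Same bins with underflow=9, overflow=8, nanflow=7: the nanflow is discarded.
all_flows = to_root_layout([9, 1, 2, 3, 8, 7], 3)
```

Missing flow bins are left at zero here, which is one reasonable choice since ROOT always allocates them.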

The one thing that the one-dimensional version doesn't handle is multiple dimensions. The content that all dimensionalities use as input is from Hist.table, a high-level, user-facing function that makes Numpy record arrays with a shape that matches the dimensionality of the histogram. Just as with the one-dimensional case, you'd be able to pick out "count()" and "err(count())" (as a record array) but also use nested indexes to pick out bins in a simple double-for loop (accounting for the possible missing underflow at each level). The hard part might be the ROOT multidimensional bin indexing. (I don't know if ROOT sets bin contents with a serialized bin index, a multidimensional one, or both.)
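On the bin-indexing question: as far as I know, ROOT accepts both forms, `SetBinContent(binx, biny, content)` on a TH2 and `SetBinContent(bin, content)` with a serialized "global" bin, where for two dimensions `bin = binx + (nx + 2) * biny`. The sketch below shows the double-for loop under that convention using plain NumPy, so it runs without ROOT or histbook. The function names are hypothetical, and the input is assumed to be a 2-d array of counts (for example the "count()" field of the table) whose axes are each ordered as optional underflow, regular bins, optional overflow, optional nanflow.

```python
import numpy as np

def pad_axis(values, axis, numbins, underflow, overflow):
    """Hypothetical helper: bring one axis to ROOT's N + 2 layout.

    Drops any trailing nanflow entry and leaves missing flow bins at zero.
    """
    values = np.asarray(values, dtype=float)
    shape = list(values.shape)
    shape[axis] = numbins + 2
    out = np.zeros(shape)
    start = 0 if underflow else 1
    stop = numbins + 1 + int(overflow)
    src = [slice(None)] * values.ndim
    dst = [slice(None)] * values.ndim
    src[axis] = slice(0, stop - start)
    dst[axis] = slice(start, stop)
    out[tuple(dst)] = values[tuple(src)]
    return out

def root_global_bin(binx, biny, nx):
    """Serialized bin index for a 2-d ROOT histogram with nx regular x bins."""
    return binx + (nx + 2) * biny

def to_root_th2_layout(counts, nx, ny, xflows=(True, True), yflows=(True, True)):
    """Return (padded 2-d array indexed [binx, biny], flat array by global bin).

    xflows and yflows are (has_underflow, has_overflow) pairs for each axis.
    """
    padded = pad_axis(counts, 0, nx, *xflows)
    padded = pad_axis(padded, 1, ny, *yflows)
    flat = np.zeros((nx + 2) * (ny + 2))
    for binx in range(nx + 2):
        for biny in range(ny + 2):
            flat[root_global_bin(binx, biny, nx)] = padded[binx, biny]
    return padded, flat

# A 2 x 2 histogram with no flow bins on either axis.
padded, flat = to_root_th2_layout([[1, 2], [3, 4]], nx=2, ny=2,
                                  xflows=(False, False), yflows=(False, False))
```

With a real TH2 the inner statement would become a `SetBinContent` call; the flat array is only there to show where each bin ends up in the serialized ordering, which has x varying fastest.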

clelange commented 6 years ago

Hi Jim,

OK, that sounds doable, but I won't be able to look into this before September, and then the timescale would be weeks. I'll write again once I've started looking into it. With TH2 implemented, TH3 should be straightforward.

jpivarski commented 6 years ago

Thanks, and I understand about the timescale!

I'll leave this issue open so that you can use it to ask me any questions, and so that other users can follow the development (or take it over if they need 2-d histograms on a shorter timescale).

jpivarski commented 5 years ago

Thanks!