scikit-hep / uproot3-methods

Pythonic behaviors for non-I/O related ROOT classes.
BSD 3-Clause "New" or "Revised" License
21 stars 28 forks source link

Generalize "pandas" method to support TH2 #59

Closed guitargeek closed 5 years ago

guitargeek commented 5 years ago

Hello!

I would really enjoy a method for the TH2 which converts it into a pandas data frame that is easily human-readable. This way we could compare the values in different TH2s with the same binning very quickly, which I would need to compare two dimensional scale factors for example.

Since a similar pandas function already exists for the TH1 in uproot-methods (which the TH2 inherits but that does not work there), I thought it might be good to push also my implementation for the TH2?

The pandas function for the TH1 gives you DataFrames like these:

TH1_pandas

I did not really like this index with the bin edges, as it looks dangerous to use with floating point bin edges. I propose instead to have some MultiIndex with the bin indices for the TH2 case:

TH2_pandas

And concerning the unit tests: can I easily create a TH2 without reading it from some ROOT file to implement a test?

What do you think?

Cheers, Jonas

jpivarski commented 5 years ago

First of all, thanks!

The MultiIndex is the most important thing. In principle, I think it should be a MultiIndex of IntervalIndexes because that conveys the most information. If you have a DataFrame with integer indexes for bins, you don't know whether it means "0 to 10," "-5 to 5," "1000 to 2000," etc. Manual bookkeeping goes against the point of DataFrames.

You don't have to select rows in a DataFrame using floating point numbers, which would be dangerous (floating point equality can be imprecise). A DataFrame's loc is in data coordinates, like these intevals, but iloc is for selecting ranges of integer bins. It's not necessary to make the index integers to select them as integers.

Thoughts?

guitargeek commented 5 years ago

Hi Jim, thanks for your answer! No thoughts yet, I first have to understand what IntervalIndices are :D But I'll come back to that.

jpivarski commented 5 years ago

IntervalIndex is in the first screenshot: it can be composed with MultiIndex to get intervals at two levels.

I wish the graphical representation didn't show as many digits—it used to be rounded, which is good for significands that aren't exactly a multiple of 2.

guitargeek commented 5 years ago

Oh okay that's very cool! Yes I was a bit confused by the representation, which didn't look very nice. But I see that this should not keep us from using the IntervalIndex! I'll update the PR shortly.

guitargeek commented 5 years ago

After playing around with the IntervalIndex, I realized we could actually implement the function in a general way for n dimensions.

Here how it looks for 1D:

TH1_pandas

And here for 2D:

TH2_pandas'

The only thing that does not generalize is how the names for the indices are extracted.

For 1D it changed a bit, because I did not re-implement the dropping of empty bins. You think this should really be done? Zeros counts is also some information after all, isn't it?

jpivarski commented 5 years ago

There's no need to drop empty bins, and I don't know why I might have done that. (Pandas does like sparse representations, but I'm sure it would confuse a physicist )

This looks good—once the tests pass, I'll accept it.

guitargeek commented 5 years ago

Nice! Sorry for interfering and force-pushing in the meantime, I thought people in the US still sleep and I didn't expect your response. I just changed a little thing in my last commit and added:

        if dim == 3:
            index_name_objects = (self._fXaxis, self._fYaxis, self._fZaxis)

...which might generalize this also for TH3.

guitargeek commented 5 years ago

Sorry, another little change. I should use elif for the dimension checks instead of one if for each dimension...

jpivarski commented 5 years ago

We can leave that in there as an expression of intent—when it first gets tested with a TH3, it might fall, but we can look at the code and see that it was supposed to work (and probably will with a minor tweak, but we just don't know until there's a test).

If you have other ideas, let me know and I'll hold off on merging, otherwise, give me the go-ahead and I'll do it. (We normally express that with a [WIP] in the PR name.)

guitargeek commented 5 years ago

Yes, that was my thought there. But now I'm out of ideas, so you can merge when you are ready! I double checked that my use-case is fulfilled. Thank you!