scikit-hep / aghast

Aghast: aggregated, histogram-like statistics, sharable as Flatbuffers.
BSD 3-Clause "New" or "Revised" License

Connector to fnal_column_analysis_tools histograms #11

Closed: nsmith- closed this 5 years ago

nsmith- commented 5 years ago

Things seem to match up well, the only missing functionality on the aghast side is the sparse category binning, the subject of #10

nsmith- commented 5 years ago

also, first!

nsmith- commented 5 years ago

I'd rather not have warnings anywhere in this code. It's likely to be buried, so a user-readable message wouldn't be useful—if somebody did actually see it, it would probably just be confusing.

Either quietly drop parts of the ghast because they're not essential—ghast to specific library conversions will always be conversions, anyway—or raise ValueError exceptions if they're essential.

Some conventions are result-changing, so I'd think it better to warn about them. I don't think silent is the right answer, but on the other hand it's sort of up to the user to decide whether they matter. Maybe a flag in the calling function to ignore certain 'somewhat-compatible' differences, otherwise ValueError? Could you hide the ovf_convention function by defining it inside the function where it's used? That way, the only functions this module exports are ones that the user would conceivably want access to.

It's currently used in both functions, so I'll add a _ prefix to mark it 'private'. No need for the try-import error here: the submodule is not loaded by any other modules (calling code must explicitly ask for it), so that code will be the gatekeeper for whether the dependencies are actually installed. See the other submodules in this directory.

I followed the convention used in pandas.py.
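The "gatekeeper" lazy-import pattern described here can be sketched generically (this is an illustration of the pattern, not aghast's actual code; `lazy_convert` is a hypothetical name, and `json` stands in for an optional dependency like pandas):

```python
import importlib

def lazy_convert(module_name, payload):
    # Gatekeeper pattern: the dependency is resolved only when a conversion
    # is actually requested, not when the package itself is imported, so the
    # ImportError (with the library's own message) surfaces only on use.
    mod = importlib.import_module(module_name)
    return mod.dumps(payload)

# Works when the dependency exists:
ok = lazy_convert("json", {"a": 1})

# Fails only at call time when it doesn't:
try:
    lazy_convert("definitely_not_installed_xyz", {})
    missing_raised = False
except ImportError:
    missing_raised = True
```

With this arrangement, no try/except around the import is needed at module scope; the calling code is the only place that can trigger the failure.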

jpivarski commented 5 years ago

Geese trounce all!

jpivarski commented 5 years ago

Oh, I cancelled that review too early: it still has warnings, and we need to talk about that. I'll go case-by-case.

(The import error thing is fine; I'm just going for consistency and you pointed out that I'm inconsistent. I'll address it globally someday, either making them all try-catch or none.)

nsmith- commented 5 years ago

Ok, last thing: is there a function to return a Numpy array with alternative overflow conventions? I saw the h.counts[-np.inf, np.inf], but h.counts[-np.inf, np.nan] did not seem to do what I expected.

jpivarski commented 5 years ago

No, I just couldn't find a way to express it in a slice. If you can think of a preferred way to ask for an arbitrary arrangement of bins, underflow, overflow, and nanflow, I can add it.

Perhaps something like a fancy index? But we wouldn't want to require the user to specify all the regular bins in their normal order, since this would be a common case.

Oh! Maybe [-inf, ..., inf, nan], where we use the ellipsis symbol for regular bins?

jpivarski commented 5 years ago

Do you need that feature before this PR can be finished?

jpivarski commented 5 years ago

I just added it (6fbac01f0ab11d66dc6744d0c1928a1a14903d26). You should be able to say

ghastly_hist.counts[[-numpy.inf, ..., numpy.inf, numpy.nan]]

to get the underflow bin, then all the normal bins, then the overflow bin, then the nanflow bin.

You can ask for a different order if you need it—Aghast has most of the fancy indexing tricks that Numpy does, with special treatment of overflows, since 0 always means the first non-overflow bin, no matter how it's laid out in memory.
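The idea of mapping symbolic labels to an arbitrary memory layout can be illustrated in plain Numpy (a hedged sketch of the concept, not aghast's implementation; `reorder_counts` and the `layout` dict are hypothetical):

```python
import numpy as np

def reorder_counts(counts, order, layout):
    """Hypothetical helper: build a plain-Numpy fancy index from symbolic
    labels, where -inf means underflow, inf means overflow, nan means
    nanflow, and Ellipsis expands to all regular bins in their order."""
    index = []
    for label in order:
        if label is Ellipsis:
            index.extend(layout["regular"])
        elif label == -np.inf:
            index.append(layout["underflow"])
        elif label == np.inf:
            index.append(layout["overflow"])
        else:  # nan compares unequal to everything, so it falls through here
            index.append(layout["nanflow"])
    return counts[np.array(index)]

# Suppose the buffer happens to store [overflow, underflow, nanflow, bin0..bin4]:
buf = np.array([90, 80, 70, 0, 1, 2, 3, 4])
layout = {"underflow": 1, "overflow": 0, "nanflow": 2, "regular": [3, 4, 5, 6, 7]}
result = reorder_counts(buf, [-np.inf, Ellipsis, np.inf, np.nan], layout)
# result -> [80, 0, 1, 2, 3, 4, 90, 70]
```

Whatever the memory layout, the symbolic request always comes back as underflow, regular bins, overflow, nanflow.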

nsmith- commented 5 years ago

Ok that seems to work, although is there an option to do some sort of fillna-like thing, e.g.

import numpy as np
from aghast import Histogram, Axis, EdgesBinning, UnweightedCounts, InterpretedInlineBuffer

h = Histogram(axis=[Axis(EdgesBinning(edges=np.arange(6)))], counts=UnweightedCounts(InterpretedInlineBuffer.fromarray(np.arange(5))))
h.counts[[-np.inf, ..., np.inf, np.nan]]

throws an error because there is no overflow. Should I handle that myself?

jpivarski commented 5 years ago

That was a conscious choice, though I'm open to debating the wisdom of it. It seems to me that if you explicitly ask for an overflow bin and there isn't an overflow bin, you should get an error to give you a chance to think about what you're doing. Simply ignoring it would give you a Numpy array of a different shape than you'd expect, and you might end up confused about an error downstream.

Python and Numpy slices ignore missing endpoints. If you ask for [5:15] in an array that only goes up to 10, you'll get the same thing as [5:10]. So raising an error here is inconsistent with that.

Numpy fancy indexes will raise an error if you ask for indexes that don't exist, so this is consistent with that. I'm considering this an extension of the latter convention.

Still: up for debate.
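The two Numpy conventions being contrasted can be checked directly:

```python
import numpy as np

a = np.arange(10)

# Python/Numpy slices silently clip out-of-range endpoints:
clipped = a[5:15]  # same result as a[5:10]

# Numpy fancy indexes raise if a requested index doesn't exist:
try:
    a[[12]]
    raised = False
except IndexError:
    raised = True
```

Aghast's choice follows the fancy-index convention: asking for a bin that doesn't exist is an error rather than a silent shape change.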

nsmith- commented 5 years ago

I suppose this is a similar situation to the slightly-incompatible axis thing earlier: do we want to error rather than filling zeros for unknown bins? I'm starting to prefer the idea of adding a strict=False option to toX() and fromX().

jpivarski commented 5 years ago

Strictness as an option is not a bad idea, though it's a rabbit hole as to how many options we provide. We need breadth (new formats) before this kind of depth.

As for zero bins, that's our model for combining histograms with different binnings: the idea is that a histogram that doesn't specify certain bins is representing a distribution in which those are zero. The underflow/overflow is somehow representing the range all the way out to infinity...

This behavior is similar to Pandas's merge, but instead of imputing nan for missing rows, we impute zero, and this makes more sense for additive things like histograms.
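The zero-imputing merge can be sketched in plain Numpy (a hedged illustration, not aghast's implementation; it assumes both histograms' bins are equal-width and labeled by their left edges, and `add_aligned` is a hypothetical name):

```python
import numpy as np

def add_aligned(lefts1, counts1, lefts2, counts2):
    """Add two histograms whose bins are labeled by their left edges,
    imputing zero for bins missing from either histogram (the additive
    analogue of a pandas outer merge, which would impute NaN)."""
    all_lefts = np.union1d(lefts1, lefts2)
    total = np.zeros(len(all_lefts), dtype=counts1.dtype)
    # Scatter each histogram's counts into the union positions; bins
    # absent from a histogram simply contribute zero.
    total[np.searchsorted(all_lefts, lefts1)] += counts1
    total[np.searchsorted(all_lefts, lefts2)] += counts2
    return all_lefts, total

lefts, total = add_aligned(np.array([0, 1, 2]), np.array([5, 3, 1]),
                           np.array([2, 3, 4]), np.array([2, 7, 9]))
# lefts -> [0, 1, 2, 3, 4], total -> [5, 3, 3, 7, 9]
```

Only the shared bin (left edge 2) receives contributions from both inputs; every other bin is treated as zero in the histogram that lacks it.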

The strictness you're looking for might be #2, where there will be an option to disallow adding histograms if their bins don't exactly match. (And "fast," where you just assume everything's fine and go!)

nsmith- commented 5 years ago

> As for zero bins, that's our model for combining histograms with different binnings: the idea is that a histogram that doesn't specify certain bins is representing a distribution in which those are zero. The underflow/overflow is somehow representing the range all the way out to infinity...

In that case, I think the slicing should also produce zero bins when a user asks for overflow, at least by default. Maybe a strict option would be good here as well.

jpivarski commented 5 years ago

Oh, I see. Yes, I understand the logic there, but that would mean it would have to create bins, not just select from them and shuffle them around. That wouldn't strictly be a Numpy fancy index.

I had to go through that pain for the loc and iloc, but it greatly increased the complexity. By the time I got to the counts indexes, I was exhausted and did the simple thing.

At least having an error now means that it can be expanded to fill zeros in the future. You can extend features into previous error states, but not vice versa.

nsmith- commented 5 years ago

In that case, I can leave this as-is. Merge if you're happy.