.to_numpy() for profiles and other binned statistics

scikit-hep / uproot3

ROOT I/O in pure Python and NumPy.

BSD 3-Clause "New" or "Revised" License

314 stars 67 forks source link

.to_numpy() for profiles and other binned statistics #511

Closed HDembinski closed 4 years ago

HDembinski commented 4 years ago

In https://github.com/scikit-hep/boost-histogram/issues/413, @henryiii and I are currently discussing what the .to_numpy() method should return for histograms that have more than a trivial storage and need to return an estimate of the variance per bin in addition to the value.

We currently return a view into our storage, e.g. for a boost-histogram with Weight() storage, that is a record array with value and variance fields.

Henry suggests that we should simplify this and only return the values, effectively dropping the uncertainty estimates. I am against this, because for any practical use, the variances are as important as the values. I don't mind dropping meta-data in a .to_numpy() conversion, but not actual data, which consists equally of values and uncertainty estimates.

Henry further makes the valid point, that we should be consistent with other libraries and explicitly mentions uproot4 here:

.to_numpy could gain an argument view=True, which would cause it to return the view instead of just the values (the default, to match other libraries like Uproot 4).

I think:

we have some freedom to decide what .to_numpy() conversion means in this case, since we lack a template for this situation from numpy
the conversion should not discard uncertainty estimates, because this would make .to_numpy() unusable in practice
we should try to find a common format what .to_numpy() should return for binned statistics and then consistently use that format in all relevant scikit-hep libraries

If we would normally return (values, edge0, edge1, ...), then we could return ((values, variances), edge0, edge1, ...) or ((values, sigmas), edge0, edge1, ...) for a binned statistic, where values and variances are ordinary numpy arrays.

Returning "sigmas" vs. "variances"

returning sigmas is more intuitive for the casual user
returning variance makes more sense from a statistics and computational point of view
- values and variances are additive, sigmas are not
- uncertainties estimators always compute a variance; returning "sigmas" means one has to do extra work by taking the square root, which a user who actually wants the variances has to undo with yet doing more extra work by squaring the sigmas

Boost.Histogram in C++ therefore always returns variances and not sigmas. In C++, this choice is rather clearly dictated by the strong preference for solutions that do minimal work to achieve some goal ("zero overhead" principle).

jpivarski commented 4 years ago

The extension from (values, edge0, edge1, ...) to ((values, variances), edge0, edge1) is a natural one: this is the current behavior of Uproot4 TH* (which should be thought of as a "sketch"; I want to converge on a common API for functions named to_numpy on histograms).

Currently, the interface (after some iteration with @henryiii) is:

def to_numpy(self, flow=True, dd=False, errors=False):

where

flow=True means that underflow and overflow bins are included (below and above the values, respectively). ROOT histograms don't have a categorical axis, but if an axis has fLabels, it can be interpreted as categorical, and maybe then the underflow should always be dropped? (It currently isn't.)
dd=False means that edges0, edges1, ... are not wrapped in a sub-tuple, following the interface of NumPy's histogram and histogram2d, dd=True means that they are wrapped in a sub-tuple, following NumPy's histogramdd.
errors=False drops errors/variances, errors=True turns the values into a tuple: (values, errors). Currently, Uproot4 is returning errors, not variances, and this was motivated by TProfile. TProfile has an additional parameter:
error_mode, which can be "" (kERRORMEAN), "s" (kERRORSPREAD), "i" (kERRORSPREADI), "g" (kERRORSPREADG), which just goes to show how complicated specifying errors can be.

On variance vs errors, I'm leaning toward variances: you have more choices to make when turning variances into errors (I'm thinking about defining degrees of freedom as N or N - 1), variances are additive, so variances are more "primary." On the other hand, users want errors more directly, for instance, for a quick plot.

What about the following? errors=None by default, but it's not a boolean, it's an option that both specifies whether you want errors or variances, and if errors, what kind of errors, such as to TProfile's options. Such a parameter shouldn't be called "errors", though.

HDembinski commented 4 years ago

On variance vs errors, I'm leaning toward variances: you have more choices to make when turning variances into errors (I'm thinking about defining degrees of freedom as N or N - 1), variances are additive, so variances are more "primary." On the other hand, users want errors more directly, for instance, for a quick plot.

I hope you read my arguments in favor of variances, too, since you didn't comment on them here.

def to_numpy(self, flow=True, dd=False, errors=False)

Looks good, except that errors=True should be the default. It really makes no sense to suppress essential data by default.

HDembinski commented 4 years ago

An interface should not shoehorn things together which are different. The general case is to return values and variances. The normal counting histogram that we know and love is a special case, actually, where variance == value, because of Poisson. In general, that is not the case. We cannot make one special case dictate the interface.

Already when you fill your counting histogram with weights, variance != value, one does not even need profiles.

jpivarski commented 4 years ago

If to_numpy is going to return what np.histogram returns (in the 1D case, similarly in the other cases), then it has to not return the errors by default. That is,

values, edges = np.histogram(some_data)

values, edges = uproot_histogram.to_numpy()

(values, errors), edges = uproot_histogram.to_numpy(errors=True)   # non-default case

Oddly, it seems that you can supply weights to a NumPy histogram and not get errors or variances back. That seems to be an oversight in NumPy's API.

I saw your arguments for returning variances and agree with them. In the above, I could replace "errors" with "variances." I'm wondering if what we want is a parameter with more states than boolean, so that we can say, "return only values," "return values and variances," or "return values and errors." That way, it's as easy to get at the primary quantity, variances, as it is to get at the quantity that's more often needed for convenience, errors.

henryiii commented 4 years ago

I mostly want consistency in the API here, but I do think that you should be able to do:

values, edges = histogram.to_numpy()

regardless of the Storage, and get the same answer, and it should look like what NumPy would give you. For example, if you have a plotting routine that doesn't know how to plot errors, you would just use this (2D plotting, for example?). If you do:

_ = histogram.to_numpy(variances=True)

you obviously already expect that your histogram has a Storage that has variances recorded, which is not natural to NumPy. If the storage does not have variance information, this returns None for the variance, so that a plotting routine that does know how to plot variances can always call this, but then check to see if the variances was non-None before plotting the errors.

(Instead of plotting routine, insert any other generic Histogram consumer that does not know the exact storage used a priori).

This has a second benefit: static typing. You can do:

values, *edges = histogram.to_numpy()
(values, variances), *edges = histogram.to_numpy(variances=True)

regardless of the storage. You could never return a tuple for values by default on all storages in a method called to_numpy! But it would be okay if you ask for it with variances=True.

But I do value consistency across libraries more than the details, that's just what I think.

HDembinski commented 4 years ago

If to_numpy is going to return what np.histogram returns (in the 1D case, similarly in the other cases), then it has to not return the errors by default. That is,

I understand that, but my proposal was to return either values, edges = uproot_histogram.to_numpy() or (values, errors), edges = uproot_histogram.to_numpy() depend on what uproot_histogram is. If it is a something with a non-trivial storage, you would return the second form by default. You could still have the error keyword (which should be called variance. It could have three values

None: chose return format based on the actual type that is converted
False: don't return variance (default)
True: return variance (which would be None if there is nothing to return)

I understand how from a static typing point of view, you don't like the None option, so I am also fine without it, I just wanted to explain my position. I don't like how mypy is changing Python from a dynamic language to something that feels more and more like C++, but that's apparently where we are going.

Oddly, it seems that you can supply weights to a NumPy histogram and not get errors or variances back. That seems to be an oversight in NumPy's API.

~Yes, because they tried to shoehorn this functionality into the existing interface. :(~

Numpy's histogram is not a statistics tool. It is a bare-metal histogramming tool that has no understanding of uncertainties and what the weights mean. The computation of the variance depends on what meaning these weights have. Their convention works for them, but not for us, we make histograms explicitly for statistical data analysis.

HDembinski commented 4 years ago

But I do value consistency across libraries more than the details, that's just what I think.

Dropping variances by default is not a detail. You loose data. What would be the realistic use case for a .to_numpy() which drops the variances? In other situations you were happy to fix "mistakes" in numpy, @henryiii , remember our discussion about h[0] vs h[0, ...]?

I find the argument terrible, but I can compromise on @henryiii 's proposal. It simply means that .to_numpy() is becoming practically useless, but maybe that's ok, we don't want to convert to numpy format anyway, since it is so limited. Any real use of this would always do .to_numpy(variance=True) which is a bit more to type, but that seems what you prefer.

HDembinski commented 4 years ago

My last minor issue is the dd keyword, which should be removed. Interfaces should be minimal (@jpivarski do you remember how you criticized my original iminuit logo, the same design principles for logos also apply to APIs). The dd keyword does not add real value.

When you make this the default

values, edge0, edge1, edge2 = uproot_histogram.to_numpy()

someone who wants the edges as a tuple can simply do

values, *edges = uproot_histogram.to_numpy()

That is even less to type than adding the keyword dd=True (and makes the documentation and testing of that function simpler).

So, can we converge on this?

def to_numpy(self, flow=True, variance=False):

Note the use of singular variance instead of variances. The singular form is generally recommended in interfaces over the plural form unless one must be consistent with prior use.

alexander-held commented 4 years ago

What do you think about an extra keyword argument to return sqrt(variance) instead of variance (or in addition to it)? From an analyzer's point of view the "error" (like in the ROOT TH1 GetBinError()) may be a more familiar quantity to handle than the variance. I see the computational argument of going with variance instead, but there are applications where I'd trade the additional computation for a more familiar looking code.

jpivarski commented 4 years ago

TProfile has an "error mode", perhaps errors vs variance could be folded into a common error mode parameter?

andrzejnovak commented 4 years ago

This discussion has escaped me before and since it seems to be spread around a bit, I'm sorry if I missed something. I think the naive user expectation for a result of a method called to_numpy() is unsurprisingly something that looks like a result of np.histogram which means by default flow=False and variance/error=False. IMHO there would have to be a really good reason to divert from that.

Dropping variances by default is not a detail. You loose data. What would be the realistic use case for a .to_numpy() which drops the variances?

I think this may have been hyperbole, Raw values without errors are shown in a ton of plots.

henryiii commented 4 years ago

I'm going to make a proposal then, and then let's iterate on that:

The canonical definition is:

def to_numpy(self, *, flow : bool = False, mode : str = "numpy")

It returns exactly what histogram1d and histogram2d return, expand to ND:

values, ax0_edges, ax1_edges, ax2_edges, ...

The ~view~ mode (edit!) argument takes a string argument. Implementations are required to accept "numpy" by default, and "variance", which changes the values into a tuple, (values, variance). Boost-histogram/Hist will also allow mode="view", which will return a boost-histogram lossless view instead. mode="error" might be a useful mode too that Uproot might want to add, maybe Hist too.

Implementations are allowed to add a keyword, such as dd=True that allows the values, edges_tuple` output that histogramdd uses, but it will not be included in boost-histogram (if Uproot included it, Hist probably will too).

I think this is a reasonable compromise: to_numpy by default behaves exactly like the (by far) most common functions histogram and histogram2d, it is easy to to adapt to histogramdd manually, and it keeps the promise of actually returning the same thing that NumPy does by default. But it's also easy to get the variances, and can very simply support boost-histogram's full power if the user wants it / knows about it.

How does that sound?

jpivarski commented 4 years ago

I edited the above because I think you meant "mode argument".

flow: bool = False: yes, I think that makes a lot of sense.
dd: bool = False as an optional extension: sure. As long as nobody's asking for it, I'd just as soon not implement it. It would be simpler if we could point to only one interface; it's just a matter of tuple unpacking, anyway. (If you're working interactively in a notebook, you see the issue right away.)
mode: str = "numpy": I wonder why we have to pack so many options into one method, particularly since that means we have to switch between them with a string. (The method name is a string.) I thought the interface we were talking about would be a suite of multiple methods, with to_numpy returning what the np.histogram* functions return (with options for including or excluding flow bins, which would simulate calling np.histogram* with a bins array that ends and begins with ±∞) and other methods like values and variance returning arrays of values and variances. Are we really talking about one method, to_numpy, parametersized by a categorical to get all these different outputs?

How about

@property
def ndim(self)                             # as in boost-histogram and hist (I like it)

def values(self, *, flow=False)            # n-dimensional array of bin values
def variances(self, *, flow=False)         # same shape as values, has the variances in it
def edges(self, axis=0, *, flow=False)     # just the edges, accepting -3, -2, -1, 0, 1, 2, "x", "y", "z"

def to_numpy(self, *, flow=False):         # to_numpy is actually a derived function
    if self.ndim == 1:
        return self.edges(flow=flow), self.values(flow=flow)
    else:
        ...

def errors(self, *, flow=False):           # so is errors; provided as a convenience
    return numpy.sqrt(self.variances(flow=flow))

None of these names are taken on boost-histogram or hist objects. The new thing (for Uproot) would be to take a flow argument and have its default be False, rather than True, but I can accept that. I'd just rather not put categorically different behavior in a string-valued argument of a method. A collection of methods is a more natural way to express categorically different functionality.

Maybe errors is optional, but I'd implement it in Uproot.

Maybe the axis="x", "y", "z" is optional, but I'd want it in Uproot because of the strong association between letters and histogram axis in the ROOT world. Personally, I prefer axis as a number in -ndim <= axis < ndim.

Plurals on all names for consistency (and the fact that they all return arrays of the things named)?

I said earlier that I had no opinions on this, but I guess I did. Sorry!

andrzejnovak commented 4 years ago

+1 for optional axis in edges. Feels redundant on a TH1.

henryiii commented 4 years ago

@HDembinski will have opinions.

There are two projects. I think we seem to be discussing both here.

One is to add a Protocol that all producers (boost-histogram, Hist, uproot, Physt, ...) will adhere to, and all consumers will accept (plotting libraries, fitting libraries, etc). I proposed something that looks a lot like @jpivarski's suggestion in https://github.com/scikit-hep/boost-histogram/issues/423 - the main difference is that axes are handled differently (and support passing titles/names optionally, along with other information a plotting library might want, like circular status). I have a potential alternate idea that would not add a bunch of methods to histograms, but would have a __histogram_dict__() (or something similar) that would return a dict that could describe the data in a uniform way, without adding API to histograms and without conflicting with anything; the downside is that there are then two interfaces, users might benefit from having quick ways to access things plots need, and libraries have to learn the "library" interface to implement it.

The second project is to define what to_numpy does. It should not be the main plotting API (which it is now); it's really meant for compatibility with NumPy; if you have an interface that accepts NumPy arrays now, it should allow you to use it. For example, using the brand new StepPatch artist that @andrzejnovak wrote that was just merged into matplotlib (https://github.com/matplotlib/matplotlib/pull/18275):

# h is a 1D histogram
ax.stairs(*h.to_numpy())

If you want to add errors, by the way, you could do:

ax.stairs(*h.to_numpy())
if h.errors() is not None:
    ax.errorbar(*h.axes.centers, h.values(), yerr=h.errors())

If to_numpy() returns tuples or views or anything besides a NumPy array of values, then the above example will not work, and you'll have to restructure your values plotting to handle the change in output. (But, as I point out above, optional arguments could be added).

HDembinski commented 4 years ago

The uniform interface idea does work well with boost-histogram, because bh is so flexible. And it violates the orthogonal design. In bh, the accumulator view offers the interface to access things like value and variance. The histogram itself does not need to know what the view provides. Adding methods to the Histogram itself breaks this orthogonality. bh would need more than value and variance to represent its rich set of accumulators. Think of profiles, where you might want the variance of values in a bin or the error on the mean. To compute the latter you need to divide by the number of counts per bin.

The histogram_dict is going in a better direction, but an accessor class is more flexible and efficient. The dict would store values, which may require time to compute values for some fields which may then be not used (for a profile, accessing the variance requires a computation, wasted time if the variance is not used to draw an error bar). The accessor class can provide the uniform interface through methods, which only compute values when they are requested.

henryiii commented 4 years ago

The dict would store views, or it could be dict-like (which is how __dict__ is implemented, btw, it's not a true dict). A dict is generally nicer to query programmatically than creating another object, and arbitrary libraries can all create dicts.

By the way, none of this would remove .view(), that still represents the full power of boost histogram, a view into all the data. What we are trying to design is a reduced set of functionality, but one that is common between all histogramming libraries. it would not need to fully implement the power of boost-histogram's accumulators; that could be putting too heavy a requirement on all other libraries.

jpivarski commented 4 years ago

I think you mean it "does not work well". Okay, not all boost-histogram objects have variances, and the accumulator may be a combination of value and variances or we don't know what because a user can put anything at all in there. Makes sense.

So let's think back to the motivation for this protocol: it's to reduce confusion among users who use all three libraries or any pair. A choice like "no underflow/overflow by default, and the option to turn that on is spelled 'flow=True'" goes a long way toward reducing that confusion because that's a little detail that users aren't likely to be thinking about until it causes problems. (Using -np.inf and np.inf as edges endpoints has confused some people so far.)

What about

if a bh object doesn't have variances, the method raises an error?
if a bh object doesn't have variances, it doesn't have a variances method?
variances and methods that don't apply universally are all optional, not implemented by boost-histogram, but implemented by Uproot because ROOT histograms always have these things?

It might not be exactly the same, and forcing it to be the same in cases that don't make sense defeat the purpose of reducing user confusion. But let's at least get the spellings the same. (E.g. plural for "values", "variances", and "errors", right?)

henryiii commented 4 years ago

I'll let @HDembinski answer, but I just want to point out, the method should raise an error or return None, not "be missing"; all boost-histograms are a single class, and having methods be dynamic __dict__ items is a typing nightmare.

jpivarski commented 4 years ago

I'll let @HDembinski answer, but I just want to point out, the method should raise an error or return None, not "be missing"; all boost-histograms are a single class, and having methods be dynamic __dict__ items is a typing nightmare.

I thought so, but I mentioned it as a logical possibility. It can be statically typed by defining the right class hierarchy, but that may be more trouble than it's worth.

My main point is let's at least get agreement on the spellings and defaults, since those are the points that cause confusion.

andrzejnovak commented 4 years ago

I think the implementation is less important than a uniform public facing API. I like @jpivarski 's draft above, with few modifications.

@property
def ndim(self)                             # as in boost-histogram and hist (I like it)

def values(self, *, flow=False)            # n-dimensional array of bin values
def variances(self, *, flow=False)         # return None if missing
def edges(self, *, axis=0, flow=False)     # axis should be optional, unecessary typing for TH1s

def to_numpy(self, *, flow=False, variances=False, dd=False):         # to_numpy is actually a derived function
    if self.ndim == 1:
        if not variances:
            return self.values(flow=flow), self.edges(flow=flow) # values first
        else:
            return (self.values(flow=flow), self.variances(flow=flow)), self.edges(flow=flow) 
            # Not sure if better like (a, a, a) or ((a, a), a)
            # tuple in tuple is good bc it's more static aka h, bin = H.to_numpy() will still work, but it's
            # actually more likely to cause confusion down the line
    else:
        ...

# conveniences - great if implemented, but not a big deal
def errors(self, *, flow=False):           # so is errors; provided as a convenience
    return numpy.sqrt(self.variances(flow=flow))
def centers(self) # bin centers

jpivarski commented 4 years ago

Adding centers in addition to edges is a good idea; that's going to be frequently desired. centers can never have under and overflow bins, so it fits well with flow=False being the default, rather than the other way around.

I don't know why we want to_numpy to return variances. NumPy's histogram function does not return bin variances (even when it takes weights), so adding this to the to_numpy function makes it unlike what NumPy does. That's why I'm proposing other named methods for values and variances, so that the user doesn't need to know the details of how these things are packed into a tuple in order to unpack them, especially if that packing would have to be different from a well-known standard.

(The issues that periodically revitalize this question are almost always due to the return values having an unexpected shape; histogram2d-like vs histogramdd-like and the off-by-one error raised by edges vs centers.)

So far, the things that I've learned are:

methods are better than properties for these things, least of which so that the presence of under/overflow bins can be parameterized
flow=False should be the default, not flow=True
the output format of histogramdd is surprising; histogram2d is much more common; it could be parameterized with dd=False, but I'd just as soon keep things simple (address the issue for people who expected histogramdd-like, not for people who expected histogram2d-like)
the fencepost principle implicit in edges is confusing, so adding a centers is a good way to relieve that
ndim is a good idea, and it should be a property, not a method
edges should have a default axis=0, certainly for 1-dimensional histograms, maybe for all numbers of dimensions, just to be uniform
this one is my own: if we have named methods for edges, centers, values, variances, and errors, let's pluralize them all, not just some of them.

Even if we don't settle on a single API, having the same defaults for things like flow=False and dd=False will help a lot. Unexpected return values that are "off by one" are more confusing than Uproot having a variances method because ROOT does and boost-histogram objects not having a variances because Boost.Histogram is more flexible in its Storage. That's essential complexity, not accidental: differences in API because the things you're trying to describe have differences, not because people randomly chose different things because they didn't talk to one another.

andrzejnovak commented 4 years ago

Hmm, I think the to_numpy() discussion stems from two implicit meanings dump_(to_numpy)_like_histogram_output and dump_available_info_(to_numpy)_arrays, but I suppose if we have intuitive methods for everything then to_numpy can just be the former.

jpivarski commented 4 years ago

I had only been thinking about the former meaning, I guess.

henryiii commented 4 years ago

I do not think I would not implement a function like "centers". This is meant to be a minimal set of functions a plotting library can depend on always being present. A histogram library can choose to add conveniences, but we are not building a user facing interface - an excellent one already exists in boost-histogram. h.axes[0].centers, h.axes[0].edges, etc.

My original proposal was to make .edges a tuple-like of axes (can be a generated property). That way, you can have names/labels attached to them, you can have centers, etc, without polluting the main histogram namespace and worrying about axis=0; you get negative indexing for free that way too. h.axes[0].edges would be the guaranteed API, with .centers, .name, .label being optional. len(h.axes) could be a replacement for h.ndim, but since .ndim is common, we could keep that as part of the API, as it could be cheaper to call for some libraries (it is not for boost-histogram). If the idea is to force this to be a method call, I could live it with, but I think the tuple is better.

(It also avoids duplication an already nice interface, and users probably would gravitate to using the h.edges directly, and then asking for it to do everything h.axes.* does already, like centers, widths, etc)

.to_numpy() has always been "to a numpy histogram-like API", since NumPy's array is already the standard format for communicating data, so all methods are expected to return an array like already. NumPy histograms are not as powerful as even ROOT histograms, so .to_numpy() is necessary lossy (even beyond the data; turning axes into edges could lose information information like metadata, names, etc).

henryiii commented 4 years ago

Here's a slightly updated version of my original Protocol.

from numpy.typing import ArrayLike # requires NumPy master
from typing import Protocol, Optional, Tuple, Union, Iterable, Literal

class PlottableAxis(Protocol):
    edges: ArrayLike

     # For non-categorical axes, this returns None
    categories: Union[Iterable[int], Iterable[str], None]

    # name/title would be possible, but not required, so can't put here

class PlottableHistogram(Protocol):
    ndim: int
    axes : Tuple[PlottableAxis]

    # Values returns the array or the values array for specialized accumulators
    def values(self, *, flow : bool = False) -> ArrayLike: ...

    # Variance returns the variance if applicable, otherwise None
    # If counts is none, variance returns NaN for that cell (mean storages)
    def variances(self, *, flow : bool = False) -> Optional[ArrayLike]: ...

    #  May not be part of the final Protocol, but shown here for now
    def errors(self, *, flow : bool = False) -> Optional[ArrayLike]: ...

And, for to_numpy (which is not part of the plotting protocol, plotting functions should not be calling to_numpy, because is is lossy; title/name, circular status, and metadata are lost):

    def to_numpy(self, *, flow : bool = False): ...

I did not include any extra arguments for to_numpy as we don't have anything clearly set there, it seems. I don't like variances: bool = False as part of it, because that precludes adding errors: bool = False (what would you do if both were passed?). mode: Literal["values", "variances", "errors"] = "values" is nicer in that way, as libraries could add more required strings. dd : bool = False seems like a very trivial thing and easy to recreate in two lines, so didn't include it here, but libraries could be free to add it.

jpivarski commented 4 years ago

I like keeping this layer between the histogram, which has values, variances, etc., and the axis, which has edges and centers, so that the latter two have to be accessed through h.axes[0].edges and h.axes[0].centers. The centers, however, should be included because they are so frequently needed and (x[:-1] + x[1:])/2 is verbose and non-obvious. (See scikit-hep/uproot4#100.)

We need to be able to specify whether we want the extracted arrays to have under/overflow bins, though, and the way that it's specified ought to be uniform across values, variances, errors, axes[0].edges, and axis[0].centers. Since the first three are methods, allowing us to say flow=False or flow=True, the edges and centers on the PlottableAxis would have to be methods, too, to keep this uniform. Choosing to include or exclude flow bins changes the shapes of the arrays, and getting shapes to line up is a frequent source of errors.

(I have, in the past, used allvalues, alledges, allcenters, ... vs values, edges, centers... to distinguish between arrays with flow bins and arrays without flow bins, which doesn't require a method parameter, but I don't think this was a good idea. People still used the wrong one expecting it to have or not have flow bins; it would be hard to find the issue number, though.)

If the axis is categorical, wouldn't it make more sense for the centers to return an ArrayLike of strings, rather than having these be separately available as categories? If not, I know how to populate them: ROOT histograms always have numerical axes and fLabels is a plotting aspect. So I can work with that, but I'd be in favor of the conceptual nudge of saying that a categorical axis's values are strings; they're not numerical positions with strings to attach to the visualization. This could be accomplished in OOP form by having two subclasses of PlottableAxis:

class PlottableNumericalAxis(PlottableAxis):
    def edges(self, *, flow: bool = False) -> Sequence[float]:   # can be ArrayLike; must it be?
        pass

    def centers(self, *, flow: bool = False) -> Sequence[float]:
        # the centers of flow bins are infinite, from a literal interpretation of averaging their endpoints
        pass

def PlottableCategoricalAxis(PlottableAxis):
    # no edges

    def centers(self, *, flow: bool = False) -> Sequence[str]:   # can ArrayLike[str] be parameterized?
        # the flow=True case includes the "didn't match any predefined category" bin
        pass

Having h.variances return None when no variances exist sounds to me like a good solution to boost-histograms that don't have variances in their Storage type. The question-and-answer could be, "Tell me your variances." "I have none!" In fact, perhaps I should take advantage of this to return None for variances and errors when a histogram doesn't have a Sumw2 in ROOT, forcing the user to explicitly np.sqrt(h.values) if that's what they want. The advantage is that they would know that they're inferring variances from Poisson assumptions about the values, rather than that happening automatically because a histogram was created without a Sumw2.

For Uproot's view of ROOT histograms, I think I should also add xaxis, yaxis, zaxis as properties for axes[0], axes[1], axes[2] if a histogram has a given dimension. Boost-histogram wouldn't do this because the number of axes is a value, but in ROOT, it's a type. If a user knows they have a TH1F, accessing its axis properties through axes (plural) and [0] may not be obvious. It would be a concession to the fact that ROOT (v6) and Boost.Histogram have different views of a histogram's dimensionality: type vs value.

andrzejnovak commented 4 years ago

While having categorical axis return centers is quite clever since that's where one would put the labels, I think it would end up being confusing "Why do I get list of string when asked for centers?" I would rather, if present at all, it simply returned np.arange(len(categories)). On that note I think keys would actually be more transparent to a new user than categories (One usually gets a list of strings when calling it).

Numerical axis should definitely have centers, because generating them from edges is an annoying cookie cutter and I don't think it's good to assume this API will not be user facing. We can never cover everything a user might want to with their plots. I might, for example, want to plot histogram values as errorbars in the centers, but I want to overlay several histograms and keep everything visible, so I'd plot all values at an offset from the center within the bin. This is easily something one might want to do, but not something that any library would automate.

HDembinski commented 4 years ago

This is meant to be a minimal set of functions a plotting library can depend on always being present.

A more radical thought: is the flow keyword really needed for a Protocol? I think flow should be either always on or always off.

The case for always on: The histogram should provide all the info it was. Including the flow bins comes at no extra cost. It is up to the plotting tool whether it will use this or not.

The case for always off: The flow bins are needed to make projections work properly and they can be used as extra information to constrain a fit - although none of the established fitting frameworks currently uses this information. This makes the flow bins more alike to an implementation detail. The histogram surely needs them, but a generic plotting tool will never need them.

jpivarski commented 4 years ago

I can see the argument for to_numpy not having a flow option and always behaving as flow = False. Although it's possible to get under/overflow behavior in np.histogram by setting the bins to an array whose first and last elements are -np.inf and np.inf, that's not a common thing to do. Besides, to_numpy is lossy in other ways (dropping lots of metadata), it can be lossy in the under/overflow bins, too.

But for edges, centers, values, variances, and errors (the current list), these are not used for plotting and fitting only, but for custom logic. I've seen a lot of analysis scripts with loops over the entries of a histogram, computing a custom expression for each bin value. Arguably, they should be vectorized NumPy expressions, rather than for loops, but that doesn't take away from the fact that analysts have reasons to apply custom logic to the data in a histogram. Under/overflow bins are likely to be important in that custom logic.

In fact, scikit-hep/uproot4#91 was triggered by an exchange on StackOverflow that reminded me how useful to_pandas was. With the edges as a sparse IntervalIndex and values and variances as the two DataFrame columns, the purpose of such an object would almost always be to apply custom logic, almost never plotting. (I don't think there's an easy way to plot a DataFrame like that and have it look like a histogram, rather than a bar chart or something.)

HDembinski commented 4 years ago

For a generic lossy protocol, I would ignore flow to simplify the interface.

jpivarski commented 4 years ago

All of these duplicates will all go into scikit-hep/uproot4#167, where this interface will get figured out, once and for all.