Histogram slicing is not implemented

jlstevens commented 9 years ago

Currently __getitem__ on Histogram raises a NotImplementedError because the slicing semantics aren't obvious. I think we can do a few useful things though:

Slicing within a bin will obviously have to discard that bin entirely.
This means that a slice range selects all the bins that are entirely within that range.
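
To make the "entirely within the range" proposal concrete, here is a minimal NumPy sketch. The function name `slice_contained` and the inclusive treatment of both endpoints are assumptions made for this example; this is not HoloViews code.

```python
import numpy as np

def slice_contained(edges, values, start, stop):
    """Keep only the bins that lie entirely within [start, stop] (hypothetical helper)."""
    edges = np.asarray(edges, dtype=float)
    values = np.asarray(values)
    lower, upper = edges[:-1], edges[1:]
    # A bin survives only if both of its edges fall inside the requested range.
    idx = np.flatnonzero((lower >= start) & (upper <= stop))
    if idx.size == 0:
        return edges[:0], values[:0]
    # Sorted edges make the kept bins contiguous, so n bins need n + 1 edges.
    return edges[idx[0]: idx[-1] + 2], values[idx]

edges = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
counts = np.array([5, 8, 3, 1])

# The request starts inside the first bin, so that bin is discarded entirely.
sub_edges, sub_counts = slice_contained(edges, counts, 0.5, 3.0)
```

Here the bins [1, 2] and [2, 3] are kept, so the returned range is narrower than the requested one.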

What is less obvious is what scalar indexing represents (e.g. hist[0.5]). I can think of two behaviors:

Return a histogram with a single bar showing the frequency of the bin around that particular point.
Return the frequency of the bin containing the indexed value as a scalar literal.

I think the second behavior is the most logical and consistent. Anyway, if I don't hear any objections in the next day, I'll go ahead and implement these semantics!
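
As a rough sketch of the second behavior, scalar indexing amounts to a bin lookup. The helper name `bin_value`, the choice of `IndexError` for out-of-range values, and the half-open bin convention (every bin is [lower, upper) except the last, which also includes its upper edge, as in `np.histogram`) are assumptions for illustration only.

```python
import numpy as np

def bin_value(edges, values, x):
    """Return the value of the bin containing x (hypothetical helper)."""
    edges = np.asarray(edges, dtype=float)
    if not (edges[0] <= x <= edges[-1]):
        raise IndexError(f"{x} lies outside the histogram range")
    # Index of the last edge that is <= x, i.e. the lower edge of the bin.
    i = np.searchsorted(edges, x, side="right") - 1
    # x equal to the final edge is assigned to the last bin.
    i = min(i, len(values) - 1)
    return values[i]

edges = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
counts = np.array([5, 8, 3, 1])

inside = bin_value(edges, counts, 0.5)    # falls in the bin [0, 1)
on_edge = bin_value(edges, counts, 2.0)   # an interior edge belongs to the bin on its right
at_end = bin_value(edges, counts, 4.0)    # the final edge belongs to the last bin
```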

jbednar commented 9 years ago

I think it would be better for slicing to select all bins whose centers are within the specified range, both for consistency with how slicing works for a SheetCoordinateSystem, and to avoid systematic bias. If you always take a range up to or smaller than the specified range, on average the range will be smaller than the specified range, which doesn't seem like good practice to me. If you always take the bins whose centers are included, it's well defined, matches SCS behavior, and is unbiased on average.
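
The bias argument can be checked numerically. The sketch below is only an illustration under stated assumptions: unit-width bins, randomly placed slice requests, a half-open [start, stop) test for bin centers, and helper names (`n_contained`, `n_centered`) invented for this example.

```python
import numpy as np

def n_contained(edges, start, stop):
    """Number of bins lying entirely within [start, stop]."""
    return int(np.sum((edges[:-1] >= start) & (edges[1:] <= stop)))

def n_centered(edges, start, stop):
    """Number of bins whose center lies within [start, stop)."""
    centers = (edges[:-1] + edges[1:]) / 2
    return int(np.sum((centers >= start) & (centers < stop)))

rng = np.random.default_rng(0)
edges = np.arange(0.0, 11.0)                  # ten unit-width bins on [0, 10]
starts = rng.uniform(0.0, 4.0, size=10_000)   # random slice start points
widths = rng.uniform(2.0, 5.0, size=10_000)   # random requested widths

# Bins have unit width, so the bin count equals the width actually returned.
contained_bias = np.mean([n_contained(edges, s, s + w) - w
                          for s, w in zip(starts, widths)])
centered_bias = np.mean([n_centered(edges, s, s + w) - w
                         for s, w in zip(starts, widths)])
```

In this setup `contained_bias` should come out clearly negative (on average about half a bin is lost at each end of the request), while `centered_bias` should be close to zero.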

I agree that returning the frequency of the bin in which this element is contained, as a scalar literal, is better than returning a histogram. You might want to think about normalization, though -- if you return the frequency (count) of the bin, you'll get different values depending on the number of bins (e.g. doubling the bin number will on average halve every bin height). I wonder if it is possible to return some value that's independent of the number of bins, i.e. treating the histogram as a discrete approximation to a probability density function, and then returning an estimated value of that function at the given point (i.e., frequency divided by bin width?), rather than a count or frequency. Or maybe that's a separate method to support?
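
A small experiment illustrates the dependence on the number of bins. It uses `np.histogram`, whose `density=True` option divides each count by the total count and by the bin width, which is slightly more than the "frequency divided by bin width" idea above. The helper `value_at`, the sample data, and the bin counts are all assumptions chosen for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=10_000)

def value_at(samples, bins, x, density):
    """Height of the bin containing x for a given number of bins (hypothetical helper)."""
    heights, edges = np.histogram(samples, bins=bins, range=(-4, 4), density=density)
    i = min(np.searchsorted(edges, x, side="right") - 1, len(heights) - 1)
    return heights[i]

# Raw counts at x = 0.1 roughly halve when the number of bins doubles...
coarse_count = value_at(samples, 16, 0.1, density=False)
fine_count = value_at(samples, 32, 0.1, density=False)

# ...whereas the density estimate at the same point stays about the same.
coarse_density = value_at(samples, 16, 0.1, density=True)
fine_density = value_at(samples, 32, 0.1, density=True)
```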

jlstevens commented 9 years ago

> I think it would be better for slicing to select all bins whose centers are within the specified range

That does seem like another reasonable policy. That said, I am not sure the returned slice should ever be bigger than the requested slice.

As for your second suggestion, I am not sure I agree either: HoloViews shouldn't be doing that kind of inference as it is quite tricky! What it can do is return portions of the data that were explicitly supplied to the element. What you suggest sounds like it could be useful but I think it would be more suitable for an explicit probability density element.

jbednar commented 9 years ago

Well, the returned slice of an Image is indeed sometimes bigger (in continuous space) than the requested one, for the same reason; I don't see why the logic would be any different here. See http://ioam.github.io/holoviews/Tutorials/Continuous_Coordinates#Slicing-in-2D. I strongly believe that the very justifiable "best approximation" principle trumps a hypothetical "always smaller" principle. The idea is just that the discretization should never introduce any systematic bias, only the essential grid-related bias due to the discretization itself, and if you always take a smaller range it's systematically rather than randomly the wrong width.

For the second one, yes, it does depend on what level of abstraction you mean the HoloViews class to be, which is up to us to decide. A histogram is already more specialized than a bar chart, for instance, allowing certain assumptions to be made about the data (e.g. that it's continuous, ordered, etc.). Yes, it might be good to have an additional class that adds an assumption that this histogram is meant to approximate a PDF, but a method seems simpler.

jlstevens commented 9 years ago

I am fine implementing the slicing semantics based around the bin centers: fundamentally, every Histogram is defined by some finite set of bin edges so you will always need some sort of policy as to whether a bin edge is to be included in the slice. Selecting whether to include a bin by whether the bin center is included in the slice does seem like a reasonable policy to me.
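
For reference, a bin-center policy could look roughly like the sketch below. The name `slice_by_center` and the half-open [start, stop) test on the centers are assumptions for this example; it is not the code that ended up in HoloViews.

```python
import numpy as np

def slice_by_center(edges, values, start, stop):
    """Keep the bins whose centers fall within [start, stop) (hypothetical helper)."""
    edges = np.asarray(edges, dtype=float)
    values = np.asarray(values)
    centers = (edges[:-1] + edges[1:]) / 2
    idx = np.flatnonzero((centers >= start) & (centers < stop))
    if idx.size == 0:
        return edges[:0], values[:0]
    # Whole bins are returned, so the outer edges may extend past the request.
    return edges[idx[0]: idx[-1] + 2], values[idx]

edges = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
counts = np.array([5, 8, 3, 1])

# Centers 0.5, 1.5 and 2.5 lie inside [0.4, 2.6), so three whole bins are kept.
sub_edges, sub_counts = slice_by_center(edges, counts, 0.4, 2.6)
```

The returned edges span [0, 3], which is wider than the requested [0.4, 2.6): under this policy the result can be larger or smaller than the request, depending on where the centers fall.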

I think your suggestion to treat a histogram as a PDF is certainly worth noting as a separate feature request but for now, I think indexing with a scalar to get the corresponding frequency value is the correct semantics for __getitem__.

jbednar commented 9 years ago

Sounds good. When you use the scalar indexing in practice, be sure to think about whether using it as a PDF is what you really mean to do, and then if so implement it for a PDF at that time, rather than using the problematic width-dependent raw counts and then fixing them later.

philippjfr commented 9 years ago

Our Histogram element doesn't know or care whether the frequencies are normed. We simply accept edges and frequencies (by default), and our convenience method .hist also returns absolute frequency counts. So I don't think there is an issue returning the value of the bin, since by default they are not PDFs.

I also agree that selecting by the bin center when slicing is the most reasonable approach.

jlstevens commented 9 years ago

These slicing and indexing semantics are now implemented in 1f913192, and 14 unit tests have been added in 10ab39e39. They are passing (Travis should go green now unless the notebook tests mysteriously fail).

Looks like I can close this issue now!


holoviz / holoviews

Histogram slicing is not implemented #46