holoviz / holoviews

With Holoviews, your data visualizes itself.
https://holoviews.org
BSD 3-Clause "New" or "Revised" License

Collapsing an NdOverlay or HoloMap into a stream plot or error bars #103

Closed jbednar closed 8 years ago

jbednar commented 9 years ago

The recent addition of error bars is great, but they currently have to be constructed explicitly, which can be error prone (no pun intended). Since we encourage people to collect stacks of data into containers like NdOverlays and HoloMaps, it would be great if we could very easily tell HoloViews to create a summary plot by collapsing those data structures along a certain dimension by calculating the mean (or median?) value along with error bars (or stream widths, once stream plots are supported). This should be a nice way to move back and forth between the full data and reduced versions thereof, so that people can get a nice intuition about how well the summary statistics represent the underlying data.

philippjfr commented 9 years ago

This is mostly already possible although there's some awkwardness about constructing the ErrorBars as you point out. I propose the following should be possible:

import numpy as np
import holoviews as hv
from itertools import product

length = 10
x = np.linspace(0, 1, length)
a = 10
b = 5

# A HoloMap of noisy Curves keyed by the two dimensions 'a' and 'b'
hmap = hv.HoloMap({(a, b): hv.Curve(zip(x, 0.01 * np.random.randn(length) + 1))
                   for a, b in product(range(a), range(b))}, key_dimensions=['a', 'b'])

# Proposed syntax: supply two functions and an Element type to wrap the collapsed data in
hmap.collapse(dimensions=['a'], function=[np.mean]) *\
hmap.collapse(dimensions=['a'], function=[np.mean, np.std], collapse_type=hv.ErrorBars)


Here we've created a HoloMap with Dimensions a and b containing Curves with 10 samples between 0 and 1. We can already use the collapse method to collapse the objects along either or both dimensions with a supplied function. I've extended this here to allow you to supply two functions and a new type to wrap the collapsed data in.

I did suggest to @jlstevens that we could make a change to the Curve, Scatter and Bars Elements to allow them to accept the error as a second value dimension. This is particularly important for Bars because it would be an absolute nightmare to position errors on stacked bars correctly. It would also draw a clear distinction between Scatter and Points. Scatter, along with Curve and Bars, is for data with consistent sampling, for which we can define an error. Points and Contours, on the other hand, are just points positioned in a 2D coordinate system that cannot be averaged; they can only be binned. We already provide 1D rebinning so we should consider also adding .hist2d to provide 2D binning. I think overall this will go some way to make the role of particular Elements more clear, which also means the operations you can perform on the data are more well defined.

The error bars can very easily be drawn by the CurvePlot, PointPlot and BarsPlot because their respective matplotlib calls all accept a yerr argument. We would however allow only minimal customization of the error bars drawn by these plot types. If you need fancy error bars you'll need to go back to the regular ErrorBars object. However to get the quick overview you requested, the example from above would reduce to:

hmap.collapse(dimensions=['a', 'b'], function=[np.mean, np.std])

As an aside: It's unclear how the proposed Spread element would fit into this scheme unless it stores the data in the same way as ErrorBars does, i.e. as an Nx4 array of x-values, y-values (the mean), negative yerror, positive yerror (can also be supplied as Nx3 with symmetric error).
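
For concreteness, here is a minimal sketch of that layout; building the array is plain NumPy, and the commented ErrorBars/Spread calls are only illustrative of the proposed scheme:

import numpy as np

xs = np.linspace(0, 1, 10)
means = np.sin(xs)
err = np.full(10, 0.1)

# Nx4 layout described above: x, y (the mean), negative error, positive error
asym = np.column_stack([xs, means, err, 2 * err])
# Nx3 variant with a single symmetric error column
sym = np.column_stack([xs, means, err])
# e.g. hv.ErrorBars(asym) or hv.Spread(sym) under the scheme discussed here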

jbednar commented 9 years ago

That plan for the proposed Spread element sounds fine to me; it seems like a reasonable set of arguments. Having the basic matplotlib-based error bars come out easily while customizable ones require ErrorBars seems reasonable as well, as does allowing those other types to accept error as a second dimension (though that would be something that only matters behind the scenes, for the operation I'm proposing, since I'm hoping the error-bar handling will not need to be done explicitly).

It's hard to appreciate what the results from the proposed collapse operation would be, since for the example above the error bars seem to go off the plot (?). The proposed syntax hmap.collapse(dimensions=['a', 'b'], function=[np.mean, np.std]) (or maybe better [np.mean, scipy.stats.sem]?) seems compact and reasonably clear. I'm not sure what the 'b' dimension means here -- don't we only need to specify which dimension to collapse ('a'), even if we collapse it twice, once to get the mean and once to get the stddev? I must be interpreting that incorrectly...

jbednar commented 9 years ago

Oh, and yes, your comments about Points vs. Scatter make sense; collapsing Scatter across the Y dimension is well defined, but Points would have to be binned, whether in 1D or 2D.

philippjfr commented 9 years ago

I'm not sure what the 'b' dimension means here -- don't we only need to specify which dimension to collapse ('a'), even if we collapse it twice, once to get the mean and once to get the stddev? I must be interpreting that incorrectly...

'a' and 'b' here should be interpreted as two independent variables which we're varying to get observations of the y-values. For a concrete example imagine a bunch of size tuning curves and 'a' is the number of the neuron and 'b' the contrast. Collapsing over 'a' and 'b' means to apply the supplied function across all curves, reducing them to a single curve. Collapsing over just 'a' (or neuron #) would give you the average size tuning curves across the population for each contrast, while collapsing over just 'b' (or contrast) would give you a size tuning curve for each neuron averaged over contrasts.
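
For illustration, a minimal sketch of that setup using the collapse syntax discussed in this thread (dimension names and random data are purely illustrative):

import numpy as np
import holoviews as hv

sizes = np.linspace(0, 1, 10)
# Size tuning curves keyed by neuron number and stimulus contrast (random data)
hmap = hv.HoloMap({(neuron, contrast): hv.Curve((sizes, np.random.rand(10)))
                   for neuron in range(5) for contrast in (10, 50, 100)},
                  kdims=['Neuron #', 'Contrast'])

hmap.collapse(['Neuron #'], np.mean)              # population average per contrast
hmap.collapse(['Contrast'], np.mean)              # per-neuron average over contrasts
hmap.collapse(['Neuron #', 'Contrast'], np.mean)  # single overall average curve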

Edit:

It's hard to appreciate what the results from the proposed collapse operation would be, since for the example above the error bars seem to go off the plot (?).

Yeah, ErrorPlot.get_extents needs to be improved, and we really should add a padding plot option so the plot's axes don't snap so tightly to the data.

philippjfr commented 9 years ago

And yes scipy.stats.sem seems to work correctly:

import scipy.stats

hmap.collapse(dimensions=['a', 'b'], function=[np.mean]) *\
hmap.collapse(dimensions=['a', 'b'], function=[np.mean, scipy.stats.sem], collapse_type=hv.ErrorBars)


jbednar commented 9 years ago

In your size-tuning example, collapsing over 'a' seems like the one operation that really makes sense. Collapsing over contrast in that case would be perverse, since the behavior with contrast is not expected to be distributed even approximately normally about the mean, but I suppose one could do it (and quickly decide not to do that again! :-). And if collapsing across 'b' is misleading, then collapsing across both 'a' and 'b' in this example would be equally bad.

Such double collapsing also doesn't seem appropriate if 'a' and 'b' are coordinates in some uniform space, either, because then we'd be dealing with a Points-type plot where the coordinate (a,b) is together uncertain, which would need spatial binning rather than separately collapsing the dimensions. So I'm having trouble thinking of an example where 'a' and 'b' are both truly good candidates for such collapsing, but I assume that there must be some such example (leading to both horizontal and vertical error bars?).

In any case, it sounds like my intuition was correct that we don't need to specify two dimensions if we were simply collapsing a HoloMap of Curves down to a single Curve with error bars. For an hmap that's a stack of such curves, presumably hmap.collapse(dimensions=['a'], function=[np.mean, np.std]) would work? In which case maybe the dimensions can be omitted, as there's only one, and so would hmap.collapse(function=[np.mean, np.std]) work? If so that seems very clean and nicely defined: if you have a HoloMap of Curves, and you want to see a summary of it, just collapse it with the specified functions and you'll get a Curve + error bars.

Our tutorial documentation of such an operation should presumably have examples (even if only pointers to scipy.stats) of having error bars being stddev, sem, and 95% confidence intervals, since so many people will want to do one of those three things for published plots.
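
As a sketch of those three choices: np.std and scipy.stats.sem are the functions already used above, while the ci95 helper below is hypothetical (it is not a HoloViews or SciPy API):

import numpy as np
import scipy.stats

def ci95(values, axis=0):
    # Half-width of a normal-approximation 95% confidence interval
    return 1.96 * scipy.stats.sem(values, axis=axis)

# Each could be supplied as the error function in the proposed syntax, e.g.
# function=[np.mean, np.std], [np.mean, scipy.stats.sem] or [np.mean, ci95]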

jbednar commented 9 years ago

I guess I'm still confused -- variance in both 'a' and 'b' shouldn't lead to horizontal and vertical error bars, because in both cases it's uncertainty for the 'y' dimension (neural response, in your example). So I'm not sure how those two types of variance are being combined in your ['a','b'] example. Seems like they should result in two different sets of vertical error bars (in different colors, perhaps?), not a single set of bars as shown. Actually, once one dimension is collapsed to something with error bars, won't collapsing the remaining one require error bars on the first set of error bars, indicating how much variation there is in the error from 'a' as 'b' varies? This is hurting my head!

jlstevens commented 9 years ago

When I started reading this, I thought using collapse this way might be a good idea, but now I am not so sure. More on that later on. What I do agree on is that ErrorBars need to be easier to use.

Firstly, I want to discuss the idea of a second (third? fourth?) value dimension of Chart elements that makes use of the yerr option in matplotlib. This is important to this issue as the functionality we support in chart elements will determine how we map to them from a holomap/ndoverlay.

Anyway, I was almost convinced this was a passable idea, but then I asked myself the following questions. The more of them I answer 'no' to, the unhappier I am with the idea.

Always mapping to error bars no longer seems a good idea to me even if it is one fairly common case. Instead, this makes me think of the extra value dimensions we support on Points (mapping one or both of them to size/colour). Maybe error bars could be a default option but not the only option.

So is what we are looking for really a system of mapping auxiliary value dimensions onto different display variables: color, size or error bars? We have something like this for Points already but this may not be the best system.

The more I think about this, the more I am against supporting yerr at all because I think it is starting to break the compositional design of HoloViews. Here is my reasoning:

In short, I think yerr is only acceptable if you have uncertainty that is intrinsically associated with each point (i.e. collected at the same time). If you want error bars that are computed from an analysis, this becomes a terrible idea - having an overlay element keeps the analysis separate from the input data instead of mixing it all up. In other words, yerr is acceptable in some situations yet very inappropriate in others. If we can make ErrorBars easier to use, do we really want two separate systems to display them?

Anyway, I've not talked about collapse yet, but my objection is essentially the same - in any real situation (as Jim has shown by getting confused!) your error bars must be thought out properly. First you need to think about your data and what it represents; secondly, you'll often (though not always) have some complex analysis which won't be as simple as np.std or scipy.stats.sem. Even then, it isn't clear that yerr is the only way of displaying such data.

I have some vague ideas about how we could improve the usability of error bars without tying ourselves in knots but I've said enough for now!

jlstevens commented 9 years ago

I should probably give you my current, recommended answer to this issue so we can also discuss it!

My solution would be simply to write an operation that takes in HoloMaps or NdOverlays and outputs the appropriate ErrorBars overlaid on top of the data. The operation would handle positioning the bars correctly on top of the data.

Failing that, I can imagine a classmethod on NdOverlay that outputs an overlay including the ErrorBars. For a HoloMap, I feel the correct transformation (semantically at least!) is HoloMap --> NdOverlay --> Overlay (last step using the proposed classmethod). I believe you can construct an NdOverlay straight from your holomap, so I can imagine something along these lines:

hmap * NdOverlay(hmap).errorbars(hmap)

jbednar commented 9 years ago

That discussion sounds reasonable in general, but my understanding of having ErrorBars be separate appears to be directly opposed to what you are saying. I'm arguing that it's dangerous to have some separate set of error bars divorced from the (averaged) data, because it's very easy to get confused and have those values not actually correspond to the (averaged) data. Computing both the average and the error bars automatically from a set of Curves (e.g.) was supposed to make this process much more foolproof -- the mapping from the HoloMap to a single averaged Curve with error bars is meant to be transparent and very clearly specified, as opposed to the user separately building up the averaged Curve and a separate ErrorBars element and then combining them. Such combinations are dangerous because any curve can be combined with any other curve or set of ErrorBars, so there is no tight linkage between the averaged curve and the associated error bars. What I'm proposing is that there be such a linkage, very clearly deriving (in one step) a combined Curve+errorbars object out of a HoloMap of Curves, greatly reducing the chance of errors and making it completely clear what's being visualized and summarized.

Regarding your specific recommended answer, I don't see how that would solve the issue, because the error bars don't need to be added to the HoloMap, they need to be added to the averaged curve. I.e., the HoloMap gets reduced to (i.e., generates) a Curve+errorbars; it does not get annotated with error bars itself. But maybe I am misreading your proposal?

jlstevens commented 9 years ago

I'm arguing that it's dangerous to have some separate set of error bars divorced from the data, because it's very easy to get confused and have those values not actually correspond to the data.

Having a mixture of raw data and analysis in the same numpy array is a far worse idea. Typically, losing your raw data is worse than losing your analysis! The two things must be kept separate!

Computing them automatically from a set of Curves (e.g.) was supposed to make this process much more foolproof

I am in favour of an easier way to generate appropriate ErrorBars. I am not in favour of trying to make HoloViews foolproof - it is impossible to make any software foolproof! Instead, I want to make it easy for the user to do whatever they intend to do in a flexible, compositional way. If the user wants to do something stupid, they will!

What matters is having clear, compositional semantics so the user can easily spot if they have made a mistake. I am convinced that NdOverlays and ErrorBars are naturally related which is why I think NdOverlay is the correct place to easily generate mean curves, error bars etc.

Anyway, I agree my example was wrong for the issue you stated but that illustrates why trying to be foolproof isn't going to work! You wanted to completely collapse the holomap but my example of overlaying error bars on a holomap is also a valid thing to do. For what you are suggesting, it should be something like this:

curves = NdOverlay(hmap)
curves.collapse(np.mean) * curves.errorbars()

Edit: If you want to build compositional structures in a fixed way from some inputs, then that is exactly what operations are for.

jbednar commented 9 years ago

I'm about to sign off, but I don't see how I was proposing mixing raw data and analysis. I was proposing starting with raw data contained in a HoloMap (e.g. a HoloMap of Curves), and generating (in a single step) a completely separate object with an averaged Curve (not raw data) and error bars (not raw data) tightly and jointly combined (with each other, not with the raw data) to avoid ambiguity. The error bars are error bars about the mean; without the mean they don't mean anything (no pun intended). Yes, one could construct those two objects separately, and I have no objection to someone doing so if they wished, though it's hard for me to see the use of error bars on their own (e.g. overlaying them on a HoloMap seems strange to do if you aren't also overlaying the mean Curve). In any case, I strongly argue that Curve+errorbars (or Curve+stream, ideally) is such a meaningful unit that it should be obvious how to get it, view it, and use it as a unit. Isn't that combination nearly always what people use in papers and even just think about when reasoning about results, even if they are lazy and have left off the error bars in particular cases (which I'd hope to discourage by this proposal, by making it as easy to get error bars as to avoid them)?

jlstevens commented 9 years ago

I'm about to sign off, but I don't see how I was proposing mixing raw data and analysis.

This is referring to @philippjfr's suggestion of extra value dimensions in chart elements.

What you are talking about sounds like an operation to me, and very similar to one that already exists (operation.collapse_curve). If I had had HoloViews when writing the GCAL paper, this is exactly what I would have done - written a MapOperation to process the curves in an NdOverlay, returning a mean curve overlaid by a stream/error bars.

philippjfr commented 9 years ago

In your size-tuning example, collapsing over 'a' seems like the one operation that really makes sense. Collapsing over contrast in that case would be perverse, since the behavior with contrast is not expected to be distributed even approximately normally about the mean, but I suppose one could do it (and quickly decide not to do that again! :-). And if collapsing across 'b' is misleading, then collapsing across both 'a' and 'b' in this example would be equally bad.

True, my example wasn't a good one; in most situations you would only ever collapse one dimension.

I guess I'm still confused -- variance in both 'a' and 'b' shouldn't lead to horizontal and vertical error bars, because in both cases it's uncertainty for the 'y' dimension (neural response, in your example). So I'm not sure how those two types of variance are being combined in your ['a','b'] example. Seems like they should result in two different sets of vertical error bars (in different colors, perhaps?), not a single set of bars as shown. Actually, once one dimension is collapsed to something with error bars, won't collapsing the remaining one require error bars on the first set of error bars, indicating how much variation there is in the error from 'a' as 'b' varies? This is hurting my head!

No, that would be the case if you did two consecutive collapse operations. If both Dimensions are collapsed at once that means you're simply applying the function over everything. A better example would be GDP Curves by year, indexed with Dimensions 'Continent', 'Country'. Here we could find the mean GDP curve per continent by collapsing 'Country' or collapse both 'Continent' and 'Country' to get the mean overall GDP curve. 'Country' and 'Continent' are obviously linked here, which is probably when this operation makes most sense. This comes down to the fact that taking the mean of a mean is weird so multiple collapses are only useful if you have some raw data, which is grouped with some labels. You can then use it to query averages in various subgroups by collapsing the other dimensions.
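
A minimal sketch of that GDP example with the collapse syntax from this thread (continents, countries and values are purely illustrative):

import numpy as np
import holoviews as hv

years = np.arange(2000, 2010)
countries = {'Europe': ['France', 'Germany'], 'Asia': ['Japan', 'India']}

# GDP-by-year Curves keyed by the linked dimensions Continent and Country
gdp = hv.HoloMap({(cont, country): hv.Curve((years, np.random.rand(len(years))))
                  for cont, cs in countries.items() for country in cs},
                 kdims=['Continent', 'Country'])

gdp.collapse(['Country'], np.mean)                # mean GDP curve per continent
gdp.collapse(['Continent', 'Country'], np.mean)   # overall mean GDP curve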

Always mapping to error bars no longer seems a good idea to me even if it is one fairly common case. Instead, this makes me think of the extra value dimensions we support on Points (mapping one or both of them to size/colour). Maybe error bars could be a default option but not the only option.

At least for Bars and Curve this is the only sensible option. I'd also argue that having this behavior apply for Scatter but not for Points would be the most useful behavior, as it seems weird to have two Elements which behave in the same way, and it would draw a clear distinction between them. Points can have arbitrary numbers of dimensions because they are simply n-dimensional coordinates projected down to a 2D space; Scatter is regularly sampled data, optionally with confidence intervals.

In the case of Points, I think additional dimensions are acceptable because colour and size are properties of the atomic elements of the visualization (the points). In contrast, error bars are separate and optional elements that are overlaid whether we do the overlay or not. Note: just because matplotlib does something for convenience doesn't necessarily mean it is a good idea and that we should do the same!

This seems to me to be the most important objection to this proposal. I do, however, think the utility of doing this outweighs it, and having two Elements, Curve and ErrorBars, containing the exact same x-/y-coordinates seems excessive for the simple case.

The discussion on analysis and semantics illustrates to me why this is a bad idea. If you have 'raw' uncertainty values associated with each data point then I suppose additional value dimensions may be reasonable. On the other hand, errors are often derived data, which makes it a bad idea (see next point).

I don't think this is accurate; as Jim points out, the y-values of a Curve with specified uncertainty are not in any way raw either, usually they will be the mean of a number of Curves at least.

If your error bars are the output from analysing other data, using errors in this way is not fine! You shouldn't be trying to fiddle your arrays such that you must be careful not to touch columns 1 and 2 (x and y) which are the raw components of your data while mucking around the remaining columns to show analysis! This is why the ErrorBars exist as a separate element for overlay.

Again I disagree, the point is this would avoid having to fiddle your arrays manually and thereby reduce a source of error. Dealing with ErrorBars directly is really annoying because it duplicates the x- and y-values of an array. So you have to construct a Curve and an ErrorBar containing exactly the same x- and y-data. If you're just trying to get a quick overview of your data, collapse can do so quickly and easily. If you need to compute complex stuff like confidence intervals as you point out you should use operations, which will create the ErrorBars or Spread Element for you.

In short, I think yerr is only acceptable if you have uncertainty that is intrinsically associated with each point (i.e. collected at the same time). If you want error bars that are computed from an analysis, this becomes a terrible idea - having an overlay element keeps the analysis separate from the input data instead of mixing it all up. In other words, yerr is acceptable in some situations yet very inappropriate in others. If we can make ErrorBars easier to use, do we really want two separate systems to display them?

Right, simple collapse operations can associate the uncertainty directly with an Element. If you precomputed the uncertainty however, I agree you should declare an explicit ErrorBars or Spread Element. I do think it is worthwhile to commit to using our inbuilt data operations, which currently include .reduce, .sample, .hist and .collapse rather than introducing new methods to do the same thing in a more restricted way. These methods are for common operations you can perform on the data, while operations should be used for anything more complex.

I have some vague ideas about how we could improve the usability of error bars without tying ourselves in knots but I've said enough for now!

Again, I disagree that any of this would tie us in knots; I think it actually makes it more clearly defined what each Element represents.

Failing that, I can imagine a classmethod on NdOverlay that outputs an overlay including the ErrorBars. For a HoloMap, I feel the correct transformation (semantically at least!) is HoloMap --> NdOverlay --> Overlay (last step using the proposed classmethod). I believe you can pass your ndoverlay straight into your holomap, so I can imagine something along these lines:

HoloMap does have the .overlay method, which splits any or all dimensions out into an NdOverlay. The problem with using a method like this on NdOverlay would be that if you have a HoloMap like my example above, indexed by 'Neuron #' and 'Contrast', and you want to overlay just the 'Neuron #' to compute an average size tuning curve of the entire population for each contrast, then your example above would have to look like this:

hmap.overlay('Neuron #').map(lambda x: x.errorbars(), [NdOverlay])

While collapse would just be:

hmap.collapse(['Neuron #'], [np.mean, scipy.stats.sem])

Collapse does the groupby operation and the actual collapse in one go, while a method on NdOverlay would require you to perform the groupby first and then apply the errorbars operation on each NdOverlay separately.

philippjfr commented 9 years ago

I'll expand a little bit about how making the distinction between these Element types can make operations on the data more well defined.

The two types of data include:

And these are the operations we already support or may consider supporting, and how they relate to the two types of data listed above:

jlstevens commented 9 years ago

The length of the discussion here shows how confused everything is getting. I find most of the suggestions proposed quite unreadable; I don't see how these concepts can generalize in a nice way, and they make little or no intuitive sense to me. I am not willing to complicate a nice compositional design because of matters of convenience - things can be improved, and I am certain there is a clean, elegant way of making everyone happy.

The core problems seems to be in the design of the ErrorBars element. The objection is that ErrorBars duplicates/must match information in a Curve (for example). This is a perfectly valid concern which I think points to a problem with the design of ErrorBars - but this shouldn't require us to complicate anything else (unless there is a compelling case that involves supporting entire families of elements of which ErrorBars is one instance).

In short, the problem is one of encoding. If we want the errors to be robust to changing x,y position then the simple solution is not to encode that information in ErrorBars in the first place. I can try and make some suggestions for how ErrorBars could work that would be less problematic (each with pros and cons):

I am not willing to have operations such as collapse become more complicated at this time. Most of these methods need far more documentation as it is (several notebook tutorials worth!).

Although I have a vague idea of what these methods do, I am not happy with the fact that Philipp is the only person on Earth who has any decent insight into what is going on. Until I see a lot more documentation clearly demonstrating that sample, collapse, reduce, collate etc. are sensible, sane things (I am not yet convinced!), I don't want to see these things get more complicated. I only want to discuss these methods after I've seen a lot more documentation and a lot more tests.

My feeling is that we should only consider improving the ErrorBars element in whatever way is necessary to make our lives easier before we consider anything else (other than some simple operations or helper methods on NdOverlay)!

If you keep things simple, you can complicate things as necessary after sufficient consideration. If you rush to complicate things, it is very hard to turn back.

jbednar commented 9 years ago

Philipp, is the term "regularly sampled" actually appropriate here? Images are regularly sampled, on a grid, but don't Curves, Bars, etc. support an arbitrary set of x values? I don't think there's anything regular about such sampling. I can't think of an appropriate name, though.

Jean-Luc, obviously, I don't think we should overcomplicate anything. I think my original proposal remains clear and not complicated: we should easily be able to collapse a HoloMap of Curves (and similar Elements) into a single Curve (or similar Element) with tightly associated error bars, in one step. Whether we do that using an operation or a method is not clear to me, but I don't see anything about the functionality that should require some messy solution.

Although it's an important general principle, I think that Jean-Luc's objection to combining raw and derived data is invalid in every real case I can think of. In my proposed collapsing use case, both the curve and the associated error bars are derived data, as I mentioned. You might also have raw data that has associated error bars, e.g. if you collect from some device that reports both a value and a confidence value, in which case the curve and the associated error bars are both raw data. In both cases, I think there should be a very strong link between the curve and the associated error bars, packaging them together in a way that avoids mismatches. Whether to plot them using matplotlib's yerr features is not clear to me, but the goal of a clear association is clear to me, and I really can't see any danger of mistakenly associating analysis results with raw data. In fact, what I'm proposing should make such bogus association less likely, because people would have to go out of their way to associate some arbitrary error bars with some arbitrary data (which of course they can do with an Overlay if they like and if they know what they are doing). Formally associating data in the software that really is tightly semantically coupled (as raw data with confidence values or mean data with sems both are) helps make it more obvious when someone's trying to do some bogus arbitrary association.

One minor note -- we have to be careful not to assume that the number of error bars is the same as the number of samples in a curve. For a curve with hundreds or thousands of samples, we have to decimate the number of error bars for it to be readable (which is one reason why stream plots are so nice, since they don't have that restriction). So even though the error bars have the same domain as the curve, they probably still have to specify the x positions somewhat redundantly, unlike what you guys seem to be assuming above.

Jean-Luc, I completely agree that we need more documentation of the transformation code. The Transforming Data tutorial would be a first step, and it's way overdue. I'm guessing that the underlying reason Jean-Luc is not comfortable with that code is that it's not been demonstrated and documented even now, despite it being quite crucial for using HoloViews, in my opinion. All of the transformations listed above seem fundamental and extremely useful and in principle well defined, so I think we just need to get them documented and perhaps tidied up so that we can use them in earnest.

Philipp, I now see what you mean by collapsing multiple dimensions at the same time; that indeed does make sense if one is using hierarchical groups like Continent and Country, rather than independent dimensions. That's something that would make a good example at the end of a tutorial, after the much-more-common case of collapsing across a single dimension.

philippjfr commented 9 years ago

I'm working on the Transforming Data tutorial right now so I won't say too much more here. Just replying to two points.

Philipp, is the term "regularly sampled" actually appropriate here?

No, it's not; technically, what I meant when I said "regularly sampled" is that it's data with a defined and consistent sampling or binning, or categorical data.

One minor note -- we have to be careful not to assume that the number of error bars is the same as the number of samples in a curve. For a curve with hundreds or thousands of samples, we have to decimate the number of error bars for it to be readable (which is one reason why stream plots are so nice, since they don't have that restriction). So even though the error bars have the same domain as the curve, they probably still have to specify the x positions somewhat redundantly, unlike what you guys seem to be assuming above.

The subsampling of error bars can happen on the level of the plots, the matching plotting functions have stride arguments for that purpose. That way you don't have to apply some destructive operation on your data to get it to display nicely, data and visual representation are separated.

jbednar commented 9 years ago

The stride arguments sound good; that's the missing piece that makes the discussion of sharing x's make sense.

As for "regularly sampled", I don't think "consistently sampled" makes sense either, since it too suggests regularity. "Data with defined sampling" doesn't convey much either. It's a tough concept to name, because there's not even a finite set of possible x values -- Curves can always add a new x whenever. Tricky!

jlstevens commented 9 years ago

I think my original proposal remains clear and not complicated

Right, I say that (right now at least) it should be done using an operation. Edit: The requested feature isn't complicated, but I wasn't able to wrap my head around Philipp's suggestion involving the collapse method.

I think that Jean-Luc's objection to combining raw and derived data is invalid in every case I can think of ...

If you don't think you would ever want something such as a raw trace surrounded by error bars for (say) the standard error of the mean (computed across a collection of such traces), then I can agree.

Even if I am convinced that this is never desirable (and I withdraw this particular objection as a result), I am not comfortable with having two different ways of doing something very similar 1) the 'fast' and limited way using yerr 2) a separate, more customizable and slightly more involved way using ErrorBars.

Jean-Luc, I completely agree that we need more documentation of the transformation code.

I think it is critical that we document it properly before extending any of this code further. I am fully open to possible extensions, but only once we are all very clear about what is and isn't possible right now (which we will only realize by writing some proper documentation).

In addition to writing an operation (the correct approach in my opinion!) we haven't discussed the fact you can define a compositor/operation pair that automatically visualizes the mean and error for a bunch of overlaid curves. This is one possible use of the compositor system.

Finally, for me the objection that there "should be a very strong link between the curve and the associated error bars" doesn't hold much water. For instance, when I use PinwheelAnalysis to compute pinwheel locations for an orientation map, what is returned to me are some Points overlaying the input preference map (an Image).

Here there are two elements together in an Overlay, and there is no other link between the two (other than the fact that the analysis operation returned the two elements together). I see no difference (in principle) between this and an Overlay of some ErrorBars over a Curve.

Of course, in my pinwheel example you could somehow end up with pinwheels from one map overlaying another map by accident. I want to emphasise that something like this has never happened to me unless I've explicitly wanted something like that to happen (I can imagine overlaying direction map pinwheels on an orientation map for instance).

jbednar commented 9 years ago

If you don't think you would ever want something such as a raw trace surrounded by error bars for (say) the standard error of the mean (computed across a collection of such traces) curves, then I can agree.

It's hard to imagine that being useful. One raw trace is quite arbitrary. A collection of such traces could be useful, though. E.g. if one has a stream plot summarizing 1000 curves, it might be reasonable to overlay 5 or 10 sample curves on top of that to show what individual curves look like compared to the overall trends. But that's the sort of operation that I think people should be doing explicitly, building it up by hand to show specific things, not the sort of automatic operation I'm describing.

In addition to writing an operation (the correct approach in my opinion!) we haven't discussed the fact you can define a compositor/operation pair that automatically visualizes the mean and error for a bunch of overlaid curves. This is one possible use of the compositor system.

I have no opinion between a compositor, an operation, or a method.

Finally, for me the objection that there "should be a very strong link between the curve and the associated error bars" doesn't hold much water. For instance, when I use PinwheelAnalysis to compute pinwheel locations for an orientation map, what is returned to me are some Points overlaying the input preference map (an Image).

Sure -- as I say, I have no opinion between operations, methods, or whether the data is part of an Element. What I mean by a tight linkage is at the user level. User code should just do one invocation to get a combined Curve+errorbars object. Whether that's an Overlay, a Curve that inherently supports error bars, or some other object doesn't much matter to me in the case of collapsing a HoloMap; what's important is that the user is not typically ever forced to explicitly combine an error bar object with a mean object. Users think of the HoloMap as one object, and I want them to think of getting one object comprising the Curve+errorbars out of the HoloMap, as a single operation, so that the two parts never get separated as far as the user normally is concerned. Sure, this object can be an Overlay; fine. Or if Curve supports errorbars internally, fine by me. Just don't ask the user to paste two things together, because then it's a recipe for confusion, complication, and errors.

So for my case of collapsing a HoloMap down to a curve+errorbars I don't have a preference between an Overlay-based and a richer Element approach. Where it would matter is the case of raw data that has associated confidence values. In this case, the Element approach seems like a very clear win. User code can then very easily associate an error bar with each data point as it is collected, rather than pasting a Curve and an ErrorBar Element together. We don't normally in our own work have raw data with inherently associated confidence values, but I do think that they are reasonably common in some fields, and so may be worth supporting well.

jlstevens commented 9 years ago

Ok, I think we mostly agree now!

What I mean by a tight linkage is at the user level.

This is where operations become useful - users should implement operations to avoid repetitive/error-prone code, and we should also supply as many operations as we think will be useful to everybody.

I don't have a preference between an Overlay-based and a richer Element approach.

If we support both, we need to make it clear to the user which approach should be used and when. For instance, I can imagine a user using the richer Element approach, wanting to customize the display more than is feasible via yerr, and then having to switch to the separate ErrorBars approach.

I think this concern could be alleviated if we have a nice interface to easily convert between the two approaches (i.e fold an ErrorBars into a Curve or split one back out again). I think this might make the redundancy of the two systems a bit less worrisome.

philippjfr commented 9 years ago

I think this concern could be alleviated if we have a nice interface to easily convert between the two approaches (i.e fold an ErrorBars into a Curve or split one back out again). I think this might make the redundancy of the two systems a bit less worrisome.

If we stick with the redundant data format for ErrorBars and Spread, i.e. an Nx3 or Nx4 array with x/y-coordinates and either a symmetric error or lower and upper errors, then splitting a Curve Element with values and errors into separate Elements could reduce to:

curve.select(value='y') * hv.ErrorBars(curve)

Which could become an ElementOperation so it can be mapped over a HoloMap of curves.

philippjfr commented 8 years ago

My new suggestion here would be to add an explicit errorfn argument to the collapse, reduce and aggregate methods, which would compute the error for each value dimension and insert the corresponding column into the original object. To be usable it would have to give the columns sensible names though; if we used the numpy/scipy functions' __name__ attributes, we could just join the dimension name with the function name to get something like 'Activity sem' when reducing vdim 'Activity' with the scipy standard error function.
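
A sketch of the naming rule being proposed; error_dim_name is a hypothetical helper for illustration, not part of HoloViews:

import scipy.stats

def error_dim_name(vdim_name, errorfn):
    # Join the value dimension name with the error function's __name__,
    # e.g. ('Activity', scipy.stats.sem) -> 'Activity sem'
    return '%s %s' % (vdim_name, errorfn.__name__)

print(error_dim_name('Activity', scipy.stats.sem))  # 'Activity sem'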

When providing an errorfn the operation could then return an overlay of a Curve and ErrorBars or Spread, pointing to the same columns data, so the data wouldn't actually be redundant.

jbednar commented 8 years ago

I'd avoid the name errorfn, because it sounds like an exception handler. ErrorBars doesn't, because that's a common term and people know what it means, but I don't think people will make that leap for errorfn. errorbarfn is longer and has the word barf in it, so it's not ideal, but even so it's still probably better. :-) Hopefully there's something better than either one.

philippjfr commented 8 years ago

Something better would be nice; errorbarfn is problematic because technically it could also use the Spread element instead, but it's still better than errorfn.

jbednar commented 8 years ago

Maybe variancefn? Informally, that works because this quantity is something to measure how much the signal is varying, whether it's Spread or ErrorBars; the only problem is that it's not necessarily directly related to the statistical definition of variance. I guess spreadfn isn't too bad, since ErrorBars are a measure of spread...

philippjfr commented 8 years ago

Okay, I've now implemented a spreadfn argument to the aggregate, reduce and collapse methods. Here are a few examples of the usage:

import numpy as np
import holoviews as hv

hmap = hv.HoloMap({i: hv.Curve(np.arange(10) * i) for i in range(10)})
collapsed = hmap.collapse(function=np.mean, spreadfn=np.std)
hv.Spread(collapsed) * collapsed + collapsed.table()


img = hv.Image(np.random.rand(10,10))
reduced = img.table().reduce(['x'], np.mean, np.std)
hv.Curve(reduced) * hv.ErrorBars(reduced) + reduced


Seems fairly reasonable to me but I'd be happy to hear feedback.

jbednar commented 8 years ago

Fabulous!

jbednar commented 8 years ago

What data structure is returned by collapse() and reduce()? What does it look like when just typing reduced or collapsed at the command prompt for these examples?

In any case, please get these examples into a tutorial so that we can close this issue. It might also be fun to add an example of higher dimensionality -- e.g. can you collapse a stack of Heatmaps to get an average Heatmap, and then overlay a set of points whose position matches each bin (i.e., a dot at the center of each pixel), whose size conveys the spread of that bin/pixel value from the original Holomap of Heatmaps?

philippjfr commented 8 years ago

This is not merged yet as I'm waiting on feedback from @jlstevens. Generally reduce and aggregate maintain the type of the object you were working with but I'm now considering whether they should always just return a Table. In the case of collapse however it should continue to maintain the type, i.e. collapsing a bunch of curves keeps them as Curves, while reducing or aggregating a curve will give a Table.

philippjfr commented 8 years ago

It might also be fun to add an example of higher dimensionality -- e.g. can you collapse a stack of Heatmaps to get an average Heatmap, and then overlay a set of points whose position matches each bin (i.e., a dot at the center of each pixel), whose size conveys the spread of that bin/pixel value from the original Holomap of Heatmaps?

Sure, should be trivial. Really this belongs in the still non-existent transforming data tutorial or the also incomplete working with columnar data tutorial.

jbednar commented 8 years ago

I'm not sure what it means for the result of collapsing a bunch of Curves to still be "Curves"; shouldn't it be a Curve (singular)? I'm trying to figure out how it's not already an overlay of a Curve and an ErrorBar or Spread object, i.e. where the various bits of data are stored in this compound object.

philippjfr commented 8 years ago

I'm not sure what it means for the result of collapsing a bunch of Curves to still be "Curves"; shouldn't it be a Curve (singular)?

Yes.

I'm trying to figure out how it's not already an overlay of a Curve and an ErrorBar or Spread object, i.e. where the various bits of data are stored in this compound object.

Because then you'd also have to let the user choose the various types. Better to have one operation that computes the mean and error, and then you can choose the types the data should be displayed as. All this does is add additional value dimension(s). You can see that in both examples: it's added a column to the table to represent the standard deviation. By casting this Table to an ErrorBars or Spread type it will automatically assume the dimension is the error and plot it appropriately.

jbednar commented 8 years ago

So Curve just ignores the extra column, so that if you type collapsed above you'll just get the mean curve (since you say the result is a Curve already)? If so, (a) why not just type collapsed above, where you wrote hv.Curve(collapsed)?, and (b) that seems tricky, i.e. that the error bars are being silently ignored. Presumably it makes sense in terms of the overall support for columnar data, though.

philippjfr commented 8 years ago

So Curve just ignores the extra column, so that if you type collapsed above you'll just get the mean curve (since you say the result is a Curve already)?

All Element types support additional value dimensions, unless you tell the plotting code to do something with them they will simply be ignored. Certain plot types allow you to map the additional dimensions onto some property, e.g. color, size of points, and in the case of ErrorBars the second value dimension is automatically interpreted as the spread. The way you should think of it is that the Elements are views into high dimensional spaces mapping dimensions onto particular plot attributes.
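
As a short sketch of the behaviour described, repeating the spreadfn example from above: the extra value dimension is carried along but only drawn by Element types that know what to do with it.

import numpy as np
import holoviews as hv

hmap = hv.HoloMap({i: hv.Curve(np.arange(10) * i) for i in range(10)})
collapsed = hmap.collapse(function=np.mean, spreadfn=np.std)

collapsed                # displays as just the mean Curve; the std column is ignored
hv.Spread(collapsed)     # same columns, extra value dimension drawn as a spread
hv.ErrorBars(collapsed)  # or drawn as error bars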

If so, (a) why not just type collapsed above, where you wrote hv.Curve(collapsed)?

True, I could have done that.

jbednar commented 8 years ago

Ok, that comes back to the idea (discussed above, but not sure whether explicitly rejected) of a Curve type that inherently knows about possible spread information (presumably via some user-definable parameter for choosing the class for the spread type) and plots it if present. Doing that is just a convenience, of course, for this extremely common case of data with error info, but obviously it can already be done compositionally, and doing so is of course more general (if more awkward and verbose). So I'll leave that decision entirely up to your discretion, with the caveat that adding this convenience would have to be done now if at all, since it changes the behavior of typing just collapsed.

In any case, in the current version, I'd strongly argue for just writing collapsed for the examples, to convey clearly that the result of the collapse operation is a Curve from which the error/spread info can be extracted if desired.

jlstevens commented 8 years ago

First I should say I am very happy we waited to act on this! This thread is very long and I found the initial discussion was very confusing and scarily complicated. Now with the new Columns format and the Data API, the new approach is much, much cleaner. I really like the way this has been done!

I think I follow Jim's suggestion - a Curve that shows error bars (for instance) if that data is there. Although that is technically possible, I really like the current approach as it is very compositional. The Data API makes it trivial to visualize all the available columns of data (as any element, including error bars) so I don't see much benefit making Curves behave in a more complicated way.

As for the name spreadfn, I can't think of a better name right now. It isn't ideal but I do prefer it to the other alternatives put forward so far.

In any case, in the current version, I'd strongly argue for just writing collapsed for the examples, to convey clearly that the result of the collapse operation is a Curve from which the error/spread info can be extracted if desired.

I agree.

Anyway, I think it is great that this functionality (which is useful!) can be generalized so nicely to work with the Data API - this way, the functionality is made available at a fundamental level instead of being tacked on in an ad hoc (and ugly) way.

jbednar commented 8 years ago

Sounds great! Let's document it and close the issue.

philippjfr commented 8 years ago

Just pushed the implementation in commit e2729b0fe90dc8dc923c7004e1a095c9aba0b634. Where should I document this? I'm currently putting together a notebook about working with columnar data; it would fit in nicely there.

jbednar commented 8 years ago

Sure, there is fine. And then push that notebook to the website, without waiting for it to be polished -- don't let it sit around like the mythical Transforming Data tutorial! :-) If nothing else, once it exists, I can polish it, but until then I can't even discover what the syntax is. If necessary, it could be left untested while you are working on it, but something is clearly better than nothing, both for columnar data and for transforming data.

philippjfr commented 8 years ago

Ok, that comes back to the idea (discussed above, but not sure whether explicitly rejected) of a Curve type that inherently knows about possible spread information (presumably via some user-definable parameter for choosing the class for the spread type) and plots it if present.

I don't think we'll go down this route, at least for Curves. However, for more complex types such as Bars I don't see any way around it; there's no easy way to overlay ErrorBars on categorical axes for now, although in theory this should be implemented in the future (it should be easy in bokeh).