holoviz / holoviews

With Holoviews, your data visualizes itself.
https://holoviews.org

ECDF as an Element? #3821

Open justinbois opened 5 years ago

justinbois commented 5 years ago

I was thinking of putting together a PR for this, but wanted to see what people thought before investing the effort. I think empirical cumulative distribution functions (ECDFs) are very important plots for visualizing how data are distributed. I argue that because there is no choice of binning or bandwidth, and all data are plotted, they are even better than histograms (hv.Histogram) and KDEs (hv.Distribution). Below is a comparison of a histogram and an ECDF.

I would like to make a new plotting element, hv.ECDF. Are the core devs open to this idea?

[images: the histogram and the ECDF being compared]
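For concreteness, here is a minimal sketch of building the staircase from existing pieces (assuming NumPy data; interpolation='steps-post' is an existing Curve plot option):

import numpy as np
import holoviews as hv

hv.extension('bokeh')

data = np.random.randn(100)  # stand-in for measured values

# The ECDF at x is the fraction of observations <= x, so after sorting
# the y-values are simply i/n for i = 1..n.
x = np.sort(data)
y = np.arange(1, len(x) + 1) / len(x)

# Render as a staircase using Curve's existing interpolation option
hv.Curve((x, y), 'value', 'ECDF').opts(interpolation='steps-post')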

jbednar commented 5 years ago

Sounds good to me!

As a minor note, the example looks cut off near zero on the y axis.

Is there a more descriptive name for it? The only acronyms used as element names so far are RGB and HSV, which I think are much more widely recognized than ECDF as an acronym (which to me means our Edinburgh supercomputers!). CDF seems easier to recognize; does the fact that it's empirical really need to be emphasized? Or maybe CumulativeDistribution?

justinbois commented 5 years ago

I'm glad you like the idea!

I prefer ECDF; it's pretty well known. The first page of hits on Google is all for the right thing, with the top hit being the Wikipedia page and the second being the R function ecdf.

I don't think we should use CDF, because the ECDF is a very special CDF; one for an empirical distribution. CumulativeDistribution might work, but I worry about misinterpretations with Distribution, which uses a KDE. CumulativeDistribution might then be interpreted to mean the KDE plotted as a CDF. I guess EmpiricalCumulativeDistributionFunction is too verbose, but if verbosity is not a real issue, I prefer that to the other alternatives, which can leave open misinterpretation.

The y-cutoff on the plot is because I was panning and zooming before I took the screenshot. Here's a PNG of the pre-panned ECDF (how it should look).

[image: the pre-panned ECDF]

jbednar commented 5 years ago

I had all those thoughts too, but was hoping you'd disagree. :-) I can't think of any other name, then.

poplarShift commented 5 years ago

Just as a user, I think I'd almost prefer EmpiricalCumulativeDistribution (over ECDF, that is), or EmpiricalCumulativeDistributionFunction, just for code clarity and to avoid acronyms (which may be well known in the stats-minded community but I'm not sure about elsewhere). Verbosity is not an issue I would say (you'd only type Empi+tab anyway).

jbednar commented 5 years ago

Up to a point, verbosity isn't an issue, but I do think that the full name EmpiricalCumulativeDistributionFunction will cause us formatting problems in tables, lists, etc. where we enumerate Elements. I think the name should be something in the range of lengths of existing elements.

poplarShift commented 5 years ago

Good point.

On the other hand, I don't really see the point in specifying Empirical for this Element anyway. To me the main point is that it is a cumulative distribution, and whether it is based on the empirical distribution or a KDE should live in a plot option.

Putting options like these, which have nothing to do with the grammar of the underlying visualization, into the name of the element would be inconsistent with HoloViews' grammar and API, in my mind anyway.

In that sense, one might almost just enable it as an option of CumulativeDistribution, unless I'm missing something?

Edit: Just an idea, one might even consider a plot option cumulative=True/False on Distribution, but I haven't thought that one through.

philippjfr commented 5 years ago

Just an idea, one might even consider a plot option cumulative=True/False on Distribution, but I haven't thought that one through.

I think this is the right approach as well. I'd suggest implementing this either as a new operation or as a parameter on the existing univariate_kde operation (if that makes sense) and then expose it as a plot option on hv.Distribution.
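To illustrate what that option would compute (a sketch only; cumulative is not an existing univariate_kde parameter), the cumulative curve is just the running integral of the KDE:

import numpy as np
import holoviews as hv
from scipy.stats import gaussian_kde

hv.extension('bokeh')

data = np.random.randn(200)
xs = np.linspace(data.min(), data.max(), 200)
pdf = gaussian_kde(data)(xs)

# Running trapezoidal integral of the KDE: what a hypothetical
# cumulative=True option on Distribution would presumably draw
cdf = np.concatenate([[0], np.cumsum((pdf[1:] + pdf[:-1]) / 2 * np.diff(xs))])

hv.Curve((xs, cdf), 'value', 'cumulative density')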

justinbois commented 5 years ago

I think it is important to draw some distinction between Elements that display all data points and those that do not. Here are some examples with code and rendered elements.

Examples that do not plot all of the data:

import pandas as pd
import holoviews as hv
from bokeh.sampledata import autompg

hv.extension('bokeh')

df = autompg.autompg_clean

# BoxWhisker shows only summary statistics of mpg per origin
bw = hv.BoxWhisker(df, kdims=["origin"], vdims=["mpg"])

# One KDE per origin, overlaid: a smoothed estimate, not the data itself
dist = hv.NdOverlay(
    {origin: hv.Distribution(group, kdims=["mpg"])
         for origin, group in df.groupby("origin")}
)

bw + dist
[screenshot: BoxWhisker and overlaid Distribution plots]

Examples that plot all of the data:

# Jittered scatter: every individual measurement is drawn
scatter = hv.Scatter(df, kdims=["origin"], vdims=["mpg"]).opts(jitter=0.3)

# Spikes: one tick per measurement, stacked by origin
yticks = [(i + 0.25, origin) for i, origin in enumerate(df["origin"].unique())]
spikes = hv.NdOverlay(
    {
        origin: hv.Spikes(group["mpg"]).opts(position=i)
            for i, (origin, group) in enumerate(df.groupby("origin", sort=False))
    }
).opts(hv.opts.Spikes(spike_length=0.5, yticks=yticks, show_legend=False, alpha=0.3))

scatter + spikes
[screenshot: jittered Scatter and Spikes plots]

Difference between ECDFs and Distributions

The ECDF is decidedly different from an hv.Distribution, which does one thing: compute a KDE and render it. Rather, an ECDF is a plot of all the data. This may be clearer if we plot the ECDF as "dots," as is often done. This carries the same information as the "staircase" representation, with each point located at a concave corner of the staircase.

In this case, each dot is a measured data point. The ECDF can be thought of as a transform, like a jitter or beeswarm transform, that determines the y-position of the glyph (where the x-position is the measured value).

So, the "empirical" part of ECDF is a key feature. It means we are plotting data points directly, with a transform that does not lose any information about the measured data themselves.

A cumulative=True option for hv.Distribution is something different; it means to plot the integral of the PDF returned by the KDE. This is qualitatively different from an ECDF, which is a plot of actual measurements.

philippjfr commented 5 years ago

This is qualitatively different from an ECDF, which is a plot of actual measurements.

That makes sense. I'd probably be okay with adding an ECDF element in that case, I can't see a verbose name that would capture this well otherwise.

poplarShift commented 5 years ago

@justinbois If I'm looking at a distribution I'm looking at a distribution, no matter whether it has all the data points or only some smoothed estimate of them. The information you want to convey to the audience is the same (modulo some of the detail). How the thing that you want to show is computed from the input is just a matter of preprocessing, i.e. implementing an operation of the right name and semantics.

Following your logic, one could also have two different elements for BoxWhisker, one where the medians are computed from the data, and another where the medians are directly supplied to the element (as discussed e.g. here https://github.com/pyviz/holoviews/issues/2183), or two different VectorFields, one accepting angles/magnitudes, the other x/y vector components (as discussed here https://github.com/pyviz/holoviews/issues/3486). I don't think we should have different elements for each kind of input or intermediary processing being done.

Anyway that's just my two cents.

Edit: I understand that people like ECDF because they want to convey detailed information about all the data, outliers etc., that are hard to put into histograms. Still the semantic content is that it is a cumulative distribution.

(And just in case you're wondering, I think it would actually be logical if a hypothetical hv.Distribution(dataset).opts(cumulative=False, kernel='empirical'), or similar, output something like a histogram with the smallest necessary bin width to separate all occurring values.)

justinbois commented 5 years ago

@poplarShift, I see what you're saying, but I'm concerned that taking this approach might make the umbrella of what is a Distribution too wide. By similar arguments, we could say that hv.Bivariate((x, y)).opts(kernel='empirical') should replace hv.Points((x, y)). Or that hv.Distribution and hv.Histogram are the same thing as well; they are both used to visualize PDFs, one using binning and the other a Gaussian KDE. We might then replace hv.Histogram with something like hv.Distribution(data).opts(kernel=<some object that specifies binning>).

Use of kernel='empirical' would result in a different kind of plot for a hv.Distribution than how it is currently defined (as a KDE), since an empirical PDF is a linear combination of delta functions (which we cannot show graphically). A reasonable representation would be a rug plot with some transparency to show repeated values, or hv.Spikes where the height of each spike is given by the number of measurements with the same value. Here, hv.Spikes is already in place; it's something that plots all of the data and is different from hv.Distribution.
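A sketch of that Spikes representation (illustrative data; spike heights come from an explicit value dimension):

import pandas as pd
import holoviews as hv

hv.extension('bokeh')

values = pd.Series([1.0, 2.0, 2.0, 3.5, 3.5, 3.5, 4.0])

# Spike height = number of measurements sharing that value, a graphical
# stand-in for the delta-function "empirical PDF"
counts = values.value_counts().sort_index()
hv.Spikes((counts.index.values, counts.values), 'value', 'count')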

I think @poplarShift's suggestion in #3486 regarding VectorFields is a very good one. But I do not think that is the same issue we are discussing here, since the transformation between angles and magnitudes to x-y velocities is bijective. The actual data cannot be retrieved from a KDE.

To your point about looking at a distribution, whether you are looking at points or a smoothed estimate, I argue that these are actually not the same thing. The information in a KDE and in an ECDF is not the same, just as a regression line gives different information than the data themselves. You can also think of it this way: Is a hv.Points Element looking at a distribution? In some senses, yes, since the data could be sampled out of a (possibly unknown) distribution. But, as I mentioned before, I think it makes sense to have hv.Points and hv.Bivariate be separate Elements. Thinking of a data generating process, there is some underlying generative distribution, so one could argue that any plot of measured data is in fact looking at a distribution, and we could further overload hv.Distribution.

Thanks for the engaging discussion, all. This is the sort of thing that makes open source software healthy.

poplarShift commented 5 years ago

one could argue that any plot of measured data is in fact looking at a distribution and we could further overload hv.Distribution

When I plot a Points element, I don't attach specific information content beyond what the data are. Of course a subset of that information is also the distribution, but there's more: Distances between the points, values attached to each measurement (e.g. colour-coded), etc. - 'simply' the raw data and the fact that they exist. In this sense, Points is semantically overloaded already, because the reason I'm plotting them may be very diverse.

Plotting an ECDF, on the other hand, like other kinds of distributions, serves a more specific purpose: What values did I observe, how many of each, and how do they relate to each other. So of course there's a bijection between an ECDF and a list of values, but I'm showing them in a specific way, namely such that the audience knows about their distribution.

We might then replace hv.Histogram with something like hv.Distribution(data).opts(kernel=).

Well, I would not actually be strictly opposed to that. But of course the notions of histograms and distributions and their delineation have a long tradition. ECDF, I would argue, has less of a general-science tradition attached to it.

A reasonable representation would be a rug plot with some transparency to show repeated values, or hv.Spikes where the height of each spike is given by the number of measurements with the same value.

I agree with that too, and again one would be utilizing a semantically overloaded Spikes object, reducing it to one of its many possible functions.

Thanks for the engaging discussion, all. This is the sort of thing that makes open source software healthy.

I agree, I've learned a lot already!

justinbois commented 5 years ago

I agree that the primary use of an ECDF is to visualize a distribution, as also described in the ggplot2 documentation. Off hand, the only other use I can think of is its complement (ECCDF) being used as a survival curve, which is actually just an ECCDF anyhow; it's just sometimes not called that.

But that is also the primary use of a violin plot (hv.Violin), a KDE (hv.Distribution), and a histogram (hv.Histogram). And also hv.HexTiles and hv.Bivariate. All of these are unique Elements.

So, I think maybe we have narrowed the question down to whether an ECDF should be its own Element or an option on an existing one.

It's important to note that ECDFs are also configurable. You can choose to plot the complement. You can choose to plot them as dots, staircases, or even "formally" (something like this). If you plot them with dots, you can have hover tools to give more information about each point. I think putting this configurability into hv.Distribution would be confusing and cluttered.

Due to some other commitments I will not be able to start working on this for at least a week, so in the meantime, it would be good to work out what we would like for the API for ECDFs. I'm in favor of its own Element (with valid points), and I think @poplarShift disagrees with me (also with valid points). It'd be good to hear other opinions.

justinbois commented 5 years ago

Would @jbednar, @philippjfr, or others like to give their opinions on whether an ECDF should have its own Element? @poplarShift, do you have further thoughts?

jbednar commented 5 years ago

It's tricky, because HoloViews is not currently consistent in this regard. I personally would rather it tend towards becoming consistent, which I believe would require that Elements have a direct mapping between the underlying data and the visual representation, with operations used for any significant transformations (reductions, aggregations, fitting). To do that, several existing Elements would need to be deleted or deprecated, and users would instead be told to use an operation that returns an existing Element type:

Others would require making new Element types (or changing existing ones):

I think this would make HoloViews much easier to reason about and work with by separating the transformation operators from the visible representations, and would make it more flexible by making it feasible to supply data to directly control the visible representations. On the other hand, @jlstevens suggested that @philippjfr and he already discussed this at length, and ended up agreeing that the convenience of being able to control the Element options using the hv options system outweighed trying to make it consistent like this, but I'm not sure I'm convinced about that. In any case, it would represent a good bit of work to make it consistent, which means that even if we did agree to go that direction, it wouldn't happen soon or easily. Still, if we did agree that we'd like to go in that direction, it would tell us what to do with ECDF.

jbednar commented 5 years ago

Note that the above proposal matches how the HoloViews rasterize() operation currently works, which could also be called bin2d_rect, i.e. accepting hv.Points, etc. and returning hv.Image. This is in direct conflict with hv.HexTiles, which also does 2d binning (with a different bin shape) but works entirely differently, as an Element that does the aggregation internally. I don't see any particular reason these two nearly identical types of 2d binning should use entirely different approaches, and inconsistencies like that lead to ambiguity in how new functionality like ECDF should be handled.

philippjfr commented 5 years ago

I'm personally fairly strongly against this proposal. The inconsistency is quite frustrating, but the fact that Histogram actually contains the data representing the visual geometry, and not the raw data, has been a much greater source of frustration and confusion than the fact that other statistical aggregates do the opposite. I do concede that this is mostly a naming problem.

Nevertheless I think the convenience of having statistical aggregations is quite important. The conceptual leap of declaring a Dataset (for 1D data, e.g. histograms, kdes) and Points (for 2D data, e.g. bivariate kdes) and then importing and applying an operation to get the plot you want is a huge hurdle for users.

What I think would help consistency is if each of the statistical operations mapped to one or more specific geometry element types on output. That way we would still have the convenience of the statistical element but also have a clear story about how they map to actual geometries that get drawn on the plot. Indeed, in the background a number of elements are already implemented this way, e.g. Distribution gets converted to an Area/Curve behind the scenes using the univariate_kde operation and Bivariate gets converted to Contours using the bivariate_kde operation. Even HexTiles works this way, albeit in a hacky way, in that the hex_binning operation returns an aggregated HexTiles element with q/r coordinates rather than raw x/y coordinates, which is used exclusively internally in the plotting code. I strongly agree that we should do the same thing for the remaining elements mentioned above.
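For reference, the existing univariate_kde operation can already be applied explicitly; a minimal sketch (its filled parameter, which comes up again later in the thread, controls whether an Area or a Curve comes back):

import numpy as np
import holoviews as hv
from holoviews.operation.stats import univariate_kde

hv.extension('bokeh')

dist = hv.Distribution(np.random.randn(1000))

# The operation makes the Distribution -> geometry mapping explicit
area = univariate_kde(dist)                 # filled=True (default) -> Area
curve = univariate_kde(dist, filled=False)  # -> Curve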

The other reason statistical element types are useful is something we have been discussing recently, which is that they make cross-linking much easier: if you can apply selections to the raw data directly, there's no need for some additional mechanism to implement cross-linking/brushing.

This is part of the reason I've been trying to push for a module of geometry elements which will make it easier to represent the various higher-level plot types better.

jbednar commented 5 years ago

Yes, Jean-Luc warned me that you were the one that argued it should be the way it is, and that he grudgingly went along with it. :-) But it sounds like you are proposing an alternative mechanism for achieving consistency here, with a different outcome than in my proposal but still greatly improved consistency compared to the current approach, by making it clear that the transforms can be done by the element but are all optional and can be invoked explicitly when needed. If that's what you're saying, could you flesh out your proposal, to contrast it more directly with mine (which I'll call the "Elements for data, operations for transforms" approach), saying which Elements and Operations you'd propose to change or add?

I personally would argue that the "huge hurdle for users" is that things are not consistent, and so users can't quickly and intuitively jump to conclusions. When things are clear and consistent (Elements == data, operations == transforms), users can learn this easily from a small number of examples. When some Elements include transforms, it's muddy and much harder to learn; the best guess of a user is quickly invalidated in other examples, and then lots of examples or explanations are needed. Inconsistency is the hurdle here!

philippjfr commented 5 years ago

I mean, inconsistency definitely is a hurdle, but you can't tell me that this isn't another significant hurdle:

from holoviews.operation import univariate_kde
ds = hv.Dataset(df, [], 'value_column')
dist = univariate_kde(ds)

# vs.
dist = hv.Distribution(df, 'value_column')

jbednar commented 5 years ago

It's true that it's been annoying to have to import operators each time (as I very often rasterize and datashade), but that's something that could be addressed the moment we wish to, by promoting operations into the main hv namespace. I'm not arguing for or against doing so, but to make it comparable, assume we did so, and then the difference is:

dist = hv.univariate_kde(hv.Dataset(df, vdim='value_column'))

vs.

dist = hv.Distribution(df, 'value_column')

Moreover, in practice I think it would very often be more like:

dist = hv.univariate_kde(e)

where e is some Element that people already have lying around that they want to see the distribution of. I.e., sure, sometimes people would need to make a Dataset like that, but it seems to me that they often wouldn't need to.

justinbois commented 5 years ago

I think the discussion has gone into some central issues about how HoloViews is organized. I guess adding new functionality and thinking about how to do it leads to these kinds of discussions.

While I'm very interested in the discussion, I think it is mostly in the domain of the core devs (which I am not), so you should take my opinions here with that in mind.

To summarize: On the ECDF issue, we are all agreed HoloViews should offer that capability, but are not clear on how. The how question has raised notions about how Elements and operations are organized.

My $0.02 on the matter is that, as primarily a user (not a developer), HoloViews' structure has helped me think about my plots as is so nicely described at the beginning of the user guide: annotate your data with:

  1. Type of plot
  2. Key and value dimensions
  3. Groups and labels

In practice, the type of plot is set by the Element I choose. As I have argued before, I think of an ECDF as a specific, separate type of plot, so hv.ECDF makes sense for an Element. This would tend to favor more elements versus @jbednar's "Elements for data, operations for transforms".

A tricky point comes when figuring out how fine-grained the element choices are. For example, I think a datashaded scatter plot (made with hv.Points) is not a different type of plot than a non-datashaded scatter plot. But maybe a hexbin is different than a scatter plot.

That said, @jbednar's concept that each Element has a direct mapping to data, and applying an operation on an element gives a new element with possibly new data that maps to it (e.g., as in the box plot example), is very appealing to me. This means fewer Elements, and everything is more direct. It just seems a bit detached from the kind of thinking I outlined above based on the user guide (and I'm fine with that; the direct data-to-visual mapping is conceptually pleasing).

W.r.t. adding ECDFs for the short term, if there might possibly be a major restructuring in the future, perhaps it's best for now to have an ECDF be its own Element, and then we can steal code from that element for whatever it may become?

poplarShift commented 5 years ago

I think it makes sense to have mostly operations (along the lines of what @jbednar wrote), but wrap them into Elements using Compositors (for option styling and convenience, along the lines of what @philippjfr wrote). For now, that wouldn't lead to much rogue code flying around that would need to be moved about later if one decides to go one way or the other (i.e. fewer elements and a more operation-/plot-option-centric design, or more differentiation at the element level).

So, concretely, for the proposed ECDF element: how about an operation ecdf or empirical_cumulative_distribution_function (or whatever people like; as it's not an element, there would probably not be as many problems with verbosity) that returns a Curve element? This can then be folded into an ECDF element using a compositor or, later, if that's where the project ends up going, wrapped into plotting options in some sort of distribution, or just used on its own.
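A minimal sketch of what such an operation could look like, using the existing Operation base class (the ecdf name, the complementary flag, and the output dimensions are placeholders, not an agreed API):

import numpy as np
import param
import holoviews as hv
from holoviews.core.operation import Operation

class ecdf(Operation):
    # Sketch: compute an ECDF Curve from an element's first dimension
    complementary = param.Boolean(default=False, doc="""
        Whether to return the complementary ECDF (ECCDF) instead.""")

    def _process(self, element, key=None):
        vals = np.sort(element.dimension_values(0))
        y = np.arange(1, len(vals) + 1) / len(vals)
        if self.p.complementary:
            y = 1 - y
        return hv.Curve((vals, y), kdims=['value'], vdims=['ECDF'])

# e.g. ecdf(hv.Dataset({'value': np.random.randn(100)}, ['value']))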

jlstevens commented 5 years ago

That way we would still have the convenience of the statistical element but also have a clear story about how they map to actual geometries that get drawn on the plot. Indeed in the background a number of elements are already implemented this way, e.g. Distribution gets converted to an Area/Curve behind the scenes using the univariate_kde operation and Bivariate gets converted to Contours using the bivariate_kde operation.

I think this is a partial incarnation of the 'macro' idea I was pushing for a fair while back: something that looks like an element but is really implemented as an operation mapping to other elements behind the scenes. I don't think things are currently implemented at the same level as I originally envisioned and I still don't have a fleshed out proposal for making this idea concrete.

For what it is worth, I sided with @jbednar when this was originally discussed with @philippjfr, but I now agree more with @philippjfr: the convenience of being element-like instead of operation-like is very high. I also agree that the .data of Histogram should really be the data that is to be binned, not the visual representation.

I'll also note that introducing new elements that are meant to be about visual representation is orthogonal to anything else and can still be done. I can imagine a Hex element that is just about the hexagons of HexTiles, and a generalized version of Bars that could represent the visual data in histograms.

Maybe things like HexTiles and Histogram could offer a way to return the Hex and Bars as the visual representation of what gets rendered to the screen (i.e. the elements that actually get plotted, after statistical computation)?

Edit: I know that generalizing the plotting code for Bars to continuous axes (and supporting a constructor with bin ranges?) would be a huge pain. I'm just assuming something like this could exist for the sake of argument.

jlstevens commented 5 years ago

To summarize the discussion this morning with @jbednar and @philippjfr:

Current situation

Suggested plan.

  1. @jbednar didn't like the idea that elements using operations/compositors should only be associated with statistical elements. For this reason I propose that statistical elements be an instance of a 'derived element' concept.

  2. We can then have a user guide section about derived elements where we can talk about the full set of such elements and make the link to the corresponding operations very clear.

  3. Then there are three more things we want to make this work well: 1) define a DerivedElement base class and corresponding (minimal!) API, 2) add new elements that map to the visual data more directly (these are the elements that are actually displayed after the necessary computation has occurred), and 3) expand the set of derived elements, e.g. Violin/BoxWhisker are not currently implemented with operations/compositor.

  4. The only difference I propose that DerivedElement makes is that it offers a .derive sub-object that holds the set of associated operations used to visualize that element.

  5. There is already discussion above about having elements that more directly map to the screen. For instance, Hex instead of HexTiles (though there is a bunch of renaming I would like to do here), and Bars would need to support categorical/numeric axes to support bar charts/histograms.

  6. It seems that things like BoxWhisker/Violin could be implemented with operations/compositor as long as we have multiple operations on .derive for the various parts of BoxWhisker/Violin. This is why a .derive sub-object is nice: 1) it lets you quickly find the operations associated with a derived element while keeping the name you want for the operation, e.g. univariate_kde, and 2) in the case of the more complex composite derived plots, there can be several operations there (see the sketch after this list).

  7. With all this in place, I think the new user guide could tell a more consistent and compelling story.

Note that by switching plotting class, you could still use some native plotting functionality, e.g. if Plotly has special support for some statistical plot.
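To make the .derive idea concrete, here is a rough, entirely hypothetical sketch (no DerivedElement or .derive exists in HoloViews; only univariate_kde is real):

import numpy as np
import holoviews as hv
from holoviews.operation.stats import univariate_kde

class Derive:
    """Hypothetical .derive sub-object: a namespace holding the
    operations used to turn a derived element into drawable geometry."""

    def __init__(self, element, operations):
        self._element = element
        self._ops = {op.__name__: op for op in operations}

    def __getattr__(self, name):
        # Return the named operation partially applied to the element
        op = self._ops[name]
        return lambda **kwargs: op(self._element, **kwargs)

# Hypothetical usage: a Distribution's .derive could expose its KDE op,
# so that derive.univariate_kde(filled=False) returns the Curve that
# actually gets drawn.
dist = hv.Distribution(np.random.randn(100))
derive = Derive(dist, [univariate_kde])
curve = derive.univariate_kde(filled=False)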

jbednar commented 5 years ago

I agree that the above discussion captures what we discussed. There's one more thing we decided as well, which is that the way forward for the particular case of an ECDF is that the fundamental implementation is in terms of an operation, not an Element. I.e. that what should be submitted to the repository first is an operation, and then once we have that along with examples of it, we can make a compositor and a derived Element as a more convenient way to access that functionality, but that first we just need an operation.

So, if we haven't beaten you down too much already with our hemming and hawing, @justinbois, we'd love to see a PR that contains an ECDF operation. We'd then review that, merge it, and separately add the compositor and derived Element on top of that. Hope that's ok!

justinbois commented 5 years ago

That's definitely ok! I should have a PR for you in the next few weeks.

This discussion about HoloViews' structure has been enlightening and enjoyable. I'm guessing you will migrate those considerations to new issue(s)?

jbednar commented 5 years ago

Thanks! I've opened one for Hex already, but @jlstevens, please open a new one that contains the above info and points to this issue; the new issue should contain the full plan, and then this issue can be the background for those interested.

justinbois commented 5 years ago

Oh, and one more thing about ECDFs:

They should not necessarily result in a hv.Curve element. I see three different ways of plotting ECDFs:

  1. As a staircase (most often encountered), which would be hv.Curve.
  2. As dots (at the concave corners of the "stairs"), which would be hv.Scatter. These are the actual data points.
  3. "Formally" (wherein the vertical lines of the staircase are omitted, kind of like this). This would be an overlay of hv.Curve and hv.Scatter.

I have found (3) to be truly awful to look at and would be happy to omit that.

I have found (2) to be really useful. In fact, the first ECDF I encountered (Fig. 1B of this paper) did it that way. It allows for nice hover tools over the glyphs, which are not possible with the staircase because of the ambiguity of hovering over its lines. I think HoloViews should enable both (1) and (2).

Finally, I sometimes overlay a bootstrap confidence interval of the ECDF, like below, which is nice functionality to have, but may be outside of HoloViews' scope (though maybe not, since some statistical computing, like KDEs, is already done).

[image: ECDF with an overlaid bootstrap confidence interval]
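For what such an overlay could look like, a self-contained sketch (the percentile choices are illustrative; hv.Spread comes up just below):

import numpy as np
import holoviews as hv

hv.extension('bokeh')

data = np.random.randn(100)
x = np.sort(data)
y = np.arange(1, len(x) + 1) / len(x)

# Resample the data, evaluate each bootstrap ECDF on the sorted grid,
# and take pointwise percentiles for the confidence band
boot = np.array([
    np.searchsorted(np.sort(np.random.choice(data, len(data))), x, side='right')
    for _ in range(1000)
]) / len(data)
lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)

band = hv.Spread((x, (lo + hi) / 2, (hi - lo) / 2)).opts(alpha=0.3)
band * hv.Curve((x, y)).opts(interpolation='steps-post')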

jbednar commented 5 years ago

If the ecdf operation returns a Curve, I'd assume the result can be passed to the Scatter constructor to create version 2: hv.Scatter(ecdf(hv.Scatter(df))). Showing the spread would be useful, yes, in which case the operation would return an overlay of the Area and Curve plots.

poplarShift commented 5 years ago

I agree @jbednar, though isn't there the Spread element for that (instead of Area)?

philippjfr commented 5 years ago

If the ecdf operation returns a Curve, I'd assume the result can be passed to the Scatter constructor to create version 2: hv.Scatter(ecdf(hv.Scatter(df))). Showing the spread would be useful, yes, in which case the operation would return an overlay of the Area and Curve plots.

I'd probably also expose this as a parameter on the operation, e.g. univariate_kde has a filled parameter, which becomes a plot option on the derived element and controls whether to return a Curve or Area.

jbednar commented 5 years ago

Sure, Spread is a closer fit. And yes, adding a parameter to switch between Area and Curve for the operation is useful, particularly if it is returning an Overlay rather than just one Element.

jbednar commented 4 years ago

Does any of this discussion need updating now that Elements store reproducible pipelines of the operations used to construct them? It doesn't have to, but it's now possible to e.g. get at the underlying data of a histogram whether or not it's available on .data of a Histogram element.