Support violin plot and probability density plots

kanitw commented 6 years ago

From https://vega.github.io/vega/examples/violin-plot/

A violin plot visualizes a distribution of quantitative values as a continuous approximation of the probability density function, computed using kernel density estimation (KDE). The densities are additionally annotated with the median value and interquartile range, shown as black lines. Violin plots can be more informative than classical box plots.

https://vega.github.io/vega/examples/probability-density/ is another related example

[ ] Understand https://vega.github.io/vega/examples/violin-plot/ and https://vega.github.io/vega/examples/probability-density/ examples thoroughly, search online to understand other violin and density plot variants, and define the scope that we want to support.
[ ] Understand how we implement composite marks thoroughly by looking at the [box-plot codebase](https://github.com/vega/vega-lite/blob/master/src/compositemark/boxplot.ts). (By summer, we should have a reasonable error-bar example as well.)
[ ] Design density transform in Vega-Lite and see if we can already use area mark to reproduce the density area for violin.
[ ] Design composite mark syntax for violin (and density plot?)
- [ ] First we can focus on just the violin area part: design MarkDefinition block for Violin so that we can define property of the underlying density transform and other related properties
- [ ] Decide if we need a composite mark for density plot -- (probably yes), and make sure that the syntax for violin and density are consistent. (Also think if there is a better name for density too)
- [ ] For violin plot, we need to decide if we want to include interquartile range and median as a part of the violin composite mark (which is sort of like the "box" overlay on top of violin plot). The syntax here should be very consistent with box-plot.
[ ] Implement the code. Note that there is probably a good way to share at least some part of the implementation between the violin and density plot.

kanitw commented 6 years ago

The tricky part about this is that Vega's violin plot depends on the Vega facet operator to split data into subgroups before passing it to the density transform. (Density happens inside a nested facet.)

1) Consider the solution above that suggests implementing the density transform first.
Given that VL's facet always applies layout, we can't reproduce the violin example with an axis by implementing density as a transform unless we do one of the following:

a) Make the Vega density transform support groupby (which is basically in-place faceting)
b) Support a variant of facet without layout (pure facet in the data transformation sense)

Note: we can reproduce violin plot using VL facet operator, but we will then rely on row instead of y position for each violin.

2) Alternatively, we could consider implementing violin as its own special mark that produces the underlying density transform. However, this approach will be less composable. (For example, density plots https://vega.github.io/vega/examples/probability-density/ shouldn't be their own mark but should rather use an area mark to plot the output of density transforms.)

kanitw commented 6 years ago

We met today to talk about this and concluded that we should make the Vega density transform support groupby.
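To make the groupby idea concrete, here is a minimal sketch in plain Python of what "density with groupby" means: partition the rows by a key, estimate one density per partition, and return flat rows so that no nested facet is needed. This is not Vega's implementation. The function names, the fixed bandwidth, and the toy data are assumptions made for illustration only.

```python
import math
from collections import defaultdict


def gaussian_kde(values, bandwidth, xs):
    """Evaluate a Gaussian kernel density estimate of `values` at each x in `xs`."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in xs
    ]


def grouped_density(rows, field, groupby, extent, steps=50, bandwidth=1.0):
    """Estimate one density per `groupby` value and return flat output rows."""
    lo, hi = extent
    xs = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    groups = defaultdict(list)
    for row in rows:
        groups[row[groupby]].append(row[field])
    out = []
    for key, values in groups.items():
        for x, d in zip(xs, gaussian_kde(values, bandwidth, xs)):
            out.append({groupby: key, "value": x, "density": d})
    return out


# Toy data (hypothetical): two groups with different centers.
rows = [
    {"species": "a", "size": 1.0}, {"species": "a", "size": 2.0},
    {"species": "b", "size": 5.0}, {"species": "b", "size": 6.0},
    {"species": "b", "size": 7.0},
]
density_rows = grouped_density(rows, "size", "species", extent=(0.0, 8.0), steps=5)
```

Because every group is sampled on the same grid and the output is a single flat table, one scale can be derived from all densities at once, which is the property the nested-facet version lacks.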

HarvsG commented 5 years ago

We met today to talk about this and concluded that we should make the Vega density transform support groupby.

@kanitw Any progress on implementation?

kanitw commented 5 years ago

No update yet

romainmartinez commented 5 years ago

Thanks for Vega-Lite.

I often use violin plots and I am looking forward to using them in Vega-Lite/Altair.

In addition, I use a lot of ridge plots (half violin) like this one:

[Image: mcmc_areas-rstanarm]

Would you consider adding an option to the violin plot to allow similar figures to be made?

I made an implementation in Python, with mark area and a custom KDE function, but it is rather tedious.

Also, would similar figures be possible with histograms (for discrete variables)?

I'm sure anyone using Bayesian statistics would be grateful.

domoritz commented 5 years ago

Yes, once we have a kde transform in Vega, we can also support ridge plots.

denisshepelin commented 5 years ago

Yes, once we have a kde transform in Vega, we can also support ridge plots.

Has it already landed in Vega 5.0 (https://vega.github.io/vega/docs/transforms/density/)?

domoritz commented 5 years ago

We've had this transform for a while but it does not support faceting and that's a deal breaker. We've come to the conclusion that we need a kde transform that has a group by key.

domoritz commented 5 years ago

Depends on https://github.com/vega/vega/pull/1783

jheer commented 5 years ago

Once the new Vega KDE support lands, I think the first step here is probably to add a new density transform to Vega-Lite that maps to the Vega kde transform, with syntax such as:

{
  density: string; // value field to estimate density for
  groupby?: string[];
  method?: 'pdf' | 'cdf';
  extent?: [number, number];
  bandwidth?: number;
  steps?: number;
  as?: [string, string]
}

I think it should be called density rather than kde, as (1) density is a proper word, not an abbreviation, and (2) I can imagine extending the implementation in the future to fit a normal density (or log-normal, or Poisson, etc) to the input data, not just a kernel density estimate.

domoritz commented 5 years ago

Maybe method?: 'pdf' | 'cdf'; -> cumulative?: boolean. as should not be optional in Vega-Lite.
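For reference, this is roughly what a spec using the proposed transform could look like with the `cumulative` flag and an explicit `as`, written as a Python dict and serialized to JSON. It is a sketch of the proposal in this thread, not a statement about what was eventually released. The field names are borrowed from the cars example used later in the thread, and the data URL is a placeholder assumption.

```python
import json

# Proposed density transform, with `cumulative` instead of `method`
# and `as` given explicitly. Property names follow the proposal above.
density_transform = {
    "density": "Miles_per_Gallon",   # value field to estimate density for
    "groupby": ["Origin"],           # one density per group
    "cumulative": False,             # False -> PDF, True -> CDF
    "extent": [5, 50],
    "steps": 100,
    "as": ["Miles_per_Gallon", "density"],
}

spec = {
    "data": {"url": "data/cars.json"},  # placeholder URL (assumption)
    "transform": [density_transform],
    "mark": {"type": "area", "opacity": 0.5},
    "encoding": {
        "x": {"field": "Miles_per_Gallon", "type": "quantitative"},
        # stack=None overlays the group densities instead of stacking them
        "y": {"field": "density", "type": "quantitative", "stack": None},
        "color": {"field": "Origin", "type": "nominal"},
    },
}

print(json.dumps(spec, indent=2))
```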

jheer commented 5 years ago

@domoritz I definitely prefer your suggestion of cumulative?: boolean.

Also, when adding violin plots we may want to support multiple scaling options. The default (at present) is that all violins share the same scale based on the sampled density estimates, which of course was a primary motivation for adding the kde transform with groupby support in Vega. We may still also want to support other forms of scaling or normalization.

The reason I'm thinking about this is that, if an explicit bandwidth parameter is not applied, each group will have its bandwidth independently set using an estimation heuristic. This means that each plot has a different kernel width, which in turn means that one could have potentially large disparities in how much of the probability mass gets "clipped" when drawing violins only over the domain of observed data values. The tails of the KDE distribution get cut off, such that the total amount of probability mass shown in each violin is unequal. (This issue can still arise with a shared bandwidth parameter, it's just not as extreme.) It may be that the "right" thing to do is add a normalization pass in the KDE transform whenever we have more than one group.

So, I think we might need to do some additional research into the "proper" scaling and trimming of violins. I don't know how carefully other tools have looked at this!
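The clipping effect can be quantified with a small worked example. For a Gaussian kernel, the mass each data point contributes inside the observed range has a closed form via the normal CDF, so the retained fraction is easy to compute. The sketch below assumes a Gaussian kernel and trimming exactly at the minimum and maximum of the data; the function name and sample values are made up for illustration.

```python
from statistics import NormalDist


def mass_inside_range(values, bandwidth):
    """Fraction of a Gaussian KDE's probability mass lying within [min, max] of the data."""
    lo, hi = min(values), max(values)
    std_normal = NormalDist()
    per_point = [
        # Mass of the kernel centered at v that falls between lo and hi.
        std_normal.cdf((hi - v) / bandwidth) - std_normal.cdf((lo - v) / bandwidth)
        for v in values
    ]
    return sum(per_point) / len(values)


values = [1.0, 2.0, 2.5, 3.0, 4.0]
narrow = mass_inside_range(values, bandwidth=0.2)
wide = mass_inside_range(values, bandwidth=1.0)
print(f"bandwidth 0.2 keeps {narrow:.2f} of the mass, bandwidth 1.0 keeps {wide:.2f}")
```

With these values the narrow kernel keeps about 0.80 of the mass and the wide kernel about 0.70, so two groups with identical data but different heuristic bandwidths would show unequal areas after trimming. Dividing each trimmed density by its retained fraction is one possible form of the normalization pass mentioned above.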

jheer commented 5 years ago

The ggplot violin options page shows that these questions are largely left to end users, with the default being the same as proposed above (without normalization of trimmed density areas):

From https://ggplot2.tidyverse.org/reference/geom_violin.html:

trim | If TRUE (default), trim the tails of the violins to the range of the data. If FALSE, don't trim the tails.
scale | if "area" (default), all violins have the same area (before trimming the tails). If "count", areas are scaled proportionally to the number of observations. If "width", all violins have the same maximum width.

Note that Vega currently supports options corresponding to ggplot's area and width values for the scale parameter, based on how we configure the scale domain. Our KDE implementation normalizes (divides by the number of data points) to form a proper PDF, so we could support a count option (if desired) by multiplying the estimated density by the count of points within a group. If that is of interest we could update the kde transform accordingly.
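The three ggplot scale modes can be expressed as simple rescalings of the sampled PDF values. The following sketch is plain Python and not part of any Vega API; the function name, the `mode` argument, and the sample numbers are assumptions that mirror ggplot's `scale` parameter.

```python
def scale_violins(densities, counts, mode="area"):
    """Rescale per-group PDF samples following ggplot's `scale` options.

    densities: {group: [pdf samples]}, counts: {group: number of observations}.
    """
    if mode == "area":
        # A PDF already integrates to 1 before trimming, so it is used unchanged.
        return {g: list(d) for g, d in densities.items()}
    if mode == "count":
        # Multiply by the group size so that area is proportional to observations.
        return {g: [v * counts[g] for v in d] for g, d in densities.items()}
    if mode == "width":
        # Divide by the group's own maximum so every violin reaches the same width.
        return {g: [v / max(d) for v in d] for g, d in densities.items()}
    raise ValueError(f"unknown scale mode: {mode}")


densities = {"a": [0.1, 0.4, 0.1], "b": [0.2, 0.2, 0.2]}
counts = {"a": 10, "b": 30}
by_count = scale_violins(densities, counts, mode="count")
by_width = scale_violins(densities, counts, mode="width")
```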

domoritz commented 5 years ago

@jheer said about implementing violin plots with the new KDE transform in Vega:

The issue is not one of performance or extra transforms, but of correctness. (FWIW, I'd want to avoid a "density-center" option, as that strikes me as confusing and an abstraction-level violation.) The previous Vega violin plot example used stack, and it worked because all densities were scaled independently and so used the full width/height of the scale band. But this independent scaling is misleading and hampers accurate comparison.

The new KDE transform supports groupby, so we can use the output to define the domain of a scale at the top-level, which then scales all the densities in a proper fashion. The result is that different densities have different max width/height. Yet, the stack transform center option only centers the mark relative to the observed height (not the max height among all densities), causing inappropriate, non-uniform center-line offsets for the different densities.

My solution in Vega is to instead use xc/width or yc/height for the violin densities (as well as using xc or yc for the median and IQR annotations). This is simple and correct. A top-level linear scale is used to provide the width / height values.
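A rough numeric illustration of the xc/width approach: a single linear scale maps density to width using the maximum over all groups, and each span is centered on its band center. This is a plain Python sketch of the geometry only, not Vega code; the function name, band centers, and band width are hypothetical.

```python
def violin_spans(densities, band_centers, band_width):
    """Return (x0, x1) spans per density sample, centered on each group's band center.

    The density-to-width scale is shared: [0, global max density] -> [0, band_width].
    """
    global_max = max(max(samples) for samples in densities.values())
    spans = {}
    for group, samples in densities.items():
        xc = band_centers[group]
        spans[group] = []
        for d in samples:
            width = band_width * d / global_max
            spans[group].append((xc - width / 2, xc + width / 2))
    return spans


densities = {"a": [0.1, 0.4, 0.1], "b": [0.2, 0.2, 0.2]}
band_centers = {"a": 50.0, "b": 150.0}
spans = violin_spans(densities, band_centers, band_width=80.0)
```

Every span stays symmetric around its own band center regardless of how wide the group's density gets, and only the group that attains the global maximum fills the full band.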

kanitw commented 5 years ago

Btw, I ran into a "split violin plot" in seaborn. It's definitely worth considering how this fits into our grammar.

jheer commented 5 years ago

Interesting! An alternative that might be a bit better perceptually could be to directly layer (overlay) the conditional violins (or zero-baseline distribution areas) with some opacity. That would make the value and shape comparisons even more apparent. I hope new VL extensions can also support that, which should hopefully be simpler to specify (or, at least, require less new surface area).

cmcaine commented 4 years ago

Ridge plots are another alternative for this kind of thing and often work well.

There's a good package for ggplot for generating them.

SamWoolerton commented 4 years ago

Looks like ridge plots are supported now (can groupby in density transform), haven't figured out how to pull off violin plots yet though

mjskay commented 4 years ago

@domoritz said my comments were welcome so here you go. Do tell me if this is off topic :)

Basically my feeling about a lot of uncertainty vis these days is you break it into (1) a representation of a distribution (be it analytical or empirical) as a PDF (f(x)), CDF (F(x)), and inverse CDF (F^-1(x)); and (2) mappings of those functions onto visual channels.

Then the question is, is there a mark/geom (probably closest is area in vega-lite, though it might not be quite the right one---can you map a continuous variable onto color in an area?) that lets you use those mappings to create densities, violins, gradient plots, CDF barplots, etc. FWIW, I made a "slab" geom for doing this in tidybayes on top of ggplot (and a composite "slabinterval", which is a slab combined with an interval). All of the geoms below (except the dotplots) are just shortcuts for different variants of the underlying slab+interval geom:

It's a bit different from how area works in either ggplot or vega-lite in that, because it is not intended for stacking, it does not use the "y" aesthetic/channel for the height of the slab; rather it uses "thickness" (or I suppose you could call it "width" but that already has another meaning in ggplot). This allows you to map a different variable to the y axis to easily create ridge plots / half-eye densities / etc where you would normally use intervals, without having to screw around with creating facets (this is incredibly useful for visualizing coefficients and the like, because creating facets just for coefficients is a pain --- you have to mess with header text angle usually --- plus often you want to facet over something else). It also allows color and opacity to vary within the geom, which is useful for creating gradient plots and for creating densities with highlighted regions.

Anyway the upshot is, if you think in terms of abstract grammar-of-graphics mappings from data onto channels (so, not about the particular syntax of a given package, but a formal description of the visualization: "z -> x position" being the equivalent of aes(x = z) in ggplot or an encoding of {"x": "z"} in vega-lite), you might have a density plot for a variable z described as something like this:

z -> x position
f(z) -> thickness

or a gradient plot described as:

z -> x position
f(z) -> opacity

or a CCDF barplot described as:

z -> x position
1 - F(z) -> thickness

If you then add in the ability to do densities / CDFs / etc of analytical distributions (which is what the stat_dist_slab geom does), you can do the equivalent of:

z -> x position
f_Normal(z|mu, sigma) -> thickness

Which is how you'd do a density plot for a normal distribution. Given an implementation of the Normal and the scaled-and-shifted t distribution you'd be able to do confidence distributions for a lot of common ways of summarizing uncertainty from frequentist models (so that gets you, basically, halfeyes / gradient plots / whatever else for visualizing uncertainty).
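These abstract mappings can be made concrete by tabulating the functions of an analytical distribution and treating each column as the input to a channel. The sketch below uses Python's standard-library `NormalDist`; the row layout and column names are assumptions for illustration and do not correspond to ggdist or Vega-Lite syntax.

```python
from statistics import NormalDist


def slab_rows(dist, zs):
    """Tabulate the functions of `dist` that the abstract mappings refer to."""
    return [
        {
            "z": z,                   # z -> x position
            "pdf": dist.pdf(z),       # f(z) -> thickness (density) or opacity (gradient)
            "ccdf": 1 - dist.cdf(z),  # 1 - F(z) -> thickness (CCDF barplot)
        }
        for z in zs
    ]


dist = NormalDist(mu=0.0, sigma=1.0)  # f_Normal(z | mu, sigma)
zs = [i / 10 for i in range(-30, 31)]
rows = slab_rows(dist, zs)

# Choosing a chart type is then only a choice of which column feeds which channel.
density_plot = [(r["z"], r["pdf"]) for r in rows]
ccdf_barplot = [(r["z"], r["ccdf"]) for r in rows]
```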

Last bit is being able to map color within slabs means given a data table roughly like this:

dist	theta
normal	[0,1]
student_t	[3,0,1]

You can do stuff like:

x -> x position
dist -> y position
f_{dist}(x|theta) -> thickness
|x| < 1.5 -> fill color

Which yields something like this:

Anyway, I don't have specific suggestions for how these abstract specifications turn into syntax necessarily. What I did with slabinterval doesn't look exactly like the above abstract syntax, but I have found it helpful for thinking more formally about these visualization types.

kanitw commented 4 years ago

@mjskay -- Your comment is definitely very useful.

When we work more on this, we'll have to see how this interplays with the offset channel that we plan to add (#4703).

mjskay commented 4 years ago

That's a good point --- having a different channel for thickness (rather than x/y) was partly motivated by how dodging works in ggplot (which is what offset is for in vega-lite?) because it makes it easy to do stuff like this:

which is pretty common when visualizing estimates from groups/subgroups

joelostblom commented 3 years ago

Although there is no dedicated mark for this yet, I noticed that #5066 has been implemented, so is it possible to manually map the area width/height to the density value instead of dedicating one of the axes to this? I would like to make a plot where the y-axis is categorical with one density per y-value and then also facet this plot, so I can't use the trick in the altair gallery where the facets essentially replace the y-axis. Like the boxplot below, but with violins/ridges/densities:

For now I am using a binned mark point with the size set to count to approximate a stepwise distribution, which looks pretty cool but is not very formal =) At least it captures multimodality better than a box plot.

joelostblom commented 2 years ago

I am planning to use VL/Altair for a course I will be teaching several months from now where we will need to create violin plots. Since it was mentioned in #4384 that density visualization shortcuts might see some development after the interactions were revamped, I just wanted to check in on whether there has been any internal discussion around where on the roadmap adding violin plots might fit in. I am really looking forward to having this together with the new offset channel, which is already going to be super helpful on its own. Thanks for continuously working on improving VL!

domoritz commented 2 years ago

You're very welcome. I'm excited to hear that you are planning a course with Vega-Lite/Altair. Are you using https://github.com/uwdata/visualization-curriculum?

Density visualizations were the next big thing I wanted to work on for Vega-Lite but I didn't get to it so there is no planned release date.

joelostblom commented 2 years ago

Thanks for the update! Yes I will be mixing from that and a few other courses I have developed previously. This one is going to have more emphasis on comparing distributions for many categories and I am hoping to include options that address the shortcomings of boxplots. Maybe I will try to create something with density plots via faceting, or compute KDEs via Python and use that together with the new offset channel to lay out points as violins, but there will likely be a fair bit of starter code that makes it less intuitive than what mark_violin would be.

Edit: Added an example in https://github.com/vega/vega-lite/issues/8067 of how this can be achieved for density clouds in Altair and Vega (but not yet Vega-Lite)

domoritz commented 2 years ago

Totally agree. Great to hear that you have ideas for workarounds for now, though.

kanitw commented 2 years ago

FWIW, we have violin plot example in https://observablehq.com/@vega/vega-lite-distribution-plots

joelostblom commented 2 years ago

Thanks for adding to this issue, I can add some more thoughts myself too. I think one of the great advantages with a dedicated violin mark would be the ability to use it to easily compare and dissect multiple distributions within the same chart with a relatively simple spec without any transforms, and that is compatible with categorical axes, facets, offsets, and coloring. Something like this:

{
  "config": {"view": {"continuousWidth": 300}},
  "data": {
    "url": "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv"
  },
  "mark": "violin",
  "encoding": {
    "x": {"field": "tip", "type": "quantitative"},
    "y": {"field": "time"},
    "color": {"field": "smoker"},
    "yOffset": {"field": "smoker"},
    "row": {"field": "sex"}
  }
}

Which works for the boxplot mark and creates this useful visualization:


If I understand correctly, one of the main issues is that the area of the density currently needs a dedicated axis for its height. Is it possible that violinplots could piggyback on the same mechanism/graphical channel that boxplots are using to define the height of the box and use it for the height of the density/violin area?

domoritz commented 2 years ago

Is it possible that violinplots could piggyback on the same mechanism/graphical channel that boxplots are using to define the height of the box and use it for the height of the density/violin area?

I guess they could. I hadn't thought of this idea since I thought we should have an axis to tell us what the height of the violin means but realize now that's not the case.

apraga commented 1 year ago

Hi, just a small bump to check whether there has been any progress on this. Thanks!

joelostblom commented 9 months ago

I discussed this with @kanitw and @jheer and we suggest that an MVP for this would be a composite mark that works similarly to how the boxplot and errorbar/band composite marks work currently. The first version of this MVP would use the offset channel to map the width of the violin (i.e. the value of the density), and would thus not be compatible with an additional categorical grouping in the Offset channels (later versions ideally would be, but this is a good start).

If someone would be interested in helping with this, we very much welcome contributions. The source for the existing composite marks is in https://github.com/vega/vega-lite/tree/main/src/compositemark, which could serve as a starting point for creating the violin MVP. The relevant parameters to pass from the mark to the underlying density transform would be bandwidth, extent and steps as a start.
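As a rough picture of what such a composite mark would do, the sketch below rewrites a spec with a hypothetical "violin" mark into a density transform plus an area mark whose offset channel carries the density. Everything here is an assumption for illustration: Vega-Lite has no "violin" mark type, the real composite marks are written in TypeScript, and this only handles the horizontal case (quantitative field on x, category on y) without mirroring the density.

```python
import copy

# Mark properties forwarded to the density transform, as suggested above.
DENSITY_PARAMS = ("bandwidth", "extent", "steps")


def expand_violin(spec):
    """Rewrite a hypothetical violin mark into density transform + area mark."""
    mark = spec["mark"]
    if isinstance(mark, str):
        mark = {"type": mark}
    if mark.get("type") != "violin":
        return spec  # not a violin spec, leave untouched

    encoding = copy.deepcopy(spec["encoding"])
    value_field = encoding["x"]["field"]
    group_field = encoding["y"]["field"]

    density = {
        "density": value_field,
        "groupby": [group_field],
        "as": [value_field, "density"],
    }
    for name in DENSITY_PARAMS:
        if name in mark:
            density[name] = mark[name]

    # The MVP maps the density value to the offset channel.
    encoding["yOffset"] = {"field": "density", "type": "quantitative"}

    expanded = {
        k: copy.deepcopy(v)
        for k, v in spec.items()
        if k not in ("mark", "encoding", "transform")
    }
    expanded["transform"] = copy.deepcopy(spec.get("transform", [])) + [density]
    expanded["mark"] = {"type": "area", "interpolate": "linear-closed"}
    expanded["encoding"] = encoding
    return expanded


violin_spec = {
    "data": {"url": "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv"},
    "mark": {"type": "violin", "bandwidth": 0.5},
    "encoding": {
        "x": {"field": "tip", "type": "quantitative"},
        "y": {"field": "time"},
    },
}
expanded = expand_violin(violin_spec)
```

Because the density occupies the offset channel, a categorical yOffset grouping cannot be combined with it in this form, which is the limitation of the first MVP version noted above.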

mjskay commented 9 months ago

If it's helpful, I finally wrote up a description of how ggdist handles this kind of thing (to appear at VIS this year): https://osf.io/2gsz6

joelostblom commented 9 months ago

Thank you for posting that @mjskay ! It looks super interesting and on a bit of a side note from this discussion I wish there was a similar package like ggvis for visualizing uncertainty with vega-lite. It seems like a library that supports interactivity (and soon hopefully also animation) like vega-lite could be particularly suitable to visualize uncertainty in an intuitive way (I'm thinking of e.g. animated hypothetical outcome plots).

More on topic, I played around a bit with what the Altair/VL spec would look like for a violin plot and came up with some ideas as well as identified some obstacles. @kanitw @domoritz Do you think adding either a y2Offset channel or stack='center' to the existing yOffset channel would be something you are interested in supporting and not too much work to implement? That would allow a relatively simple spec like this to be extended to support violin charts (this could already be the spec for the 1D density mark with a calculate transform that inverts the density values and a shift of the baseline of the tick):

import altair as alt
from vega_datasets import data

alt.Chart(data.cars(),height=200).transform_density(
    'Miles_per_Gallon',
    as_=['Miles_per_Gallon', 'density'],
    extent=[5, 50],
    groupby=['Origin']
).mark_area(interpolate='linear-closed').encode(  # needed to avoid areas extending to 0
    alt.X('Miles_per_Gallon:Q'),
    alt.Y('Origin:N'),
    alt.YOffset('density:Q'),
    alt.Color('Origin:N'),
)

The main issue is getting the reflected density, which is what the suggestions above would help with. It's currently almost possible to get all the way there by doing something more complicated like the following, but the areas are not quite overlapping (as can be seen in the small white gap in the red violin):

alt.Chart(data.cars.url, height=200).transform_density(
    'Miles_per_Gallon',
    as_=['Miles_per_Gallon', 'density'],
    extent=[7, 50],
    groupby=['Origin']
).transform_calculate(
    density2='-datum.density'
).transform_fold(
    ['density', 'density2'],
).mark_line(interpolate='linear-closed').encode(
    alt.X('Miles_per_Gallon:Q'),
    alt.Y('Origin:N'),
    alt.YOffset('value:Q'),
    alt.Detail('key:N'),
    alt.Color('Origin:N'),
    alt.Stroke('Origin:N'),
    alt.Fill('Origin:N')
)

Interestingly, if you change to an area mark instead of a line mark in the spec above, you get an odd chart with unknown observations in a fourth "origin" category (I think the first approach is way simpler, as long as a y2Offset encoding channel or a stack=True option for the existing yOffset channel is supported to help create the reflected density):

Edit: Note that all these charts look odd without the color encoding since the area mark is then connecting the charts with each other. We would therefore not only need to reserve the Offset channels for the density value but also the Detail channel to make sure that the areas stay disconnected from each other.

vega / vega-lite

Support violin plot and probability density plots #3442