Aggregated pie charts

A common use case for pie charts resembles histograms: aggregating all entries for a given label into a single slice, either with an explicit values field or without one (in which case the value is just the count of items). I'm thinking about cases like https://simonbjohnson.github.io/Ebola-3W-Dashboard/ (the two pies in the middle aggregate row counts by either activity or country) and the canonical "sales by region" pies, where each slice is the sum of revenue within some category label.

In principle you can already do this with an aggregate transform, but this seems like a common enough use case (and transforms have enough drawbacks) that both we and our users would benefit from it being built-in pie functionality.
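To make the requested behaviour concrete, here is a minimal sketch of the aggregation in plain JavaScript. `aggregatePie` is a hypothetical helper, not part of plotly.js; it only assumes the existing pie attributes `labels` and `values`, and the sample data is made up.

```javascript
// Hypothetical helper, not part of plotly.js: collapse raw rows into one
// (label, value) pair per distinct label, in order of first appearance.
function aggregatePie(labels, values) {
  const totals = new Map();
  labels.forEach((label, i) => {
    // Without a values array, each row simply contributes a count of 1.
    const v = values ? Number(values[i]) : 1;
    totals.set(label, (totals.get(label) || 0) + v);
  });
  return {
    type: 'pie',
    labels: Array.from(totals.keys()),
    values: Array.from(totals.values())
  };
}

// Made-up sample rows.
const regions = ['East', 'West', 'East', 'North', 'West', 'East'];
const revenue = [10, 20, 5, 7, 1, 2];

const byRevenue = aggregatePie(regions, revenue); // "sales by region": sums per label
const byCount = aggregatePie(regions);            // row counts per label
```

The resulting objects are ordinary pre-aggregated pie traces, which is exactly what users have to build by hand today.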

Things to consider:

I think it's straightforward to do this within the pie type, rather than making a new type (like the distinction between bar and histogram). Does this cause any problems? At least at first, that would restrict it to discrete labels. I suppose down the line we could add binning, but that'd be a weird thing to do with a pie.
arrayOk attributes - do we need to provide e.g. color redundantly for every item? Do we just take the first value we find? Do we make some way to provide this once per slice, like a distinctlabels attribute that all the arrayOk items map to? I think we do NOT want to match the way we (accidentally) did it with histogram, where these attributes map to bins (in this case that would mean the distinct labels, in whatever order they might show up).
Event data - see #2071
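For the arrayOk question, the "take the first value we find" option could look roughly like the sketch below. `firstValuePerLabel` is a hypothetical helper and the colors are made-up sample data; this is one of the options listed above, not a decision.

```javascript
// Hypothetical helper: reduce a per-row attribute (e.g. a color given for
// every item) to one value per distinct label, keeping the first non-null
// value seen for that label. Slices come out in order of first appearance.
function firstValuePerLabel(labels, perRow) {
  const first = new Map();
  labels.forEach((label, i) => {
    if (!first.has(label)) first.set(label, undefined);
    if (first.get(label) === undefined && perRow[i] != null) {
      first.set(label, perRow[i]);
    }
  });
  return Array.from(first.values());
}

const rowLabels = ['a', 'b', 'a', 'c', 'b'];
const rowColors = ['red', null, 'blue', 'green', 'gold'];

// One color per slice (a, b, c); the later 'blue' for label 'a' is ignored.
const sliceColors = firstValuePerLabel(rowLabels, rowColors);
```

Note that conflicting values for the same label are silently dropped here, which is the main argument for an explicit once-per-slice mechanism instead.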

cc @etpinard @monfera

Lego-like construction

On one hand, bundling logic with view has benefits. A user who needs a histogram has the effortless task of just saying, type: 'histogram'. If she follows the docs and supplies proper attributes, she's in business.

On the other hand, we know that histogram is, conceptually, binning (a model) plus bar (a view). We also know that there is an infinite variety of models that combine nicely with bars; we could have waterfall, quantilebars (showing count or median of quantiles) etc., so I'm a bit uneasy about coupling models and views the way histogram does.

Yet histogram is by far the most common specialized aggregate bar chart, alongside, or second only to, a plain aggregate bar chart. What you ask about pies is equally true for bars, i.e. by analogy with histogram, we could have an aggregatebar which takes multivariate data, as parcoords or table does, and aggregates by some dimension.

Now it turns out that histogram is also an aggregatebar - where the aggregation is based on a tacked-on dimension we can just call bin.
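A rough sketch of that decomposition, assuming nothing about plotly.js internals: `binDimension` and `aggregateBy` are hypothetical building blocks, and the numbers are made up. Binning only adds a derived dimension; the same generic aggregation then serves both the histogram and the plain aggregate bar.

```javascript
// Hypothetical block 1: derive a "bin" dimension (the bin's lower edge) per row.
function binDimension(xs, start, size) {
  return xs.map(x => start + Math.floor((x - start) / size) * size);
}

// Hypothetical block 2: group values by key and reduce each group.
// Groups come out in order of first appearance.
function aggregateBy(keys, values, reduce) {
  const groups = new Map();
  keys.forEach((key, i) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(values[i]);
  });
  return Array.from(groups, ([key, vals]) => ({ key, value: reduce(vals) }));
}

const count = vals => vals.length;
const sum = vals => vals.reduce((a, b) => a + b, 0);

// Histogram = aggregate by the tacked-on bin dimension.
const xs = [0.5, 1.2, 1.8, 3.1, 0.9];
const bins = binDimension(xs, 0, 1);
const histogram = aggregateBy(bins, xs, count);

// Plain aggregate bar = the same aggregation over an existing dimension.
const salesByRegion = aggregateBy(['East', 'West', 'East'], [10, 20, 5], sum);
```

Empty bins do not appear in this sketch's output; a real histogram would need to fill them in.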

There are benefits to reflecting such structures in code and in the API. In code, it can cut a lot of duplication, because things build on one another (ofc I'm not saying histogram doesn't reuse bar). Fewer things to fix, modify, keep in sync or bundle. Yet the bigger thing is that, if it's exposed to the user, it allows her to construct her particular visualizations as if built from Lego (at least that's the pie in the sky theory). It also enables users to bring their own alternative Lego blocks, such as a binner more appropriate to them.

Why is it not enough to let users do their own preaggregation? If they have an alternative binner, they can compute the histogram heights themselves and use our bar. True, the chart will render, but it'll lack granularity. For example, plotly.js will have no information about which scatter points to highlight when the user crossfilters on a bar, because the aggregation is opaque to it. So we need a solution that preserves the links between the original, atomic data and the glyphs or markers representing aggregates.
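One way to keep those links, sketched under the assumption that each aggregate simply remembers the indices of the rows it was built from. `aggregateWithLinks` and the `pointIndices` field are hypothetical names used for illustration, and the data is made up.

```javascript
// Hypothetical helper: sum values per key, and keep the indices of the
// original rows that contributed to each aggregate.
function aggregateWithLinks(keys, values) {
  const groups = new Map();
  keys.forEach((key, i) => {
    if (!groups.has(key)) groups.set(key, { key, sum: 0, pointIndices: [] });
    const group = groups.get(key);
    group.sum += values[i];
    group.pointIndices.push(i);
  });
  return Array.from(groups.values());
}

const category = ['A', 'B', 'A', 'C', 'B'];
const amount = [4, 2, 6, 1, 3];

const bars = aggregateWithLinks(category, amount);

// When the user selects the 'B' bar, the linked scatter points are known.
const selectedBar = bars.find(bar => bar.key === 'B');
const highlight = selectedBar.pointIndices;
```

With user-side preaggregation, only `key` and `sum` would reach the chart, and `highlight` could not be recovered.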

What's the context here? It's at least the dashboard (or its analogs, a Jupyter page or a scrollytelling webpage with multiple plots), because a bunch of interactions, especially things like crossfiltering or recovering the currently highlighted aggregates in a small table, need to work.

The result is that we can consider our pipeline a directed acyclic graph which

originates perhaps in a single node (e.g. a datagrid on Plot.ly), or in several nodes (multiple data sources in a notebook, or multiple online feeds),
goes through various processing nodes (filtering, simple aggregation, sorting, binning or other statistics, contour calculation etc.),
and ends up in some number of views.

This DAG is not necessarily a layered graph: the user may have a bunch of nodes in sequence along some paths, and a single transform or a direct rendering of data to a view along others.
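As a toy illustration of such a pipeline, the sketch below evaluates a small graph where one path runs source, filter, aggregate, view and another renders the source directly. The node shape (`inputs`, `compute`), the `evaluate` function and the sample rows are all assumptions made for this example, not an existing plotly.js structure.

```javascript
// Hypothetical node shape: { inputs: [nodeNames], compute: (...inputValues) => value }
// Evaluate one node, computing each upstream node at most once per cache.
function evaluate(graph, name, cache = new Map(), visiting = new Set()) {
  if (cache.has(name)) return cache.get(name);
  if (visiting.has(name)) throw new Error('cycle detected at ' + name);
  visiting.add(name);
  const node = graph[name];
  const args = node.inputs.map(input => evaluate(graph, input, cache, visiting));
  visiting.delete(name);
  const value = node.compute(...args);
  cache.set(name, value);
  return value;
}

let sourceReads = 0; // counts how often the source node is actually computed

const graph = {
  rows: {
    inputs: [],
    compute: () => {
      sourceReads += 1;
      return [
        { region: 'East', revenue: 10 },
        { region: 'West', revenue: 20 },
        { region: 'East', revenue: 5 },
        { region: 'West', revenue: 0 }
      ];
    }
  },
  positive: { inputs: ['rows'], compute: rows => rows.filter(r => r.revenue > 0) },
  byRegion: {
    inputs: ['positive'],
    compute: rows => {
      const totals = new Map();
      rows.forEach(r => totals.set(r.region, (totals.get(r.region) || 0) + r.revenue));
      return totals;
    }
  },
  // Long path: source -> filter -> aggregate -> view.
  pieView: {
    inputs: ['byRegion'],
    compute: totals => ({ type: 'pie', labels: [...totals.keys()], values: [...totals.values()] })
  },
  // Short path: source rendered directly, so the graph is not layered.
  scatterView: {
    inputs: ['rows'],
    compute: rows => ({ type: 'scatter', y: rows.map(r => r.revenue) })
  }
};

const cache = new Map();
const pie = evaluate(graph, 'pieView', cache);
const scatter = evaluate(graph, 'scatterView', cache);
```

Sharing the cache between the two views means the source node is computed only once, which is the property a dashboard with several linked plots would rely on.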

It can go further than that. Individual plots, even a lowly scatter, use aggregate statistics. For example, we compute the min/max per axis so that we zoom to the right area. But there are other valid approaches, e.g. a fixed domain (which we support), or margining (fixed, relative or even deviation based), etc.
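The axis-range alternatives mentioned here can be seen as interchangeable small blocks. The sketch below is only illustrative: the strategy names and option fields (`range`, `fraction`, `k`) are made up for this example and are not plotly.js attributes.

```javascript
// Min and max of the data.
const extent = ys => [Math.min(...ys), Math.max(...ys)];

// Hypothetical, interchangeable range strategies: (data, options) => [lo, hi]
const rangeStrategies = {
  // Plain data extent.
  minmax: ys => extent(ys),
  // Fixed domain supplied by the user; the data is ignored.
  fixed: (ys, opts) => opts.range.slice(),
  // Extent padded by a fraction of its width on each side.
  relativeMargin: (ys, opts) => {
    const [lo, hi] = extent(ys);
    const pad = (hi - lo) * opts.fraction;
    return [lo - pad, hi + pad];
  },
  // Mean plus or minus k population standard deviations.
  deviation: (ys, opts) => {
    const mean = ys.reduce((a, b) => a + b, 0) / ys.length;
    const variance = ys.reduce((a, y) => a + (y - mean) ** 2, 0) / ys.length;
    const sd = Math.sqrt(variance);
    return [mean - opts.k * sd, mean + opts.k * sd];
  }
};

const ys = [2, 4, 4, 4, 5, 5, 7, 9]; // made-up sample, mean 5, standard deviation 2

const auto = rangeStrategies.minmax(ys);
const fixed = rangeStrategies.fixed(ys, { range: [0, 10] });
const padded = rangeStrategies.relativeMargin(ys, { fraction: 0.25 });
const spread = rangeStrategies.deviation(ys, { k: 2 });
```

Because every strategy has the same signature, a plot could swap one for another without knowing how the range was derived.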

If we deconstruct what individual plots mean, we'll end up with a smallish vocabulary and a compact grammar to bind it together, out of which lots of things can be composed. For example, it becomes easy to swap an SVG Lego block for simple point rendering with a Canvas or WebGL Lego block, because the block serves a very limited, narrow, well-specified purpose with clean connections (an interface), no matter where it is in the DAG. This is basically a key idea behind ggplot2 and Vega too.

tl;dr

So in short, I think that a new pie with built-in aggregation might be useful for clients, but ideally we would separate two concerns. One is the outer API, which is biased towards high-frequency use and supplies a set of common widgets with sensible defaults. The other is a more technical, internal structure that maps conceptual relations based on what they are, and that could perhaps also be exposed to clients who'd like to build custom visualizations with bespoke interactions.

plotly / plotly.js

Aggregated pie charts #2073
