plotly / plotly.js

Open-source JavaScript charting library behind Plotly and Dash
https://plotly.com/javascript/
MIT License

Aggregated pie charts #2073

Closed alexcjohnson closed 6 years ago

alexcjohnson commented 6 years ago

A common use case for pie charts is like histograms - aggregating all entries for a given label into a single slice, either with an explicit values field or without (then the value is just the count of items). I'm thinking about cases like https://simonbjohnson.github.io/Ebola-3W-Dashboard/ (the two pies in the middle are aggregating row counts by either activity or country) and the canonical "sales by region" pies where each slice is the sum of revenue within some category label.
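As a minimal sketch of the aggregation described above, the helper below collapses rows that share a label into one slice, summing `values` when given and counting rows otherwise. `aggregateByLabel` and the sample data are hypothetical and only illustrate the idea; the result is fed into an ordinary pie trace via its `labels` and `values` attributes.

```javascript
// Hypothetical helper (not part of plotly.js): collapse rows that share a
// label into one slice. Without values, each row contributes a count of 1.
function aggregateByLabel(labels, values) {
  const totals = new Map(); // Map keeps labels in first-seen order
  labels.forEach((label, i) => {
    const v = values ? values[i] : 1;
    totals.set(label, (totals.get(label) || 0) + v);
  });
  return { labels: [...totals.keys()], values: [...totals.values()] };
}

// Made-up sample rows, for illustration only.
const regions = ['East', 'West', 'East', 'North', 'West', 'East'];
const revenue = [120, 80, 30, 50, 20, 50];

const byRevenue = aggregateByLabel(regions, revenue); // "sales by region"
const byCount = aggregateByLabel(regions);            // row counts per label

// Either result can be spread into a plain pie trace.
const pieTrace = { type: 'pie', ...byRevenue };
```

With built-in aggregation, this pre-processing step would move inside the pie trace itself.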

In principle you can already do this with an aggregate transform, but this seems like a common enough use case (and transforms have enough drawbacks) that both we and our users would benefit from it being built-in pie functionality.
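For reference, a transform-based version might look roughly like the trace below. This is a sketch, not a tested recipe: it assumes the aggregate transform accepts `'labels'` as its `groups` reference on a pie trace, which the comment above only says works "in principle". The sample data is made up, and no plotting call is made here.

```javascript
// Sketch of a pie trace that aggregates through the aggregate transform.
// The object would be handed to Plotly.newPlot on a page that loads plotly.js.
const trace = {
  type: 'pie',
  labels: ['East', 'West', 'East', 'North', 'West', 'East'],
  values: [120, 80, 30, 50, 20, 50],
  transforms: [{
    type: 'aggregate',
    groups: 'labels', // assumption: group by the trace's own labels array
    aggregations: [
      // Sum the values of all rows that share a label.
      { target: 'values', func: 'sum', enabled: true },
    ],
  }],
};
```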

Things to consider:

cc @etpinard @monfera

monfera commented 6 years ago

Lego-like construction

On one hand, bundling logic with view has benefits. A user who needs a histogram has the effortless task of just saying, type: 'histogram'. If she follows the docs and supplies proper attributes, she's in business.

On the other hand, we know that histogram is, conceptually, binning (a model) plus bar (a view). We also know that there is an infinite variety of models that combine nicely with bars; we could have waterfall, quantilebars (showing count or median of quantiles), etc. So I'm a bit uneasy with coupling models and views the way histogram does.

Yet histogram is by far the most common specialized aggregate bar chart, alongside or second only to a plain aggregate bar chart. What you ask about pies is equally true for bars, i.e. by analogy with histogram, we could have an aggregatebar which takes multivariate data, as parcoords or table does, and aggregates by some dimension.

Now it turns out that histogram is also an aggregatebar - where the aggregation is based on a tacked-on dimension we can just call bin.
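A rough sketch of that relationship, with hypothetical helper names (`countBy`, `toBarTrace`) that are not plotly.js API: both charts share one aggregation step and one view step, and the histogram differs only in deriving a bin dimension first. The bin width and sample data are assumptions for illustration.

```javascript
// Hypothetical building blocks; none of these names are plotly.js API.

// Generic model: count rows per value of some dimension.
function countBy(keys) {
  const counts = new Map();
  keys.forEach((k) => counts.set(k, (counts.get(k) || 0) + 1));
  return counts;
}

// Generic view input: turn counts into a plain bar trace.
function toBarTrace(counts) {
  return { type: 'bar', x: [...counts.keys()], y: [...counts.values()] };
}

// "aggregatebar": aggregate by a dimension already present in the data.
const activity = ['health', 'water', 'health', 'food'];
const aggregateBar = toBarTrace(countBy(activity));

// "histogram": the same pipeline, after tacking on a derived bin dimension.
const binSize = 5; // assumed bin width; bins start at 0
const xs = [1, 2, 2.5, 7, 8, 12];
const bin = xs.map((x) => Math.floor(x / binSize) * binSize); // left bin edge
const histogram = toBarTrace(countBy(bin));
```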

There are benefits to reflecting such structures in code and in the API. In code, it can cut a lot of duplication, because things build on one another (ofc I'm not saying histogram doesn't reuse bar). Fewer things to fix, modify, keep in sync or bundle. Yet the bigger thing is that, if it's exposed to the user, it allows her to construct her particular visualizations as if built from Lego (at least that's the pie in the sky theory). It also enables users to bring their own alternative Lego blocks, such as some binner more appropriate to them.

Why is it not enough to let users do their own preaggregation? If they have an alternative binner, they can just compute histogram heights and use our bar. True, it'll show up. But it'll lack granularity. For example, plotly.js will have no info on which scatter points to highlight when the user crossfilters on a bar, because the pre-aggregated input is opaque to it. So we need a solution which preserves the links between the original, atomic data and the glyphs or markers representing aggregates.
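One way to keep those links, sketched with hypothetical names (`groupRows`, `rowsForBar`) and made-up rows: have the aggregation step record which row indices went into each group, so a selected bar can be mapped back to atomic rows. This is an illustration of the idea, not how plotly.js does it.

```javascript
// Hypothetical helper: group row indices by key instead of discarding them.
function groupRows(keys) {
  const groups = new Map();
  keys.forEach((key, row) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(row);
  });
  return groups;
}

// Made-up rows for illustration.
const country = ['Guinea', 'Liberia', 'Guinea', 'Sierra Leone', 'Liberia', 'Guinea'];
const groups = groupRows(country);

// Each aggregate glyph carries its height and the rows it was built from.
const bars = [...groups].map(([label, rows]) => ({ label, height: rows.length, rows }));

// On a crossfilter selection of one bar, recover the atomic rows behind it.
function rowsForBar(label) {
  const bar = bars.find((b) => b.label === label);
  return bar ? bar.rows : [];
}

// Row indices that a linked scatter plot could highlight.
const highlighted = rowsForBar('Guinea');
```

A user who pre-aggregates outside the library would hand over only `label` and `height`, which is exactly the information loss described above.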

What's the context here? It's at least the dashboard (or its analogs: a Jupyter page or a scrollytelling webpage with multiple plots), because a bunch of interactions, especially things like crossfiltering or recovering currently highlighted aggregates in a small table, need to work.

The result is that we can consider our pipeline a directed acyclic graph which

This DAG is not necessarily a layered graph: the user may have a bunch of nodes in sequence along some paths, and a single transform or a direct rendering of data to view along others.
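To make the non-layered shape concrete, here is a small sketch of such a pipeline. The node table and the `evaluate` function are hypothetical: one path runs data through two model nodes before reaching a bar view, while another renders the same data straight to a scatter view.

```javascript
// Hypothetical node table: each node names its inputs and a pure function.
const nodes = {
  data: { deps: [], fn: () => [1, 2, 2.5, 7, 8, 12] },
  bin: { deps: ['data'], fn: (xs) => xs.map((x) => Math.floor(x / 5) * 5) },
  count: {
    deps: ['bin'],
    fn: (bins) => {
      const counts = new Map();
      bins.forEach((b) => counts.set(b, (counts.get(b) || 0) + 1));
      return counts;
    },
  },
  // Long path: data -> bin -> count -> barView
  barView: {
    deps: ['count'],
    fn: (counts) => ({ type: 'bar', x: [...counts.keys()], y: [...counts.values()] }),
  },
  // Short path: data -> scatterView, with no model node in between
  scatterView: {
    deps: ['data'],
    fn: (xs) => ({ type: 'scatter', mode: 'markers', y: xs }),
  },
};

// Evaluate a node after its dependencies, caching results so that shared
// upstream nodes (here: data) are computed once.
function evaluate(name, cache) {
  if (cache.has(name)) return cache.get(name);
  const node = nodes[name];
  const inputs = node.deps.map((dep) => evaluate(dep, cache));
  const output = node.fn(...inputs);
  cache.set(name, output);
  return output;
}

const cache = new Map();
const barTrace = evaluate('barView', cache);
const scatterTrace = evaluate('scatterView', cache);
```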

It can go further than that. Individual plots, even a lowly scatter, use aggregate statistics. For example, we compute min/max per axis so that we zoom to the right area. But there are other valid approaches, e.g. a fixed domain (which we support), or margining (fixed, relative or even deviation based), etc.
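These range policies can be sketched as interchangeable functions from data to a `[min, max]` pair. The `rangeStrategies` object and its member names are hypothetical; the only plotly.js attribute used is the axis `range` in the layout.

```javascript
// Hypothetical strategies: each returns a function from values to [lo, hi].
const rangeStrategies = {
  // Tight fit around the data (the min/max aggregate mentioned above).
  minMax: () => (vals) => [Math.min(...vals), Math.max(...vals)],
  // Fixed domain that ignores the data.
  fixed: (lo, hi) => () => [lo, hi],
  // Pad the data extent by a fraction of its span on each side.
  relativeMargin: (fraction) => (vals) => {
    const lo = Math.min(...vals);
    const hi = Math.max(...vals);
    const pad = (hi - lo) * fraction;
    return [lo - pad, hi + pad];
  },
};

const ys = [2, 4, 10]; // made-up sample data
const tight = rangeStrategies.minMax()(ys);               // [2, 10]
const fixedRange = rangeStrategies.fixed(0, 100)(ys);     // [0, 100]
const padded = rangeStrategies.relativeMargin(0.25)(ys);  // [0, 12]

// Any of the results can be applied as an explicit axis range.
const layout = { yaxis: { range: padded } };
```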

If we deconstruct what individual plots mean, we'll end up with a smallish vocabulary and a compact grammar to bind it together, out of which lots of things can be composed. For example, it's easy to swap an SVG Lego block for simple point rendering with a Canvas or WebGL Lego block, because the block serves a very limited, narrow, well-specified purpose with clean connections (interface) no matter where it is in the DAG. Basically this is a key idea behind ggplot2 and Vega too.
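A toy illustration of such a swappable block, assuming a shared `render(points, target)` interface invented for this sketch: the SVG block writes markup into its target, the Canvas block issues 2D context calls. A recording stand-in replaces the real canvas context so the sketch also runs outside a browser.

```javascript
// Assumed shared interface for this sketch: render(points, target).
const svgPoints = {
  render(points, target) {
    // target is any object with a writable `markup` string (hypothetical).
    target.markup = points
      .map(([x, y]) => `<circle cx="${x}" cy="${y}" r="2"/>`)
      .join('');
  },
};

const canvasPoints = {
  render(points, ctx) {
    // ctx needs only the three CanvasRenderingContext2D methods used here.
    points.forEach(([x, y]) => {
      ctx.beginPath();
      ctx.arc(x, y, 2, 0, 2 * Math.PI);
      ctx.fill();
    });
  },
};

// The rest of the pipeline does not care which block it was given.
function drawPoints(renderer, points, target) {
  renderer.render(points, target);
  return target;
}

const points = [[10, 20], [30, 40]];
const svgTarget = drawPoints(svgPoints, points, { markup: '' });

// Stand-in for a real 2D context: it only records the calls made on it.
const recorder = {
  calls: [],
  beginPath() { this.calls.push('beginPath'); },
  arc(...args) { this.calls.push(['arc', ...args]); },
  fill() { this.calls.push('fill'); },
};
drawPoints(canvasPoints, points, recorder);
```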

tl;dr

So in short, I think that a new pie with built-in aggregation might be useful for clients, but ideally we would separate the concern of the outer API - which biases things for high-frequency use and supplies a set of common widgets with sensible defaults - from a more technical, internal structure that maps conceptual relations based on what they are, while perhaps also exposing that structure to clients who'd like to build custom visualizations with bespoke interactions.

alexcjohnson commented 6 years ago

closed by #2117