plotly / plotly.js

Open-source JavaScript charting library behind Plotly and Dash
https://plotly.com/javascript/
MIT License

add _real_ stacked area charts [feature request] #1217

Closed salim-b closed 6 years ago

salim-b commented 7 years ago

I'm filing this issue as a gathering point for the feature request of real stacked area charts.

The current solution to create stacked area charts is to plot cumulative variables which has multiple drawbacks:

A real solution would be to have an argument like layout = {linemode: 'stack'}, same as there is for bar charts.
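For context, the cumulative workaround mentioned above can be sketched roughly as follows. The helper name `toCumulativeTraces` is hypothetical, and the sketch assumes every series shares the same x values in the same order, which is exactly the restriction that makes the workaround fragile.

```javascript
// Hypothetical helper: turns raw series into cumulative ones so that
// fill: 'tonexty' between consecutive traces looks like a stack.
// Assumes every series shares the same x values, in the same order.
function toCumulativeTraces(x, series) {
  const running = new Array(x.length).fill(0);
  return series.map((s, i) => {
    // running[j] holds the sum of all series up to and including this one
    const y = s.y.map((v, j) => (running[j] += v));
    return {
      type: 'scatter',
      mode: 'lines',
      name: s.name,
      x: x,
      y: y,
      // first trace fills down to zero, the rest fill down to the previous trace
      fill: i === 0 ? 'tozeroy' : 'tonexty'
    };
  });
}

const traces = toCumulativeTraces([1, 2, 3], [
  { name: 'a', y: [1, 2, 3] },
  { name: 'b', y: [2, 2, 2] }
]);
// traces[1].y is now [3, 4, 5]: the plotted (and hovered) values are the
// cumulative sums, not the original data.
// Plotly.newPlot(gd, traces);
```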

I'm aware that @etpinard stated some time ago:

We are planning on adding area charts to our list of trace types at some point this year. You are not the first person to ask for this feature.

Therefore I hope opening this issue won't be seen as an annoying harassment! Since it is an often requested feature, I think it important to have a place for users like me to gather all relevant information and potential progress on this.

Some more information:

treadmillian commented 7 years ago

What are the timescales for this requested feature? This is one reason why I've turned to Highcharts, which does this perfectly.

Reubentrapdoor commented 7 years ago

Been monitoring this one for a long time as well, hoping to see some progress. All the workarounds I've seen are way too buggy to use in any production environment.

nicolaskruchten commented 6 years ago

Here's a rundown of examples that show a reasonably well-specified way of dealing with misaligned x values, assuming we're stacking in the y dimension: https://docs.google.com/spreadsheets/d/1XZ2yEIN4S_Q-6c6AHr4F9LyKukQAcyg0gh-1VcumO9M/edit

nicolaskruchten commented 6 years ago

/cc @alexcjohnson

Excel:

image

Google Sheets:

image

Two different approaches to the drawing of lines where there is no data... i.e. in the second chart: Google Sheets draws a red line but Excel does not. Probably worth thinking carefully about the tradeoffs there, and ditto about what to show in hovers.

alexcjohnson commented 6 years ago

I don't think there's actually a difference between the two: Excel isn't drawing lines at all, only fills, so there's nothing to omit, but presumably you can turn lines on and then I bet they would be drawn the same way as Google's.

As @nicolaskruchten alludes to, the question of what to do with mismatched x values is the key sticking point here, the reason adding stacked area charts is not as easy for us as adding stacked bars.

Google sheets (and Excel, at least by default, I haven't looked in detail) has a simpler data model than we do: every series shares the same x data, so it's not possible to have mismatched x, the most you can do is have empty y at certain x values. They seem to treat those empties as zeros. That's certainly a plausible interpretation for certain data anyway, but not all, and it differs from how we handle scatter (line) in other contexts - where a missing y (or x for that matter) either leaves a gap or gets the line drawn straight from one valid point to the next, depending on the connectgaps setting.

Seems to me when we stack area charts we can internally fill in missing x values across all the stacked traces, and then there are perhaps three ways you might want to interpret gaps:

So if the second and third cases are covered by connectgaps, what about the first (which, to fit with Google & Excel, should be the default)? I suppose it could be a new connectgaps: 'zero' or something? There would also be an argument for making this a separate setting, so that x values with an invalid y ('', null, non-numeric) would be treated differently from x values that get inserted just because they're present in other data sets. Perhaps you'd like newly-inserted x values to get y=0 but invalid y to be treated as a gap?

I guess I can imagine cases where that would be the "most correct" way to display the data, though it might be more complexity than users really want. On the other hand making a new setting for this would allow us to avoid turning connectgaps into another "boolean plus a string" enumerated attribute, as well as avoiding extra logic around its default value. And mostly people would just use the default value of this new setting. So what could this new attribute be? How about stackgaps: ('zero' (dflt)|'gap'|'interpolate')?
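A rough sketch of how the three proposed stackgaps values could treat x values that get inserted because they appear in other traces. The function names `unionX` and `fillGaps` are hypothetical and not plotly.js internals. The sketch assumes numeric x, handles only x values absent from a trace (not invalid y), and falls back to 0 outside a trace's own x range in 'interpolate' mode, which is an assumption rather than anything decided in this thread.

```javascript
// Collect the sorted union of x values across all traces in a stack.
function unionX(traces) {
  const all = new Set();
  traces.forEach(t => t.x.forEach(v => all.add(v)));
  return Array.from(all).sort((a, b) => a - b);
}

// Re-express one trace on the shared x grid.
// mode: 'zero' | 'gap' | 'interpolate' (the values proposed for stackgaps).
function fillGaps(trace, xs, mode) {
  const known = new Map(trace.x.map((v, i) => [v, trace.y[i]]));
  const sortedX = trace.x.slice().sort((a, b) => a - b);
  return xs.map(x => {
    if (known.has(x)) return known.get(x);
    if (mode === 'zero') return 0;
    if (mode === 'gap') return null;
    // 'interpolate': linear between the nearest known neighbours
    let lo = null, hi = null;
    for (const v of sortedX) {
      if (v < x) lo = v;
      else if (v > x) { hi = v; break; }
    }
    if (lo === null || hi === null) return 0; // assumption: no extrapolation
    const t = (x - lo) / (hi - lo);
    return known.get(lo) + t * (known.get(hi) - known.get(lo));
  });
}

const traceA = { x: [1, 2, 3], y: [10, 20, 30] };
const traceB = { x: [1, 3], y: [4, 8] };
const sharedX = unionX([traceA, traceB]);               // [1, 2, 3]
const asZero = fillGaps(traceB, sharedX, 'zero');        // [4, 0, 8]
const asGap = fillGaps(traceB, sharedX, 'gap');          // [4, null, 8]
const asInterp = fillGaps(traceB, sharedX, 'interpolate'); // [4, 6, 8]
```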

etpinard commented 6 years ago

then there are perhaps three ways you might want to interpret gaps:

I can't think of any other scenario. Thanks for writing those down in detail :ok_hand:

How about stackgaps: ('zero' (dflt)|'gap'|'interpolate')?

To my eyes, adding a new attribute for this is a no-brainer :+1:

Booleans-plus-a-string aren't great, but also any enumerated attribute that has one or multiple values that can have no effect depending on other things in that trace should be avoided when possible.

I was going to suggest fillgaps before reaching the end of your comment. But stackgaps is better as it makes it clear that this new attribute has an effect only for stacked scatter traces.

That said, could this new attribute help us alleviate some current less-than-ideal fill problems? There are many open issues about this: https://github.com/plotly/plotly.js/issues/1132, https://github.com/plotly/plotly.js/issues/1867, https://github.com/plotly/plotly.js/issues/113, https://github.com/plotly/plotly.js/issues/1205 and possibly others.


As an aside, this problem of mismatching x in stacked area charts appears very plotly-specific. Both MATLAB and mpl assume the same independent coordinates for all their stacked area "y" arrays.

etpinard commented 6 years ago

Writing down some questions I have about stacked area hover:

So, perhaps we could add new hoverinfo flags e.g. 'xstack', 'ystack' and making the hoverinfo default be 'x+ystack+text+name' for stacked scatter traces? These new hoverinfo keys could be used in stacked bar charts too.

alexcjohnson commented 6 years ago

That said, could this new attribute help us alleviate some current less-than-ideal fill problems? There are many open issues about this: #1132, #1867, #113, #1205 and possibly others.

mmm, there's some interaction between this issue and some of those - particularly #1132 and #1205 - but I think those are pretty much all implementation issues, not problems of specifying the desired behavior.

As an aside, this problem of mismatching x in stacked area charts appears very plotly-specific. Both MATLAB and mpl assume the same independent coordinates for all their stacked area "y" arrays.

True, that's because both of those create all stacked lines in a single function call, so conceptually as a single object. We could in principle do the same, treating the entire stack as a single trace, but we can do better than that. I've certainly encountered plenty of situations like the population example I described, where I wanted to add up data that didn't come with matching x values, and I'd have loved it if this just worked ™️

Should we display the "true" y datum or the stacked (i.e. cumulative) y value?

hoverinfo: 'x+y+total' (or 'sum' or something). Where 'y' is the value of the trace you're focused on... I'm not quite sure whether 'total' is the partial or complete sum... I guess perhaps we want to allow both ('x+y+partial+total'?) but I'm not sure which should be the default? When you're hovering on a single point I can see wanting to know the sum of everything up to and including that point (that's where your cursor is after all... but there may also be particular subtotals you're interested in), as well as the total of everything in the stack so you can quickly see "this item is 10% of the total" (maybe we even want hoverinfo: 'percent' or something?)

Similarly, should we include the y datum or the stack value in the plotly_(hover|click) event data?

I think the y field in the event data should be the y datum but we should also include subtotal and total as separate fields.

These new hoverinfo keys could be used in stacked bar charts too.

😍

nicolaskruchten commented 6 years ago

I have a concern about gap ... How would that render? If I have a single data point for a given trace with gaps on either side, then it will look like an empty quadrilateral with a single dot? Otherwise it would in effect be the same as zero no?

etpinard commented 6 years ago

I guess perhaps we want to allow both ('x+y+partial+total'?)

I'm a big fan of this. Adding flags 'partial', 'percent' and 'total' would cover a lot of use cases :ok_hand:

I think the y field in the event data should be the y datum

I agree 100% here. Moreover, we could add partial, total and percent keys in the event data for symmetry with the hoverinfo flags.
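To make the proposed fields concrete, here is a minimal sketch of how y, partial, total and percent could be computed for one hovered point. The helper `hoverFields` is hypothetical; it assumes the traces are already aligned on a shared x grid and that 'partial' means the sum up to and including the hovered trace, which the thread leaves open.

```javascript
// stackY[k][i] is the raw y datum of trace k (bottom first) at shared x index i.
function hoverFields(stackY, traceIndex, pointIndex) {
  let partial = 0;
  let total = 0;
  stackY.forEach((ys, k) => {
    const v = ys[pointIndex] || 0;      // treat missing values as 0 in this sketch
    if (k <= traceIndex) partial += v;  // sum up to and including the hovered trace
    total += v;                         // sum over the whole stack
  });
  const y = stackY[traceIndex][pointIndex];
  return { y, partial, total, percent: total ? (100 * y) / total : 0 };
}

const stack = [[1, 2], [3, 2], [6, 4]];
const fields = hoverFields(stack, 1, 0);
// fields → { y: 3, partial: 4, total: 10, percent: 30 }
```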

alexcjohnson commented 6 years ago

I have a concern about gap ... How would that render? If I have a single data point for a given trace with gaps on either side, then it will look like an empty quadrilateral with a single dot? Otherwise it would in effect be the same as zero no?

right - an "orphan point" we've called that in the past - it doesn't make a line segment either, so if you don't show markers you won't see anything.

etpinard commented 6 years ago

right - an "orphan point" we've called that in the past

Yeah issues with orphan points go way back -> https://github.com/plotly/streambed/issues/2577

nicolaskruchten commented 6 years ago

OK so what happens when you have a stack of areas with an orphan point in the middle (and you can't reorder because, say, they all have misaligned orphaned points)?

alexcjohnson commented 6 years ago

If you're not filling gaps (with zeros or interpolations), then anything above a gap gets discarded - that's what I meant by "probably we'd want all gaps to propagate upward to the top of the stack."
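A minimal sketch of that upward propagation, assuming the traces are already on a shared x grid and that null marks a gap. The name `stackWithPropagation` is hypothetical.

```javascript
// ys: array of y arrays on a shared x grid, bottom trace first; null marks a gap.
function stackWithPropagation(ys) {
  const n = ys[0].length;
  const base = new Array(n).fill(0);        // running stacked height per x
  const broken = new Array(n).fill(false);  // true once a gap was seen at this x
  return ys.map(trace => trace.map((v, i) => {
    if (v === null || broken[i]) {
      broken[i] = true; // everything above a gap is discarded
      return null;
    }
    base[i] += v;
    return base[i];
  }));
}

const propagated = stackWithPropagation([[1, 1, 1], [2, null, 2], [3, 3, 3]]);
// propagated → [[1, 1, 1], [3, null, 3], [6, null, 6]]
```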

nicolaskruchten commented 6 years ago

Wow, that seems... draconian. So much so that I'm not sure anyone would really want to use it?

nicolaskruchten commented 6 years ago

The alternative I was envisioning was something like this (C3js output), which is also quite problematic:

image

alexcjohnson commented 6 years ago

Wow, that seems... draconian. So much so that I'm not sure anyone would really want to use it?

This wouldn't be the default for gaps introduced by the stacking process - the default would match gsheets and excel and fill with zeros, which if I'm interpreting your party/province plot right is probably what you'd want to have there, right? Missing items are not unknown data, they're cases of zero count.

But I can certainly imagine doing an analysis and not wanting to make any assumptions about missing data, especially if that missing data is explicit in the data as an x with no/invalid y. And really the only way to do that is to throw out the unstackable data.

nicolaskruchten commented 6 years ago

I understand where you're going with this, certainly it makes sense from an SQL-like null-propagation point of view. One worry I have around both interpolate and gaps is that we support neither in our stacked-bar implementation as far as I know.

One other salient point of comparison between stacked bars and stacked areas is the handling of negative values. Google Sheets/Excel basically handles this by overlapping/"folding" the area downwards, which is sort of how our stack mode operates, but I'm not sure I can imagine an area equivalent of our relative mode :)

Google sheets:

image

Excel:

image

nicolaskruchten commented 6 years ago

And FWIW, Highcharts interrupts the area stacking for missing values and 'folds' for negative values:

image

nicolaskruchten commented 6 years ago

Final note for the weekend: it would be nice to have the equivalent of barnorm to do "100% stacked area" charts :)

alexcjohnson commented 6 years ago

One worry I have around both interpolate and gaps is that we support neither in our stacked-bar implementation as far as I know.

The key difference, from my standpoint, between stacked bars and stacked area is the physical connection between subsequent points. Which is why interpolation makes little sense for bars but a lot of sense for area, because in most cases you're showing an interpolation already. The big exception to this is a categorical x axis like in your party/province chart above, where the lines are drawn really just as a form of object constancy (the object being each province).

The argument in favor of gaps could also be applied to bars - in as far as the total is important and you don't want to make any assumptions about missing data, you could say that we should truncate a bar stack at a missing value. The problem with that is that with bars you can't tell the difference between a truncated stack and just all zero entries above it; whereas with area, you clearly see the line(s) stop if there's a gap vs drop smoothly to zero if all higher entries are zeros. I don't see a good option to disambiguate this with bars.

One other salient point of comparison between stacked bars and stacked areas is the handling of negative values. Google Sheets/Excel basically handles this by overlapping/"folding" the area downwards, which is sort of how our stack mode operates, but I'm not sure I can imagine an area equivalent of our relative mode :)

Yes, seems like folding is the way to go here, and we share (by default) Google's semitransparent fills, which helps a bit with interpreting these folds. I guess in principle you could imagine a relative mode where the positive area goes to zero at the same time as the negative area grows:

image

Seems a bit weird though.

Highcharts interrupts the area stacking for missing values

That's an interesting option - a gap for the series that has the gap, then make the same area you would have made for all higher traces but slide them down into the gap. It's a little weird that it makes it look like there's something strange in the data for the higher traces at the points next to the gap, but at least a) you see as much of the total as is known, b) you see that something is weird so you are alerted not to infer too much from the data around there, and c) if you look carefully enough at it you can figure out which data point is missing. So yeah, I guess I like it, I'd be fine using that behavior for stackgaps: 'gap' mode. One other thing to note, all markers end up in the same places as they would with stackgaps: 'zero' - which isn't necessarily an argument in favor of it, but may be nice for implementation.
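For comparison with discarding everything above a gap, the Highcharts-like behavior can be sketched like this. It is an interpretation of the description above, not Highcharts or plotly.js code, and `stackWithInterrupt` is a hypothetical name. Only the series with the missing value breaks; traces above keep stacking on whatever is known.

```javascript
// ys: array of y arrays on a shared x grid, bottom trace first; null marks a gap.
function stackWithInterrupt(ys) {
  const base = new Array(ys[0].length).fill(0);
  return ys.map(trace => trace.map((v, i) => {
    if (v === null) return null; // this series breaks here...
    base[i] += v;                // ...but traces above keep stacking
    return base[i];
  }));
}

const interrupted = stackWithInterrupt([[1, 1, 1], [2, null, 2], [3, 3, 3]]);
// interrupted → [[1, 1, 1], [3, null, 3], [6, 4, 6]]
// The top trace lands at 4 in the middle, the same position it would have
// if the missing value had been treated as zero.
```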

nicolaskruchten commented 6 years ago

I'd be OK with stackgaps: (inferzero | interpolate | interrupt) as an API, personally, with the final option being the Highcharts behaviour rather than not rendering all points above. I'll be curious to see what orphan points will look like there. Full disclosure: I'd be fine with not implementing that final mode in the first version of this thing and leaving it as a nice-to have... just having interpolate mode already puts us ahead of other charting systems IMO :)

Other than that, what could the API look like? Is this a new trace type? Would this be a layout level attribute like barmode that applies to all matching traces, regardless of subplots? Where would we specify the equivalent of barnorm?

alexcjohnson commented 6 years ago

stackgaps: (inferzero | interpolate | interrupt)

👍 though in attribute values we've tended to include spaces between words, so it could be stackgaps: ('infer zero' | 'interpolate' | 'interrupt')

I'd be fine with not implementing that final mode in the first version of this thing and leaving it as a nice-to have

Absolutely - the goal here is just to make sure the API will support all the options we anticipate, but we can start with the default behavior.

I'll be curious to see what orphan points will look like there.

Not great... you won't see them at all unless markers are displayed. Another option we could consider, that would be better for orphan points and perhaps alleviate my concern about the Highcharts behavior making it look like the neighbors are weird rather than the missing point itself: draw the fill halfway to the missing point before breaking it (probably following the same path that would be taken by 'interpolate', though you could imagine other options like extrapolating as a constant from either side), something like:

image

This way orphan points would generate a fill spanning from halfway to the preceding missing point through halfway to the following missing point, essentially like bars unless it's the first or last point.

API

tldr this is what I'm proposing:

data: [
  {
    type: 'scatter',
    x: [...], y: [...],
    stackgroup: '1', // this (any non-empty value) is what enables stacking
    orientation: 'h', // like horizontal stacked bars - along with stackgroup this sets default fill attr
    groupnorm: 'percent',
    stackgaps: 'interpolate'
  },
  {
    x: [...], y: [...],
    stackgroup: '1',
    orientation: 'h',
    // groupnorm here would be ignored unless omitted above
    stackgaps: 'infer zero'
  }
],
layout: {
  // could specify groupnorm, stackgaps here instead if uniform
}

There would be advantages to making the whole stack into a single trace, with an array similar to dimensions from parcoords/splom, which is effectively how all the others mentioned above do it.

But I still think we're better off leaving this as a collection of scatter traces:

Re: a barmode analog - we talked about the current limitations of barmode (grouped stacks, and subplots with different styles). Seems like (as we've discussed before) in the bar case we can alleviate that with a trace-level attribute like stackgroup, which would have arbitrary values and group matching items like legendgroup does (in this case it would also group by subplot, and I guess by orientation, see below). As far as I can see, the same logic should apply to stacked area. We could imagine a layout-level attribute that says "stack all scatter traces" but it might be cleaner to just require a stackgroup attribute to activate stacking.

stackgaps could be layout-level or trace-level. I feel like users would generally want to provide that setting graph-wide but implementation-wise it'll be just as easy to use the layout-level attribute as the default for the trace-level one.

Re: a barnorm analog - currently barnorm is only a layout-level attribute. Per its name it only applies to bar traces, but note that it applies to barmode: 'stack' and 'group' bar traces, i.e. anything but barmode: 'overlay' (in that case there's no "group" so you'd always be normalizing by the sum of a single item). I'm tempted to suggest an attribute layout.groupnorm to supersede barnorm and include stacked area - though of course there are groups this should not apply to, box and violin 🤔. But then the question is how to specify a per-group normalization. So I can see two ways to specify this:

The latter is arguably more correct, as there's no ambiguity, there's exactly one place to specify one value. But it seems heavy and potentially confusing to users. Actually it becomes even more complicated with the planned extension for bars - if we have grouped stacks, you might want to normalize so each stack reaches 100%, or you might want to normalize so the sum of all stacks in each group is 100%. The former seems like the more natural (and more common) case, the latter could perhaps be another barnorm value, but note that this setting would in principle apply per subplot, not per stackgroup. That seems like a more concrete strike against layout.groups.
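Whichever place the setting ends up living, the per-stack normalization itself is simple. A minimal sketch, assuming aligned y arrays and the same '' / 'fraction' / 'percent' values that barnorm uses today; `normalizeStack` is a hypothetical name.

```javascript
// ys: array of y arrays on a shared x grid; norm: '' | 'fraction' | 'percent'.
function normalizeStack(ys, norm) {
  if (!norm) return ys.map(t => t.slice()); // no normalization: plain copy
  const scale = norm === 'percent' ? 100 : 1;
  // total of the whole stack at each x position
  const totals = ys[0].map((_, i) => ys.reduce((sum, t) => sum + t[i], 0));
  return ys.map(t => t.map((v, i) => (totals[i] ? (scale * v) / totals[i] : 0)));
}

const normalized = normalizeStack([[1, 2], [3, 2]], 'percent');
// normalized → [[25, 50], [75, 50]], so each x position sums to 100
```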

One more thought: do we want to allow stacking horizontally, not just vertically? Perhaps an orientation: ('v'|'h') to match bars, which also makes switching between them easier. I thought about if we could enable this just by setting fill: 'tonextx' instead of the normal fill: 'tonexty', but that seems a little too magical, and would also cause problems if you wanted to include an unfilled trace in the stack, perhaps as a baseline that everything else is stacked on. I suppose though we could let orientation set the default fill, since orientation has no purpose without stacking.

More data issues

Two more related questions about x values (or y if horizontal):

nicolaskruchten commented 6 years ago

I love the half-area rendering for interrupt mode. I think it's a really elegant solution.

Re duplicates and ordering, I would want to just stick with whatever we currently do for filling, which can lead to some crazy results, but at least we're not introducing a whole new way of filling...

image

Re the layout vs trace location for these options, I would favour looking at this in a layered way with reference to the current behaviour of layout.barmode... We could start by introducing layout.scattermode with the current behaviour being the default, called "overlay" (ditto layout.scatternorm, default ""). With something like this in place then stacked areas are doable (perhaps with an extra trace.orientation to control the direction) with all the same limitations as bars: only one stack per subplot, all subplots have the same mode etc. We could introduce layout.scattergap for the gap handling, or try to roll it into connectgaps

With that done, we could look at tackling these limitations in a unified bar+scatter way. It seems to me like we want to introduce the notion of "sub stacks", especially when barmode/scattermode is set to something other than stack (in which case it's still well-defined but mostly redundant). So we could add trace.substack and barnorm/scatternorm = substack percent or something to account for the case Alex mentioned above. This would allow "grouped stacks" of bars as well as, say, one subplot with stacked bars and one with grouped bars. The normalization mode would still be figure-wide under this conception.

alexcjohnson commented 6 years ago

Re duplicates and ordering, I would want to just stick with whatever we currently do for filling, which can lead to some crazy results, but at least we're not introducing a whole new way of filling...

I don't think we can get away from doing something new here.

If you have unordered data below and ordered above, you'll be stacking as though the values below were ordered, creating entirely new strange behavior. Anyway, when you're just filling but not stacking there are legitimate reasons to have unordered data, but it seems to me that when you're stacking you've made a strong statement that y is a function of x (or vice versa if horizontal) so we can only help the user by sorting.

If you have duplicates below and unique values above, which one do you add onto? One way to arrive at the solution I gave above is to imagine the duplicate points start out at slightly different x and you take the limit as they push to the same x. I can imagine this arising in data for example if you've sampled some quantity in time, and took multiple samples on one day but only recorded the date. Seems to me showing (and stacking on) all the points is the most faithful we can be to the input data.

alexcjohnson commented 6 years ago

BTW the gradient on part of the orange fill in @nicolaskruchten 's comment above seems to be a Chrome + Mac Retina screen rendering bug - fiddling around with similar multiply-self-crossing paths I can get all manner of related errors on my laptop's main (retina) screen, but they all look fine when I put the window on my second monitor (non-retina) or in FF or Safari on the retina screen. I'm going to ignore it and hope Chrome fixes it.

etpinard commented 6 years ago

Throwing in my two cents in decreasing order of importance:

data = [{
  type: 'bar',
  // ...
  bargroup: '0',
  // these below would apply to all traces of this bargroup
  bargap: 0.1,
  bargroupgap: 0.05,
  barnorm: 'percent'
}, {
  type: 'bar',
  // ...
  bargroup: '0',
  // would not coerce 'bargap', 'bargroupgap', ...
  // if not first trace in bargroup,
  // Plotly.validate would pick this up!
}]

which only adds one new attribute; bargap and friends are simply moved from the layout to the data[0] attribute containers.

{
  stackgroup: '1',
  // only coerce if stackgroup is set
  stack: {
    orientation: 'v',
    groupnorm: 0.1,
    gaps: '...'
  }
}

nicolaskruchten commented 6 years ago

Looks like @nicolaskruchten's main argument for new "stack" trace layout attributes is symmetry with bar traces. We could perhaps add "real" trace attributes for bar in a preliminary PR to set up a symmetry

Yes, indeed my proposal was primarily motivated by symmetry. If we could implement trace-level stacking control for bar first (which is desirable in and of itself, as it allows "grouped stacks" and per-subplot stack-vs-group control), then I would be totally fine with stacked-areas not having layout-level attributes and just following the new bar pattern instead.

I'm also fine with the sorting/not just reusing the existing fill behaviour.

alexcjohnson commented 6 years ago

I'm not a fan of trace "layout" attributes. ... I would even vote for deprecating all trace layout attributes in v2.

That's a pretty strong statement! But I think I can get behind it. Thinking through the details of individual use cases there are still a number of decisions to make, but I think we can work it out. That said...

(second trace) would not coerce 'bargap', 'bargroupgap', ... if not first trace in bargroup, Plotly.validate would pick this up!

My concern about this is its impact on reordering traces - which I suspect is fairly common in exactly these cases where traces within a group interact with each other. If I have a lot of stacked items there may be different ways I want to organize them, and if this resulted in moving the first trace out of its spot, I'd need to also move the group attributes to the new first trace.

What if we just take the first value we find for these attributes, looking at every trace in the group, and apply that value to all of them in fullTrace? Then Plotly.validate would naturally complain if two traces contained different values but not if two contained the same value. Also think about hiding the first trace in the group - seems like these attributes should still apply even from an explicitly visible: false trace.
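The "first value we find" rule can be sketched as below. This is an illustration only: `resolveGroupAttr` is a hypothetical helper, not the plotly.js supplyDefaults machinery, and bargroup is the attribute name proposed in this thread rather than an existing one. Note that the hidden first trace still contributes its value.

```javascript
// Resolve a group-level attribute: the first explicit value found in any
// trace of the group wins, regardless of trace order or visibility.
function resolveGroupAttr(traces, groupKey, attr, dflt) {
  const resolved = {};
  const conflicts = []; // indices of traces whose value disagrees with the winner
  traces.forEach((t, i) => {
    const g = t[groupKey];
    if (g === undefined || t[attr] === undefined) return;
    if (!(g in resolved)) resolved[g] = t[attr];
    else if (resolved[g] !== t[attr]) conflicts.push(i);
  });
  // Build the "full" traces with the resolved value applied to every member.
  const full = traces.map(t => {
    const g = t[groupKey];
    if (g === undefined) return Object.assign({}, t);
    return Object.assign({}, t, { [attr]: g in resolved ? resolved[g] : dflt });
  });
  return { full, conflicts };
}

const input = [
  { bargroup: '1', visible: false, bargap: 0.1 },
  { bargroup: '1' },
  { bargroup: '1', bargap: 0.3 }
];
const result = resolveGroupAttr(input, 'bargroup', 'bargap', 0.2);
// result.full.map(t => t.bargap) → [0.1, 0.1, 0.1]
// result.conflicts → [2], the trace a validator would complain about
```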

I think using orientation in a scatter trace would be confusing. ... We would again lose symmetry with bar though.

I feel like symmetry with bar - when the function is the same which I think it is here - is worth a good deal, not just in terms of simplifying the editor as folks toggle between bar and area, but from a straight plotly.js user perspective as well, not having to learn more attribute names. Would it suffice to include in its description "applies only to stacked area traces"?

Similarly, I would prefer using stackgroupnorm or stacknorm.

Again I feel like the function is the same so we should use a name that works for both bar and scatter. Right now we have barnorm, which needs the bar qualifier because it's in layout, but once it's in the trace it wouldn't need that. But it's not necessarily a stack normalization, it can apply to grouped bars as well. I might have just called it norm, but we have histnorm in histogram traces, which handles normalization across bars within one trace, and in fact you can currently use trace.histnorm alongside layout.barnorm, for example to show the relative densities of two distributions. It's a little tricky to interpret, since you're normalizing twice across different axes of the data, but it does work:

Plotly.newPlot(gd,[{
  x: [1,1,1,2,2,3], type: 'histogram', histnorm: 'probability'
},{
  // eg 2 results in 50/50 because 2 is one third of the samples in each trace
  x: [1,1,1,1,2,2,2,2,3,3,3,3], type: 'histogram', histnorm: 'probability'
}],{
  barmode: 'stack', barnorm: 'percent'
})

image
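The arithmetic behind that double normalization can be checked by hand. The sketch below assumes each distinct x value gets its own bin (the real histogram autobinning may differ) and uses a hypothetical helper `probabilityHistogram`.

```javascript
// histnorm: 'probability' within one trace: bin count / number of samples.
function probabilityHistogram(xs, bins) {
  return bins.map(b => xs.filter(v => v === b).length / xs.length);
}

const bins = [1, 2, 3];
const p1 = probabilityHistogram([1, 1, 1, 2, 2, 3], bins);                   // 3/6, 2/6, 1/6
const p2 = probabilityHistogram([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3], bins); // 4/12 each

// barnorm: 'percent' across the two traces at each bin.
const share1 = p1.map((v, i) => (100 * v) / (v + p2[i]));
const share2 = p2.map((v, i) => (100 * v) / (v + p1[i]));
// share1 ≈ [60, 50, 33.3] and share2 ≈ [40, 50, 66.7]:
// x = 2 comes out 50/50 because it is one third of the samples in each trace.
```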

So if histnorm is "normalization of the histogram binning results", what do we call "normalization of the trace grouping that we either stacked or grouped side-by-side"? Again, since functionally it's equivalent whether we're talking bars or stacked area I'd like to use the same name for both. If we didn't already use barmode: 'group' it could easily be groupnorm; maybe we just use that, I can't really find anything else nice... setnorm? combonorm? batchnorm? possenorm 🤠?

We could alternatively suggest using a sort transform

TBH I can't really figure out a stacking algorithm that would make sense without sorting, except I guess for the very top trace, so I think sorting needs to be baked in.

etpinard commented 6 years ago

That's a pretty strong statement!

Yeah, I'm aware :muscle:

My concern about this is its impact on reordering traces ... Also think about hiding the first trace in the group - seems like these attributes should still apply even from an explicitly visible: false trace.

Very good points here! You're absolutely right, taking the first value we find (as opposed to the value of the first trace in the group) is what we want to do. Moreover, perhaps these "group" attributes could be coerced even when visible is set to or inferred false. So that with e.g.

data = [{
  // ...
  bargroup: '1',
  bargap: 0.1
}, {
  bargroup: '1'
}, {
  bargroup: '1'
}]

and toggling visible true/false of data[0] won't affect the bargap for traces in data[1] and data[2].

I feel like symmetry with bar ... but from a straight plotly.js user perspective as well, not having to learn more attribute names. Would it suffice to include in its description "applies only to stacked area traces"?

These are valid points. You're right that trying to reuse same attribute names across trace types probably decreases the learning curve for users. As for answering when the attributes are valid, we should encourage users to look up the descriptions on https://plot.ly/javascript/reference/ and use Plotly.validate.

Perhaps to make the "applies only to stacked area traces" part more obvious in the attribute descriptions, we could add a 'stack' flag under the scatter mode attribute?

alexcjohnson commented 6 years ago

Closed by #2960