Closed etpinard closed 6 years ago
Add a new do-it-all splom (and possibly a splomgl too) trace type that generates its own internal scatter traces and its own axes, with an API similar to parcoords:
trace = {
dimensions: [{
values: [/* */],
// some scatter style props ...
// some axis props reused from cartesian axes
}],
// some splom-wide options e.g.:
showdiagonal: true || false,
showupperhalf: true || false,
showlowerhalf: true || false,
direction: 'top-left-to-bottom-right' || 'bottom-left-to-top-right',
// ...
}
Port make_subplots and append_traces from the Python API to plotly.js (docs). For example:
var Plotly = require('plotly.js')
var fields = [
[/* */],
[/* */],
// ...
]
var layout = Plotly.makeSubplots({rows: fields.length, cols: fields.length})
var data = []
for (var i = 0; i < fields.length; i++) {
for (var j = 0; j < fields.length; j++) {
var trace = {
mode: 'markers',
x: fields[i],
y: fields[j]
}
Plotly.linkToSubplot(trace, i, j)
data.push(trace)
}
}
Plotly.newPlot(gd, data, layout)
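To make the subplot-linking idea a little more concrete, here is a minimal sketch of what a helper like the proposed Plotly.linkToSubplot might do. Neither Plotly.makeSubplots nor Plotly.linkToSubplot exists in plotly.js; axisId and linkToSubplot below are hypothetical names, and the sketch assumes that every column shares one x axis and every row shares one y axis.

```javascript
// Hypothetical helper, not part of plotly.js: builds a cartesian axis ID
// ('x', 'x2', 'x3', ...) from a letter and a zero-based index.
function axisId(letter, index) {
  return letter + (index ? index + 1 : '')
}

// Hypothetical stand-in for the proposed Plotly.linkToSubplot.
// Assumption: subplots in a column share an x axis, subplots in a row share a y axis.
function linkToSubplot(trace, row, col) {
  trace.xaxis = axisId('x', col)
  trace.yaxis = axisId('y', row)
  return trace
}

var fields = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
var data = []
for (var i = 0; i < fields.length; i++) {
  for (var j = 0; j < fields.length; j++) {
    data.push(linkToSubplot({mode: 'markers', x: fields[j], y: fields[i]}, i, j))
  }
}

// Every field ends up referenced by 2 * fields.length traces (once per column
// as x, once per row as y), which is the duplication once the figure is serialized.
console.log(data.length, data[5].xaxis, data[5].yaxis)
```

The trace count grows with the square of the number of fields, which is why this approach alone does not address the performance concern.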
This could be combined with solution 2 to solve the data-array-duplication problem. But this would also require some backend work for plot.ly support.
In short, we could add a new top-level argument to Plotly.newPlot and Plotly.react:
var columns = [
{name: 'col 0', values: [/* */]},
{name: 'col 1', values: [/* */]},
// ...
]
// unfortunately, in this paradigm columns should really be labeled data,
// and data -> traces
var data = [{
x: 'col 0',
y: 'col 1'
}, {
x: 'col 1',
y: 'col 0'
}]
Plotly.newPlot(gd, {
columns: columns,
data: data,
layout: {}
})
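As a rough illustration of how such column references could be resolved before plotting, the sketch below swaps column names for the shared arrays. resolveColumns is a hypothetical helper, not a plotly.js API, and it assumes that only the x and y attributes may hold column names.

```javascript
// Hypothetical resolver for the proposed `columns` argument; not a plotly.js API.
// Assumption: only the attributes listed in DATA_KEYS can reference a column.
var DATA_KEYS = ['x', 'y']

function resolveColumns(columns, traces) {
  var byName = {}
  columns.forEach(function(col) { byName[col.name] = col.values })

  return traces.map(function(trace) {
    var out = {}
    Object.keys(trace).forEach(function(key) {
      var val = trace[key]
      var isRef = DATA_KEYS.indexOf(key) !== -1 &&
        typeof val === 'string' &&
        Object.prototype.hasOwnProperty.call(byName, val)
      // A reference is replaced by the column's array (same object, no copy);
      // everything else is passed through untouched.
      out[key] = isRef ? byName[val] : val
    })
    return out
  })
}

var columns = [
  {name: 'col 0', values: [1, 2, 3]},
  {name: 'col 1', values: [4, 5, 6]}
]
var resolved = resolveColumns(columns, [
  {x: 'col 0', y: 'col 1'},
  {x: 'col 1', y: 'col 0'}
])
console.log(resolved[0].x === resolved[1].y) // true: both point at the same array
```

Because the traces share array references, the input JSON carries each column once, although later calc/plot steps may still copy the data per trace.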
I think it's clear we want to encapsulate a splom in a single trace, like solution 1. Solution 2 won't give the necessary performance benefits. Solution 3 may give some of the performance we need, and may be useful for more generalized trace linking in the future (for example, things like 2dhistogram_contour_subplots where the x and y data are duplicated in the scatter and histogram2dcontour traces, then x and y each get another copy in the 1D histograms), but it will still suffer from duplication at the calc/plot level, which I suspect will be prohibitive for us. Likewise, it seems to me it's only reasonable to make this a WebGL type.
The question in my mind is whether we can do it by linking the splom trace to regular cartesian axes, and using it to tailor the defaults for those axes, or if we need to have even the axes encapsulated in the trace itself. If we can do the former, then we retain the flexibility to display other traces on those same subplots: extra data that we only have for one attribute pair, for example, or a curve fit, or some different type of display on the diagonal. Or even another splom that might have a disjoint set of dimensions from the first (might be a huge headache, but see below for more thoughts):
trace = {
dimensions: [{
values: [/* */],
name: 'Sepal Width', // used as default x/y axis titles
xaxis: 'x' | 'x2' ... // defaults to ith x axis ID for dimension i
yaxis: 'y' | 'y2' ...
}],
marker: {
// just like scatter, and all the same ones are arrayOk.
// goes outside the `dimensions` array because the same data point should get
// the same marker in all subplots.
},
// domain settings - not used directly, just fed into the defaults for all the
// individual x/y axis domains
domain: {
// total domain to be divided among all x / y axes
x: [0, 1],
y: [0, 1],
// blank space between x axes, as a fraction of the length of each axis
// possibly xgutter and ygutter?
gutter: 0.1
},
// some splom-wide options e.g.:
// maybe turn these into a flaglist 'upper+lower+diagonal'?
// these and related attrs will affect the default x/y axis anchor and/or side attributes
showdiagonal: true || false,
showupperhalf: true || false,
showlowerhalf: true || false,
// maybe xdirection and ydirection?
direction: 'top-left-to-bottom-right' || 'bottom-left-to-top-right',
// ...
};
layout = {
xaxis: { /* overriding any of the defaults set by SPLOM */ },
xaxis2: { /* */ },
xaxis3: { /* */ },
... ,
yaxis: { /* */ },
...
};
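The domain and gutter attributes proposed for the trace could feed the per-axis domain defaults roughly as follows. This is a sketch under one reading of the comment in the proposal (the gutter is the blank space between neighbouring axes, expressed as a fraction of the length of each axis); axisDomains is a hypothetical name, and none of this is existing plotly.js behaviour.

```javascript
// Hypothetical helper: splits a total domain among n axes.
// Assumption: `gutter` is the gap between adjacent axes as a fraction of one axis length,
// so n * len + (n - 1) * gutter * len must equal the total span.
function axisDomains(total, n, gutter) {
  var span = total[1] - total[0]
  var len = span / (n + (n - 1) * gutter)
  var step = len * (1 + gutter)
  var out = []
  for (var i = 0; i < n; i++) {
    var start = total[0] + i * step
    out.push([start, start + len])
  }
  return out
}

// Four x axes sharing the full width with a 10% gutter
console.log(axisDomains([0, 1], 4, 0.1))
```

Each resulting pair would become the default domain of one x (or y) axis, which the user could still override in layout.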
One variation that might be nice but I'm not sure: separate the list of axes from the dimensions. This could make it easier for example to reorder the dimensions without having to do all sorts of gymnastics with swapping axis attributes (though we might need to swap axis titles still, if they're not inherited from the dimension names):
trace = {
dimensions: [{
values: [/* */],
name: 'Sepal Width', // used as default x/y axis titles
// some scatter style props ...
}],
xaxes: ['x', 'x2', 'x3', ...], // defaults to the first N x axis IDs. info_array, Not data_array.
yaxes: ['y', 'y2', 'y3', ...],
...
}
Also, it might be nice to move the axis arrangement to layout, but still have splom provide defaults for this. That way we could reuse it for other cases that want a grid of axes, not just splom:
// splom trace would still have axis ids in it but no axis layout info (domain or gutter)
layout = {
grid: {
xaxes: ['x', 'x2', 'x3', ...],
yaxes: ['y', 'y2', 'y3', ...],
domain: { x: [0, 1], y: [0, 1] },
gutter: 0.1
}
}
Cases like splom would use 1D arrays of x/y axes, as all rows share the same x axes and all columns share the same y axes, but we could also allow 2D arrays for when you want a grid of uncoupled axes. And if you put '' in any entry it leaves that row/col/cell blank, and at some point we can make a way to refer to empty cells in other trace/subplot types, so in a pie trace or a 3d scene etc. you could add something like gridcell: [1, 2] which would automatically generate the appropriate domain for you.
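For the 1D case described above, the set of subplots implied by a grid could be enumerated along these lines. gridSubplots is a hypothetical helper; the only conventions taken from plotly.js are the axis IDs and the subplot ID formed by joining an x axis ID and a y axis ID (for example 'x2y3'). The 2D-array case is left out because its exact shape is not specified here.

```javascript
// Hypothetical helper: lists the cells of a 1D-array grid, skipping blanks.
// Assumption: an empty string in xaxes blanks a whole column, and an empty
// string in yaxes blanks a whole row.
function gridSubplots(grid) {
  var subplots = []
  grid.yaxes.forEach(function(ya, row) {
    grid.xaxes.forEach(function(xa, col) {
      if (xa === '' || ya === '') return
      subplots.push({row: row, col: col, id: xa + ya})
    })
  })
  return subplots
}

var cells = gridSubplots({xaxes: ['x', '', 'x3'], yaxes: ['y', 'y2']})
console.log(cells.map(function(c) { return c.id }))
```

The blank column is simply absent from the result, leaving its cells free for something like a pie trace or a 3d scene that refers to them by gridcell.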
Actually, this would make it easy to support multiple splom traces regardless of whether they have the same or different dimensions:
- In supplyDefaults, we'd look through all splom traces and find the full set of xaxes and yaxes to use as the defaults in fullLayout.grid (but the user could override these lists if they wanted), as well as to populate the axis and subplot lists in fullLayout._subplots.
- In fullLayout.grid, we'd coerce grid.domain and grid.gutter.
- For axes (and anything with gridcell attributes), default domain values would be generated based on grid.
- After the supplyDefaults step, grid and gridcell attributes would be ignored because the appropriate domain values would have been filled in already.
That way all of this would happen automatically if you just make a splom trace with N dimensions and don't say anything about its layout, but you could alter it all at various stages if you want to.
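The first step, gathering the full set of axes from every splom trace, might look something like the sketch below. collectGridAxes is a hypothetical name and this is not the actual supplyDefaults code; it only assumes the xaxes / yaxes ID arrays proposed for the trace.

```javascript
// Hypothetical sketch: union of the axis IDs used by all splom traces,
// in first-seen order, to serve as defaults for grid.xaxes / grid.yaxes.
function collectGridAxes(traces) {
  var xaxes = []
  var yaxes = []

  function addUnique(list, ids) {
    var arr = ids || []
    arr.forEach(function(id) {
      if (list.indexOf(id) === -1) list.push(id)
    })
  }

  traces.forEach(function(trace) {
    if (trace.type !== 'splom') return
    addUnique(xaxes, trace.xaxes)
    addUnique(yaxes, trace.yaxes)
  })

  return {xaxes: xaxes, yaxes: yaxes}
}

var gridDefaults = collectGridAxes([
  {type: 'splom', xaxes: ['x', 'x2'], yaxes: ['y', 'y2']},
  {type: 'scatter', xaxis: 'x5', yaxis: 'y5'},
  {type: 'splom', xaxes: ['x2', 'x3'], yaxes: ['y2', 'y3']}
])
console.log(gridDefaults)
```

Two splom traces with partly overlapping dimensions then share the overlapping axes, while a user-supplied grid could still override the collected lists.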
What I'm trying to avoid above, but which might give even higher performance at the expense of flexibility, as the axis rendering could be tailored to the splom case:
trace = {
dimensions: [{
values: [/* */],
xaxis: { /* all the x axis attributes like title, tick/grid specs, fonts, etc */ },
yaxis: { /* same for y - or these could go in xaxes/yaxes arrays but still in the trace */ }
}]
}
or in trace.xaxes and trace.yaxes, which would be arrays of objects rather than arrays of IDs... either way the point is no other traces would be able to use these axes, which means they could use stripped-down rendering machinery for better performance but less flexibility.
My hope though is that the SVG axis machinery is fast enough, especially if we avoid having splom contribute to fullLayout._subplots.cartesian or fullLayout._subplots.gl2d (which would scale quadratically with number of dimensions, vs the number of x/y axes, fullLayout._subplots.(x|y)axis, which scale linearly) so we only draw the axes in SVG, and let splom draw gridlines (if required) in WebGL.
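The quadratic-versus-linear argument can be checked with a little arithmetic. The sketch below counts subplots and an upper bound on axes for N dimensions, using the showdiagonal / showupperhalf / showlowerhalf options from the proposal; splomCounts is a hypothetical helper written only for illustration.

```javascript
// Hypothetical helper: how many subplots and (at most) how many axes
// a splom with nDims dimensions would need.
// Assumption: each option defaults to true when omitted.
function splomCounts(nDims, opts) {
  var o = opts || {}
  var diagonal = o.showdiagonal !== false
  var upper = o.showupperhalf !== false
  var lower = o.showlowerhalf !== false

  // cells strictly above (or strictly below) the diagonal
  var offDiagonal = nDims * (nDims - 1) / 2
  var subplots = (diagonal ? nDims : 0) +
    (upper ? offDiagonal : 0) +
    (lower ? offDiagonal : 0)

  // one x axis per column and one y axis per row, at most
  return {maxAxes: 2 * nDims, subplots: subplots}
}

console.log(splomCounts(10))
console.log(splomCounts(40))
console.log(splomCounts(40, {showdiagonal: false, showupperhalf: false}))
```

At 40 dimensions the full matrix has 1600 subplots but no more than 80 axes, which is the gap the comment above hopes to exploit.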
Thanks for the :books: @alexcjohnson
I'm a big fan of those xaxes and yaxes info arrays in the traces :+1: Using the plural here is great as they won't conflict with the current xaxis / yaxis trace attributes.
About your grid proposal, I'm curious to see if we could combine the numerous xy subplot-wide but not graph-wide requested settings in them (https://github.com/plotly/plotly.js/issues/1468, https://github.com/plotly/plotly.js/issues/233, https://github.com/plotly/plotly.js/issues/2274 and per-subplot plot_bgcolor, to name a few).
Now, to give a more concrete example (to e.g. @dfcreative :wink:), the Iris splom (e.g. https://codepen.io/etpinard/pen/Vbzxqa) would be declared as:
var url = 'https://cdn.rawgit.com/plotly/datasets/master/iris.csv'
var colors = ['red', 'green', 'blue']
Plotly.d3.csv(url, (err, rows) => {
var keys = Object.keys(rows[0]).filter(k => k !== 'Name')
var names = rows.map(r => r.Name).filter((v, i, self) => self.indexOf(v) === i)
var xaxes = keys.map((_, i) => 'x' + (i ? i + 1 : ''))
var yaxes = keys.map((_, i) => 'y' + (i ? i + 1 : ''))
var data = names.map((name, i) => {
var rowsOfName = rows.filter(r => r.Name === name)
var trace = {
type: 'splom',
name: name,
dimensions: keys.map(k => ({
// 'label' would be better here than 'name' (parcoords uses 'label')
label: k,
// d3.csv rows are objects keyed by column name, so index by key
values: rowsOfName.map(r => r[k])
})),
marker: {color: colors[i]},
// the default (for clarity)
showlegend: true,
xaxes: xaxes,
yaxes: yaxes
}
return trace
})
var layout = {
grid: {
xaxes: xaxes,
yaxes: yaxes,
domain: { x: [0, 1], y: [0, 1] },
gutter: 0.1
}
}
Plotly.newPlot('graph', data, layout)
})
That is, one splom trace per :wilted_flower: type and one dimension per observed field in each trace.
> My hope though is that the SVG axis machinery is fast enough, especially if we avoid having splom contribute to fullLayout._subplots.cartesian or fullLayout._subplots.gl2d (which would scale quadratically with number of dimensions, vs the number of x/y axes, fullLayout._subplots.(x|y)axis, which scale linearly) so we only draw the axes in SVG, and let splom draw gridlines (if required) in WebGL.
Interesting point here about the grid lines. It shouldn't be too hard to draw them in WebGL (much easier than axis labels :wink: at least), if we find SVG too slow.
May I add my 2¢? Why don't we just use existing scatter trace data/naming convention as
Plotly.newPlot(document.body, [{
type: 'scattermatrix',
x: [[], [], ...xdata],
y: [[], [], [], ...ydata]
}])
That would be familiar already for the users who know trace types and options.
> May I add my 50 cents?
Usually it's 2¢ but we like you so sure :)
> Why don't we just use existing scatter trace data/naming convention
Two things I don't like about this:
Anyway we do have a precedent for the structure I'm proposing, in parcoords. Then the marker attributes would be inherited directly from scatter.
> About your grid proposal, I'm curious to see if we could combine the numerous xy subplot-wide but not graph-wide requested settings in them (https://github.com/plotly/plotly.js/issues/1468, https://github.com/plotly/plotly.js/issues/233, https://github.com/plotly/plotly.js/issues/2274 and per-subplot plot_bgcolor, to name a few).
I suppose we could let grid provide these settings, the same way grid would be providing domain values for individual axes. But I wouldn't want this to be the only way to provide per-subplot settings, because not every multi-subplot layout can be described as a grid. Think of insets, or layouts like
+-------+ +---+
| | | |
| | +---+
| | +---+
| | | |
+-------+ +---+
I guess ^^ could be massaged into the grid format with concepts like colspan / rowspan, and maybe we'll do that, but that would still make it awkward to provide per-subplot attributes, and insets would still be difficult to describe this way.
So I still think we'll need something like https://github.com/plotly/plotly.js/issues/2274#issuecomment-359310606 but perhaps grid would be allowed to provide defaults to that when the layout is conducive to it.
@dfcreative don't worry about grid while implementing splom; just use explicitly positioned x and y axes, and I'll work on grid separately. Then once it and splom are both ready we can integrate them.
Branch splom has some preliminary work on the user-attributes-full-attributes side of things (i.e. pretty much everything except the regl-scatter2d calls).
Things to note:
- splom traces have their own basePlotModule (similar to pie, parcoords, ...) that reuses some Cartesian methods
- splom default step generates default xaxes and yaxes lists using the number of dimensions the trace has
- grid.xaxes and grid.yaxes defaults
- fullLayout._subplots.cartesian and fullLayout._subplots.(x|y)axes so that things just work.
Just a couple of clarifying questions:
> splom traces have their own basePlotModule
Sounds great, just as long as this doesn't restrict us from displaying other data (be it splom or some other trace type) on the same axes.
> we'll make one regl-scatter2d (or equivalent) call per splom trace
I'm not really sure what a regl-scatter2d call entails, but the key optimization we need over making a million scattergl subplots is to only upload the values data for each dimension to the GPU once, even though it will appear in somewhere between N-1 and 2N subplots. Does this strategy do that?
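To illustrate the optimization being asked about, here is a sketch in plain JavaScript where one typed array per dimension stands in for one GPU buffer, and each subplot only stores references to two of them. This is not the regl-scatter2d API; buildSplomBuffers and the shape of its return value are assumptions made for illustration.

```javascript
// Hypothetical sketch: one buffer per dimension, shared by every subplot that uses it.
// A Float32Array stands in for a GPU buffer, so "creating" one models a single upload.
function buildSplomBuffers(dimensions) {
  var buffers = dimensions.map(function(dim) {
    return new Float32Array(dim.values)
  })

  var passes = []
  for (var i = 0; i < buffers.length; i++) {
    for (var j = 0; j < buffers.length; j++) {
      // Each subplot references existing buffers; no values are copied here.
      passes.push({row: i, col: j, x: buffers[j], y: buffers[i]})
    }
  }
  return {buffers: buffers, passes: passes}
}

var built = buildSplomBuffers([
  {values: [1, 2, 3]},
  {values: [4, 5, 6]},
  {values: [7, 8, 9]}
])
console.log(built.buffers.length, built.passes.length)
```

With N dimensions this creates N buffers for N * N subplots, whereas N * N independent scattergl traces would each bring their own x and y arrays.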
> Sounds great, just as long as this doesn't restrict us from displaying other data (be it splom or some other trace type) on the same axes.
Yes, for sure :ok_hand:
> I'm not really sure what a regl-scatter2d call entails, but the key optimization we need over making a million scattergl subplots is to only upload the values data for each dimension to the GPU once, even though it will appear in somewhere between N-1 and 2N subplots. Does this strategy do that?
Here's a sneak peek:
Here are some observations on splom-generated cartesian subplots:
Off the splom branch with commits from https://github.com/plotly/plotly.js/pull/2474 and using the following script:
var Nvars = ???
var Nrows = 2e4 // makes no difference for now
var dims = []
for(var i = 0; i < Nvars; i++) {
dims.push({values: []})
for(var j = 0; j < Nrows; j++) {
dims[i].values.push(Math.random())
}
}
Plotly.purge(gd);
console.time('splom')
Plotly.plot(gd, [{
type: 'splom',
dimensions: dims
}])
console.timeEnd('splom')
I got:
where I added console.time / console.timeEnd pairs in the slowest subroutines, i.e. the ones that scale with the total number of subplots or Math.pow(dimensions.length, 2).
A few quick hits:
- initInteractions execution can be :hocho: by setting staticPlot: true (duh), but even setting the more obscure config options showAxisDragHandles and showAxisRangeEntryBoxes to false can reduce its execution time by a factor of 4.
- lsInner is currently called twice via layoutStyles here and here (and a third time on graphs with margin-pushing things). At 40 dimensions (that's 1600 subplots), it takes a whopping 2700ms to execute. That is, more than half of the total plotting time is in there. I'll try to first make sure the slow parts are called only once. But we might need more aggressive optimization at some point.
- Axes.doTicks speeds up the doAxes step by a factor of 2. That's good because we can probably use regl-line2d to draw those lines more efficiently. That said, we'll also have to speed up the label-drawing step, mostly via https://github.com/plotly/plotly.js/issues/1988 and fixOverlappingLabels.
Work in progress: https://dfcreative.github.io/regl-scattermatrix/
Quick update:
lsInner calls was easy enough in https://github.com/plotly/plotly.js/pull/2474/commits/e810c1ee55900132caa009b5f96e1644d272634b. Next, I'll try to merge as much logic as possible from Cartesian.drawFramework with lsInner so that we can hopefully loop over all the <g subplot> only once.
Interesting finding:
Drawing.setClipUrl call can speed up lsInner by 10x at 40 dimensions (or 1600 subplots)! Even when the page has no <base>! I suspect that traversing the DOM when you have 1600 <g subplot> is slow :turtle: (duh!). This should be an easy fix: call d3.select('base') once (i.e. not for every Drawing.setClipUrl call) and stash it somewhere.
There's also document.baseURI; perhaps we can bypass base and just check if document.baseURI === window.location.href
too bad. Although :arrow_heading_up: is from w3school :laughing:
https://developer.mozilla.org/en-US/docs/Web/API/Node/baseURI is incomplete:
New benchmarks post https://github.com/plotly/plotly.js/pull/2474/commits/5887104139256934bbf554bf62685fbec62585d2 (which I pushed to https://github.com/plotly/plotly.js/pull/2474 - hopefully @alexcjohnson won't mind):
Things are looking up :guitar:
Next steps:
Math.pow(dimensions.length, 2))
A first attempt at drawing grid lines using @dfcreative's regl-line2d was positive.
Here are the numbers (in ms) with all axes having the same gridcolor and gridwidth:
| # of dims | SVG | regl-line2d |
|---|---|---|
| 10 | 70 | 80-100 |
| 20 | 200 | 140-150 |
| 30 | 500 | 150-200 |
| 40 | 800 | 300 |
| 50 | 1500 | 350 |
In brief, we start to see improvements over SVG at around 15 dimensions (i.e. 15x15=225 subplots).
SPLOMs are coming to plotly.js.
For the uninitiated, docs on the Python API scatterplotmatrix figure factory are here. Seaborn calls it a pairplot. Matlab has a plotmatrix draw function.
Some might say that SPLOMs are already part of plotly.js: all we have to do is generate traces for each combination of variables and plot them on an appropriate axis layout (example).
But, this technique has a few limitations and inconveniences:
Numerous solutions are available. This issue will attempt to spec out the best one.
cc @dfcreative @alexcjohnson @cldougl @chriddyp