Scatter Plot Matrix (aka SPLOM) discussion

etpinard commented 6 years ago

SPLOMs are coming to plotly.js.

For the uninitiated, docs on the python api scatterplotmatrix figure factory are here. Seaborn calls it a pairplot. Matlab has a plotmatrix function.

Some might say that SPLOMs are already part of plotly.js: all we have to do is generate traces for each combination of variables and plot them on an appropriate axis layout (example).

But, this technique has a few limitations and inconveniences:

data arrays are duplicated, which impacts performance when the number of variables and/or the data array lengths are large
creating the axes layout and correctly linking the scatter traces is tedious. Note that the python api exposes a few tools to make this smoother, but these aren't available to plotly.js users.
....
feel free to edit and append this list
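
To make the manual approach concrete, here is a minimal sketch of the trace and layout generation it requires. The helper name buildPairwiseScatter and the evenly split, gap-free axis domains are assumptions for illustration, not an existing plotly.js utility.

```javascript
// Hypothetical helper: builds one scatter trace per (x, y) pair of columns
// and a layout with one x axis per matrix column and one y axis per matrix row.
function buildPairwiseScatter(columns) {
  var n = columns.length
  var data = []
  var layout = {}

  for (var k = 0; k < n; k++) {
    // axis ids follow the plotly.js convention: 'x', 'x2', 'x3', ...
    var suffix = k ? String(k + 1) : ''
    // split [0, 1] evenly with no gaps (an assumption to keep the sketch short)
    layout['xaxis' + suffix] = {domain: [k / n, (k + 1) / n]}
    layout['yaxis' + suffix] = {domain: [k / n, (k + 1) / n]}
  }

  for (var i = 0; i < n; i++) {
    for (var j = 0; j < n; j++) {
      data.push({
        type: 'scatter',
        mode: 'markers',
        x: columns[i],
        y: columns[j],
        xaxis: 'x' + (i ? i + 1 : ''),
        yaxis: 'y' + (j ? j + 1 : '')
      })
    }
  }

  return {data: data, layout: layout}
}

var fig = buildPairwiseScatter([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

With N columns this produces N * N traces, and each column is referenced N times as x and N times as y, so a JSON-serialized figure carries every column 2N times.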

Numerous solutions are available. This issue will attempt to spec out the best one.

cc @dfcreative @alexcjohnson @cldougl @chriddyp

etpinard commented 6 years ago

Solution 1 (aka splom overlord)

Add a new do-it-all splom (and possibly a splomgl too) trace type that generates its own internal scatter traces and its own axes - with an api similar to parcoords:

trace = {
  dimensions: [{
     values: [/* */],
     // some scatter style props ...
     // some axis props reused from cartesian axes
  }],
  // some splom-wide options e.g.:
  showdiagonal: true || false,
  showupperhalf: true || false,
  showlowerhalf: true || false,
  direction: 'top-left-to-bottom-right' || 'bottom-left-to-top-right',
  // ...
}

PROs

easy to make simple case

CONs

not compatible with other cartesian traces (e.g. cannot overlay additional traces on a particular subplot)
not compatible with data-ref layout features (e.g. annotations, shapes and images)

etpinard commented 6 years ago

Solution 2 (tooling)

Port make_subplots and append_traces from the python api in plotly.js (docs). For example:

var Plotly = require('plotly.js')

var fields = [
   [/* */],
   [/* */],
   // ...
]

var layout = Plotly.makeSubplots({rows: fields.length, cols: fields.length})
var data = []

for (var i = 0; i < fields.length; i++) {
  for (var j = 0; j < fields.length; j++) {
    var trace = {
        mode: 'markers',
        x: fields[i],
        y: fields[j]
    }
    Plotly.linkToSubplot(trace, i, j)    
    data.push(trace)
  }
}

Plotly.newPlot(gd, data, layout)

PROs

easy subplot generation
does not restrict the user: other trace types and layout features can be added

CONs

still somewhat tedious trace-to-subplot linking
does not address the duplicate array problem

etpinard commented 6 years ago

Solution 3 (data-array reusing)

This could be combined with solution 2 to solve the data-array-duplication problem. But this would also require some backend work for plot.ly support.

In short, we could add a new top-level argument to Plotly.newPlot and Plotly.react

var columns = [
  {name: 'col 0', values: [/* */]},
  {name: 'col 1', values: [/* */]},
  // ...
]

// unfortunately, in this paradigm columns should really be labeled data, 
// and data -> traces
var data = [{
   x: 'col 0',
   y: 'col 1'
}, {
  x: 'col 1',
  y: 'col 0'
}]

Plotly.newPlot(gd, {
  columns: columns,
  data: data,
  layout: {}
})

PROs

all trace types could benefit from not duplicating data array

CONs

probably the hardest to implement, especially when considering plot.ly backend work.
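
As a rough sketch of what the lookup step could look like on the plotly.js side: the function name resolveColumns, and the choice to treat only x and y as column references, are assumptions made for illustration. Nothing like this exists in the library today.

```javascript
// Hypothetical resolver: swaps column names found in `x` / `y` for the
// matching `values` array. Arrays are passed by reference, so no copy is made.
function resolveColumns(columns, traces) {
  var byName = {}
  columns.forEach(function(col) { byName[col.name] = col.values })

  // assumption: only these attributes may hold a column reference
  var refAttrs = ['x', 'y']

  return traces.map(function(trace) {
    var out = {}
    Object.keys(trace).forEach(function(key) {
      var val = trace[key]
      var isRef = refAttrs.indexOf(key) !== -1 &&
        typeof val === 'string' &&
        Object.prototype.hasOwnProperty.call(byName, val)
      out[key] = isRef ? byName[val] : val
    })
    return out
  })
}

var cols = [
  {name: 'col 0', values: [1, 2, 3]},
  {name: 'col 1', values: [4, 5, 6]}
]
var resolved = resolveColumns(cols, [
  {x: 'col 0', y: 'col 1'},
  {x: 'col 1', y: 'col 0'}
])
```

Both resolved traces point at the same two arrays, which is the property that would let every trace type avoid duplicating its data.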

alexcjohnson commented 6 years ago

I think it's clear we want to encapsulate a splom in a single trace, like solution 1. Solution 2 won't give the necessary performance benefits. Solution 3 may give some of the performance we need, and may be useful for more generalized trace linking in the future (for example, things like 2dhistogram_contour_subplots where the x and y data are duplicated in the scatter and histogram2dcontour traces, then x and y each get another copy in the 1D histograms) but will still suffer from duplication at the calc/plot level, which I suspect will be prohibitive for us. Likewise it seems to me it's only reasonable to make this as a WebGL type.

The question in my mind is whether we can do it by linking the splom trace to regular cartesian axes, and using it to tailor the defaults for those axes, or if we need to have even the axes encapsulated in the trace itself. If we can do the former, then we retain the flexibility to display other traces on those same subplots. Extra data that we only have for one attribute pair, for example, or a curve fit, or some different type of display on the diagonal. Or even another splom that might even have a disjoint set of dimensions from the first (might be a huge headache but see below for more thoughts)

Preferred option: refer to regular cartesian axes

trace = {
  dimensions: [{
    values: [/* */],
    name: 'Sepal Width' // used as default x/y axis titles
    xaxis: 'x' | 'x2' ... // defaults to ith x axis ID for dimension i
    yaxis: 'y' | 'y2' ...
  }],
  marker: {
    // just like scatter, and all the same ones are arrayOk.
    // goes outside the `dimensions` array because the same data point should get
    // the same marker in all subplots.
  }
  // domain settings - not used directly, just fed into the defaults for all the
  // individual x/y axis domains
  domain: {
    // total domain to be divided among all x / y axes
    x: [0, 1],
    y: [0, 1],
    // blank space between x axes, as a fraction of the length of each axis
    // possibly xgutter and ygutter?
    gutter: 0.1
  }
  // some splom-wide options e.g.:

  // maybe turn these into a flaglist 'upper+lower+diagonal'?
  // these and related attrs will affect the default x/y axis anchor and/or side attributes
  showdiagonal: true || false,
  showupperhalf: true || false,
  showlowerhalf: true || false,

  // maybe xdirection and ydirection?
  direction: 'top-left-to-bottom-right' || 'bottom-left-to-top-right',
  // ...
};

layout = {
  xaxis: { /* overriding any of the defaults set by SPLOM */ },
  xaxis2: { /* */ },
  xaxis3: { /* */ },
  ... ,
  yaxis: { /* */ },
  ...
};

One variation that might be nice but I'm not sure: separate the list of axes from the dimensions. This could make it easier for example to reorder the dimensions without having to do all sorts of gymnastics with swapping axis attributes (though we might need to swap axis titles still, if they're not inherited from the dimension names):

trace = {
  dimensions: [{
    values: [/* */],
    name: 'Sepal Width' // used as default x/y axis titles
    // some scatter style props ...
  }],
  xaxes: ['x', 'x2', 'x3', ...], // defaults to the first N x axis IDs. info_array, Not data_array.
  yaxes: ['y', 'y2', 'y3', ...],
  ...
}

Bonus: layout.grid

Also, it might be nice to move the axis arrangement to layout, but still have splom provide defaults for this. That way we could reuse it for other cases that want a grid of axes, not just splom:

// splom trace would still have axis ids in it but no axis layout info (domain or gutter)
layout = {
  grid: {
    xaxes: ['x', 'x2', 'x3', ...],
    yaxes: ['y', 'y2', 'y3', ...],
    domain: { x: [0, 1], y: [0, 1] },
    gutter: 0.1
  }
}

Cases like splom would use 1D arrays of x/y axes, as all rows share the same x axes and all columns share the same y axes, but we could also allow 2D arrays for when you want a grid of uncoupled axes. And if you put '' in any entry it leaves that row/col/cell blank, and at some point we can make a way to refer to empty cells in other trace/subplot types - so in a pie trace or a 3d scene etc you could add something like gridcell: [1, 2] which would automatically generate the appropriate domain for you.

Actually, this would make it easy to support multiple splom traces regardless of whether they have the same or different dimensions:

At the beginning of supplyDefaults we'd look through all splom traces and find the full set of xaxes and yaxes to use as the defaults in fullLayout.grid (but the user could override these lists if they wanted) as well as to populate the axis and subplot lists in fullLayout._subplots.
Since there's now a list of axes in fullLayout.grid, we'd coerce grid.domain and grid.gutter.
Then when supplying defaults for the individual axes (as well as other subplots and traces with gridcell attributes), default domain values would be generated based on grid.
After the supplyDefaults step, grid and gridcell attributes would be ignored because the appropriate domain values would have been filled in already.

That way all of this would happen automatically if you just make a splom trace with N dimensions and don't say anything about its layout, but you could alter it all at various stages if you want to.
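
The defaults flow described above can be sketched as two small helpers. Both names (collectAxes, gridDomains) are hypothetical, and the gutter semantics (gap size as a fraction of each axis length) are an assumption taken from the comment in the trace sketch earlier in this comment.

```javascript
// Hypothetical helper: union of the axis ids listed by all splom traces,
// in first-seen order. `attr` is 'xaxes' or 'yaxes'.
function collectAxes(traces, attr) {
  var seen = {}
  var out = []
  traces.forEach(function(trace) {
    var ids = trace.type === 'splom' ? (trace[attr] || []) : []
    ids.forEach(function(id) {
      if (!seen[id]) {
        seen[id] = true
        out.push(id)
      }
    })
  })
  return out
}

// Hypothetical helper: split `domain` into `n` equal axis domains separated
// by gaps of `gutter` times the axis length.
function gridDomains(n, domain, gutter) {
  var total = domain[1] - domain[0]
  // n axes of length len plus (n - 1) gaps of gutter * len fill the total
  var len = total / (n + (n - 1) * gutter)
  var step = len * (1 + gutter)
  var out = []
  for (var i = 0; i < n; i++) {
    var start = domain[0] + i * step
    out.push([start, start + len])
  }
  return out
}

var xaxes = collectAxes([
  {type: 'splom', xaxes: ['x', 'x2']},
  {type: 'splom', xaxes: ['x2', 'x3']}
], 'xaxes')
var xdomains = gridDomains(xaxes.length, [0, 1], 0.1)
```

Here xaxes holds three ids, and xdomains gives the default domain that each of the three x axes would receive unless the user overrides it.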

Alternative: axes also encapsulated in the trace

What I'm trying to avoid above, but might be even higher performance at the expense of flexibility, as the axis rendering could be tailored to the splom case:

trace = {
  dimensions: [{
    values: [/* */],
    xaxis: { /* all the x axis attributes like title, tick/grid specs, fonts, etc */ },
    yaxis: { /* same for y - or these could go in xaxes/yaxes arrays but still in the trace */ }
  }]
}

or in trace.xaxes and trace.yaxes which would be arrays of objects rather than arrays of IDs... either way the point is no other traces would be able to use these axes, which means they could use stripped down rendering machinery for better performance but less flexibility.

My hope though is that the SVG axis machinery is fast enough, especially if we avoid having splom contribute to fullLayout._subplots.cartesian or fullLayout._subplots.gl2d (which would scale quadratically with number of dimensions, vs the number of x/y axes, fullLayout._subplots.(x|y)axis, which scale linearly) so we only draw the axes in SVG, and let splom draw gridlines (if required) in WebGL.
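
To put numbers on the quadratic-versus-linear point, here is a small counting sketch. The function names are made up for illustration, the flag names follow the proposal above, and one x axis plus one y axis per dimension is assumed.

```javascript
// Number of subplots a splom with `n` dimensions would draw, depending on
// which parts of the matrix are shown.
function countSplomSubplots(n, opts) {
  var half = n * (n - 1) / 2
  return (opts.showdiagonal ? n : 0) +
    (opts.showupperhalf ? half : 0) +
    (opts.showlowerhalf ? half : 0)
}

// Number of axes, assuming one x axis and one y axis per dimension.
function countSplomAxes(n) {
  return 2 * n
}

var everything = {showdiagonal: true, showupperhalf: true, showlowerhalf: true}
var subplotsAt40 = countSplomSubplots(40, everything)
var axesAt40 = countSplomAxes(40)
```

At 40 dimensions the full matrix has 1600 subplots but only 80 axes, which is why keeping per-subplot work out of the SVG path matters so much more than per-axis work.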

etpinard commented 6 years ago

Thanks for the :books: @alexcjohnson

I'm a big fan of those xaxes and yaxes info arrays in the traces :+1: Using the plural here is great as they won't conflict with the current xaxis / yaxis trace attributes.

About your grid proposal, I'm curious to see if we could combine the numerous xy subplot-wide but not graph-wide requested settings in them (https://github.com/plotly/plotly.js/issues/1468, https://github.com/plotly/plotly.js/issues/233, https://github.com/plotly/plotly.js/issues/2274 and per subplot plot_bgcolor to name a few).

Now, to give a more concrete example (to e.g. @dfcreative :wink:), the Iris splom (e.g. https://codepen.io/etpinard/pen/Vbzxqa) would be declared as:

var url = 'https://cdn.rawgit.com/plotly/datasets/master/iris.csv'
var colors = ['red', 'green', 'blue']

Plotly.d3.csv(url, (err, rows) => {
  var keys = Object.keys(rows[0]).filter(k => k !== 'Name')
  var names = rows.map(r => r.Name).filter((v, i, self) => self.indexOf(v) === i)

  var xaxes = keys.map((_, i) => 'x' + (i ? i + 1 : ''))
  var yaxes = keys.map((_, i) => 'y' + (i ? i + 1 : ''))

  var data = names.map((name, i) => {
    var rowsOfName = rows.filter(r => r.Name === name)

    var trace = {
      type: 'splom',
      name: name,

      // parentheses make the arrow function return an object literal
      dimensions: keys.map(k => ({
        // 'label' would be better here than 'name' (parcoords uses 'label')
        label: k,
        // d3.csv rows are objects keyed by column name
        values: rowsOfName.map(r => r[k])
      })),

      marker: {color: colors[i]},

      // the default (for clarity)
      showlegend: true,

      xaxes: xaxes,
      yaxes: yaxes
    }

    return trace
  })

  var layout = {
    grid: {
      xaxes: xaxes,
      yaxes: yaxes,
      domain: { x: [0, 1], y: [0, 1] },
      gutter: 0.1
    }
  }

  Plotly.newPlot('graph', data, layout)
})

That is, one splom trace per :wilted_flower: type and one dimension per observed field in each trace.

etpinard commented 6 years ago

My hope though is that the SVG axis machinery is fast enough, especially if we avoid having splom contribute to fullLayout._subplots.cartesian or fullLayout._subplots.gl2d (which would scale quadratically with number of dimensions, vs the number of x/y axes, fullLayout._subplots.(x|y)axis, which scale linearly) so we only draw the axes in SVG, and let splom draw gridlines (if required) in WebGL.

Interesting point here about the grid lines. It shouldn't be too hard to draw them in WebGL (much easier than axis labels :wink: at least), if we find SVG too slow.

dy commented 6 years ago

May I add my 2¢? Why don't we just use existing scatter trace data/naming convention as

Plotly.newPlot(document.body, [{
  type: 'scattermatrix',
  x: [[], [], ...xdata],
  y: [[], [], [], ...ydata]
}])

That would be familiar already for the users who know trace types and options.

alexcjohnson commented 6 years ago

May I add my 50 cents?

Usually it's 2¢ but we like you so sure :)

Why don't we just use existing scatter trace data/naming convention

Two things I don't like about this:

A given data value isn't x or y, it's used for both in different subplots.
We need labels associated with each dimension, and we may want to be able to rearrange dimensions, both of which are a bit awkward if the data are in a 2D array.

Anyway, we do have a precedent for the structure I'm proposing, in parcoords. Then the marker attributes would be inherited directly from scatter.

alexcjohnson commented 6 years ago

About your grid proposal, I'm curious to see if we could combine the numerous xy subplot-wide but not graph-wide requested settings in them (https://github.com/plotly/plotly.js/issues/1468, https://github.com/plotly/plotly.js/issues/233, https://github.com/plotly/plotly.js/issues/2274 and per subplot plot_bgcolor to name a few).

I suppose we could let grid provide these settings, the same way grid would be providing domain values for individual axes. But I wouldn't want this to be the only way to provide per-subplot settings, because not every multi-subplot layout can be described as a grid - think of insets, or layouts like

+-------+ +---+
|       | |   |
|       | +---+
|       | +---+
|       | |   |
+-------+ +---+

I guess ^^ could be massaged into the grid format with concepts like colspan / rowspan, and maybe we'll do that, but that would still make it awkward to provide per-subplot attributes, and insets would still be difficult to describe this way.

So I still think we'll need something like https://github.com/plotly/plotly.js/issues/2274#issuecomment-359310606 but perhaps grid would be allowed to provide defaults to that when the layout is conducive to it.

@dfcreative don't worry about grid while implementing splom - just use explicitly positioned x and y axes, and I'll work on grid separately, then once it and splom are both ready we can integrate them.

etpinard commented 6 years ago

Branch splom has some preliminary work on the user-attributes-full-attributes side of things (i.e. pretty much everything except the regl-scatter2d calls).

etpinard commented 6 years ago

Things to note:

splom traces have their own basePlotModule (similar to pie, parcoords, ...) that reuses some Cartesian methods
the splom default step generates default xaxes and yaxes lists using the number of dimensions the trace has
we keep track of all splom axes to then use them as grid.xaxes and grid.yaxes defaults
even though splom traces have their own base plot module, we fill in fullLayout._subplots.cartesian and fullLayout._subplots.(x|y)axes so that things just work.
we'll make one regl-scatter2d (or equivalent) call per splom trace

alexcjohnson commented 6 years ago

Just a couple of clarifying questions:

splom traces have their own basePlotModule

Sounds great, just as long as this doesn't restrict us from displaying other data (be it splom or some other trace type) on the same axes.

we'll make one regl-scatter2d (or equivalent) call per splom trace

I'm not really sure what a regl-scatter2d call entails, but the key optimization we need over making a million scattergl subplots is to only upload the values data for each dimension to the GPU once, even though it will appear in somewhere between N-1 and 2N subplots. Does this strategy do that?

etpinard commented 6 years ago

Sounds great, just as long as this doesn't restrict us from displaying other data (be it splom or some other trace type) on the same axes.

Yes, for sure :ok_hand:

I'm not really sure what a regl-scatter2d call entails, but the key optimization we need over making a million scattergl subplots is to only upload the values data for each dimension to the GPU once, even though it will appear in somewhere between N-1 and 2N subplots. Does this strategy do that?

Here's a sneak peek:

etpinard commented 6 years ago

Here are some observations on splom-generated cartesian subplots:

Off the splom branch with commits from https://github.com/plotly/plotly.js/pull/2474 and using the following script:

var Nvars = ???
var Nrows = 2e4 // makes no difference for now
var dims = []

for(var i = 0; i < Nvars; i++) {
  dims.push({values: []})

  for(var j = 0; j < Nrows; j++) {
     dims[i].values.push(Math.random())
  }
}

Plotly.purge(gd);

console.time('splom')
Plotly.plot(gd, [{
  type: 'splom',
  dimensions: dims
}])
console.timeEnd('splom')

I got:

where I added console.time / console.timeEnd pairs in the slowest subroutines i.e. the ones that scale with the total number of subplots or Math.pow(dimensions.length, 2)

A few quick hits:

initInteractions execution can be :hocho: by setting staticPlot: true (duh), but even setting the more obscure config options showAxisDragHandles and showAxisRangeEntryBoxes to false can reduce its execution time by a factor of 4
lsInner is currently called twice via layoutStyles here and here (and a third time on graphs with margin-pushing things). At 40 dimensions (that's 1600 subplots), it takes a whopping 2700ms to execute. That is, more than half of the total plotting time is spent there. I'll first try to make sure the slow parts are called only once. But we might need more aggressive optimization at some point
Removing the grid-drawing step in Axes.doTicks speeds up the doAxes step by a factor of 2. That's good because we can probably use regl-line2d to draw those lines more efficiently. That said, we'll also have to speed up the label-drawing step, mostly via https://github.com/plotly/plotly.js/issues/1988 and fixOverlappingLabels.
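
For anyone who wants to try the config-side savings now, a minimal sketch of the two setups follows. staticPlot, showAxisDragHandles and showAxisRangeEntryBoxes are existing plotly.js config options; the variable names are just for illustration, and gd, data and layout in the usage comment are assumed to be defined elsewhere.

```javascript
// Keeps the plot interactive but skips the axis drag handles and the
// editable range entry boxes.
var lightConfig = {
  showAxisDragHandles: false,
  showAxisRangeEntryBoxes: false
}

// Disables interactions entirely.
var staticConfig = {
  staticPlot: true
}

// Usage (in the browser, with Plotly loaded):
// Plotly.newPlot(gd, data, layout, lightConfig)
```

The first variant is the one to reach for when zoom and hover still matter; the second only makes sense for a pure image-like render.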

dy commented 6 years ago

Work in progress https://dfcreative.github.io/regl-scattermatrix/

etpinard commented 6 years ago

Quick update:

halving the number of lsInner calls was easy enough in https://github.com/plotly/plotly.js/pull/2474/commits/e810c1ee55900132caa009b5f96e1644d272634b. Next, I'll try to merge as much logic as possible from Cartesian.drawFramework with lsInner so that we can hopefully loop over all the <g subplot> only once.

etpinard commented 6 years ago

Interesting finding:

Commenting out this particular Drawing.setClipUrl call can speed up lsInner by 10x at 40 dimensions (or 1600 subplots)! Even when the page has no <base>! I suspect that traversing the DOM when you have 1600 <g subplot> is slow :turtle: (duh!). This should be an easy fix: call d3.select('base') once (i.e. not for every Drawing.setClipUrl call ) and stash it somewhere.
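
The stash-it-once idea could look roughly like this. cacheOnce is a hypothetical helper, not existing plotly.js code, and the lookup passed to it stands in for the real d3.select('base') query so the sketch runs without a DOM.

```javascript
// Hypothetical wrapper: runs `lookup` on first use and reuses the result
// on every later call.
function cacheOnce(lookup) {
  var done = false
  var value
  return function() {
    if (!done) {
      value = lookup()
      done = true
    }
    return value
  }
}

// Stand-in for an expensive DOM query such as d3.select('base')
var lookupCalls = 0
var getBase = cacheOnce(function() {
  lookupCalls++
  return {href: null}
})

// one call per subplot, as would happen at 40 dimensions
for (var k = 0; k < 1600; k++) {
  getBase()
}
```

After the loop the lookup has run once instead of 1600 times. One caveat with this approach is that the cached value would go stale if a <base> element were added or changed after the first draw, so the real fix would need some way to reset it.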

alexcjohnson commented 6 years ago

There's also document.baseURI. Perhaps we can bypass base and just check if document.baseURI === window.location.href

etpinard commented 6 years ago

too bad. Although :arrow_heading_up: is from w3school :laughing:

https://developer.mozilla.org/en-US/docs/Web/API/Node/baseURI is incomplete:

etpinard commented 6 years ago

New benchmarks post https://github.com/plotly/plotly.js/pull/2474/commits/5887104139256934bbf554bf62685fbec62585d2 (which I pushed to https://github.com/plotly/plotly.js/pull/2474 - hopefully @alexcjohnson won't mind):

Things are looking up :guitar:

Next steps:

Minimize the number of loops over subplots
Minimize subplot load (i.e. things that scale as Math.pow(dimensions.length, 2))

etpinard commented 6 years ago

A first attempt at drawing grid lines using @dfcreative's regl-line2d was positive.

Here are the numbers (in ms) with all axes having the same gridcolor and gridwidth:

# of dims	SVG	`regl-line2d`
10	70	80-100
20	200	140-150
30	500	150-200
40	800	300
50	1500	350

In brief, we start to see improvements over SVG at around 15 dimensions (i.e. 15x15=225 subplots).
