Proposal for automatic linked selections

jonmmease commented 5 years ago

Overview

This is a high-level issue describing the proposed design for an approach to automating the process of creating cross-dataset linked brushing dashboards using HoloViews. As demonstrated in the glaciers demo, it is already possible to create these dashboards, but it requires a non-trivial amount of manual logic to wire up the selection streams, gather and combine selections, and update the displayed elements. And even then, the resulting dashboard is not very modular, as it's not possible to add additional linked views without modifying the existing structure.

The goal here is to take advantage of the fact that HoloViews visualization elements have information about the underlying dimensions of the datasets used to construct them. Similar to the way that HoloViews can automatically link the axes of matching dimensions across views, we would like to make use of this same information to enable linked selections.

Basic user workflow

The general workflow for building a linked dashboard with this approach is for the user to first construct a holoviews.Dataset object that includes all of the relevant dimensions across all of the elements that will be created.

The visualization elements (Scatter, Bar, etc.) are created from the Dataset using the Dataset.to method. These elements are then combined into a layout. This layout is then transformed into an interactive selection-linked version using a new link_selections operation (final name still TBD).

import pandas as pd
import holoviews as hv
from holoviews.operation.selection import link_selections
hv.extension('bokeh')

df = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv")
iris_ds = hv.Dataset(df)
scatter_el = iris_ds.to(hv.Scatter, "SepalLength", "SepalWidth", groupby=[])
histogram_el = iris_ds.hist("SepalLength", adjoin=False)

original_layout = scatter_el + histogram_el
linked_layout = link_selections(original_layout)
linked_layout

Screen Shot 2019-07-23 at 9 35 15 AM

Rather than constructing a holoviews.Dataset object and using the .to method, the higher-level hvplot library may also be used to create visualization elements directly from DataFrames.

import pandas as pd
import hvplot.pandas
from holoviews.operation.selection import link_selections

iris_df = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv")

scatter_el = iris_df.hvplot.scatter("SepalLength", "SepalWidth")
histogram_el = iris_df.hvplot.hist("SepalLength")

original_layout = scatter_el + histogram_el
linked_layout = link_selections(original_layout)
linked_layout

Visualizations across multiple notebook output cells can be linked together by constructing an instance of a new SelectionManager class, and providing that instance to each call to link_selections.

import pandas as pd
import hvplot.pandas
from holoviews.operation.selection import link_selections, SelectionManager

iris_df = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv")

scatter_el = iris_df.hvplot.scatter("SepalLength", "SepalWidth")
histogram_el = iris_df.hvplot.hist("SepalLength")

selection_manager = SelectionManager()
linked_layout1 = link_selections(scatter_el + histogram_el, selection_manager)
linked_layout2 = link_selections(iris_df.hvplot.hist("PetalLength"), selection_manager)

Supported Elements

This approach would support all elements based on tabular data sets that don't require the use of a colormap to be useful.

Anticipated supported elements:

Bars
Scatter
Bivariate
BoxWhisker
Distribution
ErrorBars
Histogram
Labels
Points
Polygons
Spikes
Violin
VectorField
Scatter3D (Plotly)

These are continuous elements that would require a different .select behavior to be supported. See below for details.

Curve
Path
Area
Spread
Path3D (Plotly)

Annotations. Some of these could be supported, but I'm not sure whether they should be.

Arrow
Box
Ellipse
HLine
Spline

Unknown

Chord, Graph: Are there streams that could be used to get the selected node? If so, and the nodes are drawn from some stable set of categories then this might make sense. But even if not, it might make sense to display selection overlays even if you can't make selections on the element.
Table: Can rows be colored by data, and is there a selection stream available for highlighting rows?
HexTiles, Raster, Image: Can we find a useful way to color subsets so that you can perceive continuous intensity and categorical color independently? If we turn each selection color into a light to dark colorscale then this would probably make sense.
QuadMesh, TriMesh: Same colorscale issue as above, plus deciding what you would be selecting (full quads/tris I would think).

Elements that would not be supported

RGB, HSV: Color is already fundamental to the element so we can't really use it for selection as well.
Contours: General contours extend beyond the selection area, so I don't think selecting individual contours is generally meaningful.
Sankey, RadialHeatMap: No real dimensions to build a selection on
Tiles: Nothing to select, it's a background

It's possible that these restrictions could be removed in the future. For now, elements that don't satisfy them will be skipped.

Core HoloViews implementation components

The goal is for the eventual implementation of link_selections and SelectionManager to be as small and contained as possible by implementing a series of generally useful concepts in the core of HoloViews. Each of these components will be implemented in a separate PR, and they should be sensible and useful independent of their eventual use in the link_selections operation.

Elements keep a reference to their source Dataset if available

In order to accomplish this goal of automating the process of creating linked selections, it is important that individual visualization elements maintain a reference to all of the dimensions in the original dataset, not only the dimensions that are needed to display the element. Perhaps the simplest way to accomplish this is for elements to maintain a reference back to the holoviews.Dataset object that they were created from. This won't always be available, in which case the dataset will be None. This can be implemented by adding a new read-only .dataset property to all elements (all LabelledData subclasses?), and updating various functions throughout the core to add or preserve this field.

Dataset.to should specify itself as the .dataset property of the element it returns.
Dataset.hist should do the same.
Constructing a dataset from an element with a known .dataset should return the dataset. This means that dataset is Dataset(dataset.to(hv.Scatter)) should be True.
Casting an element to another element type (using .to or the element constructor) should preserve the .dataset property.
Indexing and selecting should return an element that includes a version of the .dataset property that has been indexed/selected like the element.

General Principle: if an element has a .dataset property then it should be possible to reconstruct an identical element using the .dataset data and the kdims/vdims metadata. (Histogram is a slight known exception to this rule, because the bin edges from the original dimensions are needed as well). And it should be possible to create an element that references only a subset of the data using the select approach below.
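To make the principle concrete, here is a minimal pure-Python sketch. ToyDataset and ToyElement are hypothetical stand-ins invented for illustration; they are not HoloViews classes and only mimic the proposed behavior: an element keeps a read-only reference to its source dataset, can be rebuilt from that dataset plus its kdims/vdims, and carries an equally filtered dataset after a selection.

```python
class ToyDataset:
    """Hypothetical stand-in for a tabular dataset: a list of row dicts."""

    def __init__(self, rows):
        self.rows = list(rows)

    def to(self, kdims, vdims):
        # The dataset registers itself as the element's source.
        return ToyElement(self, kdims, vdims)

    def select(self, predicate):
        return ToyDataset(row for row in self.rows if predicate(row))


class ToyElement:
    """Hypothetical element that displays only its kdims and vdims."""

    def __init__(self, dataset, kdims, vdims):
        self._dataset = dataset
        self.kdims = list(kdims)
        self.vdims = list(vdims)

    @property
    def dataset(self):
        # Read-only: no setter is defined.
        return self._dataset

    @property
    def data(self):
        dims = self.kdims + self.vdims
        return [{d: row[d] for d in dims} for row in self._dataset.rows]

    def select(self, predicate):
        # The predicate sees every column of the source dataset, and the
        # returned element carries the equally filtered dataset.
        return ToyElement(self._dataset.select(predicate), self.kdims, self.vdims)


rows = [
    {"SepalLength": 5.1, "SepalWidth": 3.5, "Name": "setosa"},
    {"SepalLength": 7.0, "SepalWidth": 3.2, "Name": "versicolor"},
    {"SepalLength": 6.3, "SepalWidth": 3.3, "Name": "virginica"},
]
ds = ToyDataset(rows)
scatter = ds.to(["SepalLength"], ["SepalWidth"])

# Rebuilding from .dataset plus the kdims/vdims metadata gives the same data.
rebuilt = scatter.dataset.to(scatter.kdims, scatter.vdims)

# Select on "Name", a dimension the scatter itself does not display.
not_setosa = scatter.select(lambda row: row["Name"] != "setosa")
```

The point of the sketch is the last line: the selection refers to a column the element never displays, which is only possible because the element still knows its source dataset.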

Select data by dim expression

Data selections will be represented as symbolic expressions built using the holoviews.util.transform.dim class. To make it more natural to use these expressions for this purpose, the existing .select method on datasets and elements should accept predicate dim expressions. Additionally, if an element has a .dataset property, then these expressions should be able to reference all of the dimensions in .dataset, not only those listed in vdims/kdims.

Histogram elements are a bit special because in their .data property they store the bin edges and bin counts/frequencies. Currently, selection can only be performed on the single key-dimension. With these changes, this key-dimension selection will still behave just as before, but if a .dataset is available then the dataset will also be filtered by the same criteria (this should not require reaggregation). If the selection involves dimensions other than the key dimensions, then this will also trigger reaggregation using the same bins.
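The reaggregation case can be illustrated with a small pure-Python sketch. The bin_counts helper and the sample rows are made up for illustration and are not part of HoloViews; the only idea taken from the text is that a selection on a non-key dimension filters the source rows and re-bins them against the original bin edges.

```python
from bisect import bisect_right


def bin_counts(values, edges):
    """Count values per bin; bins are half-open except the last, which is closed."""
    counts = [0] * (len(edges) - 1)
    for value in values:
        if edges[0] <= value <= edges[-1]:
            index = min(bisect_right(edges, value) - 1, len(counts) - 1)
            counts[index] += 1
    return counts


rows = [
    {"SepalLength": 4.5, "PetalLength": 1.3},
    {"SepalLength": 5.5, "PetalLength": 4.0},
    {"SepalLength": 5.9, "PetalLength": 4.2},
    {"SepalLength": 6.5, "PetalLength": 5.5},
    {"SepalLength": 7.0, "PetalLength": 4.7},
]
edges = [4.0, 5.0, 6.0, 7.0]  # fixed when the histogram is first built

full = bin_counts([r["SepalLength"] for r in rows], edges)

# Selecting on PetalLength, which is not the histogram's key dimension,
# filters the source rows and re-bins them against the *same* edges.
selected_rows = [r for r in rows if r["PetalLength"] >= 4.5]
selected = bin_counts([r["SepalLength"] for r in selected_rows], edges)
```

Because both histograms share the same edges, the selected counts can be drawn directly on top of the full counts, bar for bar.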

Selection nan mode

To handle selections on continuous elements (Curve, Area, etc.) it is important to maintain NaN value(s) where data were rejected by the selection criteria, otherwise it's not possible to break the element properly. Here's an example of the problem that arises with an Area element with the default selection behavior.

area * area.select(y=(0.5, None))

Screen Shot 2019-07-23 at 6 39 24 AM

The proposal here is to add a new kwarg to .select to control how rejected data is handled. Naming is still up for discussion, but something like selection_mode, with three options:

'filter': Remove all rows that don't satisfy the criteria (current and default behavior)
'mask': Replace values in all rows that don't satisfy criteria with NaNs. This would behave somewhat like the pandas where method.
'nan_join': Replace contiguous blocks of rows that don't satisfy criteria with a single row containing NaNs. This is what the selection framework would use.

area * area.select(selection_mode='nan_join', y=(0.5, None))

Screen Shot 2019-07-23 at 10 28 48 AM
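The difference between the three proposed modes can be sketched in plain Python on a list of (x, y) rows. The select_rows function below is a hypothetical illustration, not the proposed .select implementation; it assumes that a rejected row is replaced by NaN in every column.

```python
import math


def select_rows(rows, predicate, selection_mode="filter"):
    """Apply predicate to rows using one of the three proposed modes."""

    def nan_row(row):
        return tuple(math.nan for _ in row)

    if selection_mode == "filter":
        # Drop rejected rows entirely.
        return [row for row in rows if predicate(row)]
    if selection_mode == "mask":
        # Keep the row count; rejected rows become all-NaN rows.
        return [row if predicate(row) else nan_row(row) for row in rows]
    if selection_mode == "nan_join":
        # Collapse each contiguous run of rejected rows into one NaN row.
        result = []
        in_rejected_block = False
        for row in rows:
            if predicate(row):
                result.append(row)
                in_rejected_block = False
            elif not in_rejected_block:
                result.append(nan_row(row))
                in_rejected_block = True
        return result
    raise ValueError(f"unknown selection_mode: {selection_mode!r}")


rows = [(0, 0.2), (1, 0.7), (2, 0.9), (3, 0.1), (4, 0.3), (5, 0.8)]


def keep(row):
    return row[1] >= 0.5


filtered = select_rows(rows, keep, "filter")
masked = select_rows(rows, keep, "mask")
joined = select_rows(rows, keep, "nan_join")
```

With 'filter', the rows at x=2 and x=5 become neighbors, so a continuous element would draw a segment across the rejected region. With 'nan_join', a single NaN row sits between them and breaks the line there.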

An alternative to this approach would be to deem these continuous elements incompatible with linked selection and remove them from the supported elements list. These are the elements that would not be supported in this case:

Area
Curve
Path
Spread
Path3D (Plotly)

Selection expression from linked stream

Elements should be given a method that inputs an instance of a LinkedStream subclass, and returns a symbolic dim expression that includes the data that would be selected by that stream, or None if the stream cannot be used to select data from the element. For example:

# BoundsXY bounds are ordered (left, bottom, right, top)
stream = BoundsXY(bounds=(0, 2, 1, 3))
element = Scatter(df, "A", "B")
expr = element.build_expr_for_stream(stream)
expr

(dim(A) >= 0) & (dim(A) <= 1) & (dim(B) >= 2) & (dim(B) <= 3)

There would also be a corresponding method on the stream class itself that delegates to the stream's source element.

element = Scatter(df, "A", "B")
stream = BoundsXY(bounds=(0, 2, 1, 3), source=element)
expr = stream.build_expr()
expr

(dim(A) >= 0) & (dim(A) <= 1) & (dim(B) >= 2) & (dim(B) <= 3)

When build_expr_for_stream is called on a DynamicMap with kdims (sliders), the resulting expression will also restrict to current values of these key dimensions.
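As a rough illustration of how a box selection could turn into a predicate, including the extra key-dimension restriction for a DynamicMap, here is a pure-Python sketch. The helper names (build_clauses, to_text, matches) and the clause representation are hypothetical and do not correspond to the dim class; the sketch assumes the bounds tuple is ordered (left, bottom, right, top).

```python
import operator

OPERATORS = {">=": operator.ge, "<=": operator.le, "==": operator.eq}


def build_clauses(bounds, x_dim, y_dim, key_values=None):
    """Turn a box selection into (dimension, operator, value) clauses."""
    left, bottom, right, top = bounds  # assumed ordering
    clauses = [
        (x_dim, ">=", left),
        (x_dim, "<=", right),
        (y_dim, ">=", bottom),
        (y_dim, "<=", top),
    ]
    # Current slider values restrict the selection as well.
    for name, value in (key_values or {}).items():
        clauses.append((name, "==", value))
    return clauses


def to_text(clauses):
    return " & ".join(f"(dim({name}) {op} {value!r})" for name, op, value in clauses)


def matches(clauses, row):
    return all(OPERATORS[op](row[name], value) for name, op, value in clauses)


box = build_clauses((0, 1, 2, 3), "A", "B")
box_for_year = build_clauses((0, 1, 2, 3), "A", "B", key_values={"year": 2019})
```

Keeping the selection as data rather than as a list of selected row indices is what lets the same selection be applied to a different element built from the same dataset.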

Selection Manager implementation

The SelectionManager will be a parameterized class with exprs and colors properties to hold the current selection expressions and current selection colors. A Param stream wrapping an instance of this class will be used as input to the DynamicMap instances that produce the selection overlays.
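The data flow described here can be sketched without Param or HoloViews. In the sketch below, ToySelectionManager, subscribe, and set_exprs are hypothetical names, plain callbacks stand in for the Param stream, and Python callables stand in for dim expressions; only the exprs and colors attributes come from the proposal.

```python
class ToySelectionManager:
    """Hypothetical stand-in: holds selection predicates and their colors."""

    def __init__(self, colors):
        self.exprs = []
        self.colors = list(colors)
        self._subscribers = []

    def subscribe(self, callback):
        # Stands in for a DynamicMap listening to the Param stream.
        self._subscribers.append(callback)

    def set_exprs(self, exprs):
        self.exprs = list(exprs)
        for callback in self._subscribers:
            callback(self.exprs, self.colors)


rows = [{"A": 1, "B": 10}, {"A": 2, "B": 20}, {"A": 3, "B": 30}]
rendered = {}


def make_view(name, dimension):
    def redraw(exprs, colors):
        # One (color, values) overlay per active selection.
        rendered[name] = [
            (color, [row[dimension] for row in rows if expr(row)])
            for expr, color in zip(exprs, colors)
        ]

    return redraw


manager = ToySelectionManager(colors=["red", "blue"])
manager.subscribe(make_view("view_a", "A"))
manager.subscribe(make_view("view_b", "B"))

# A selection made in any view is pushed to every subscribed view.
manager.set_exprs([lambda row: row["A"] >= 2])
```

Sharing one manager instance between views is what would allow plots in separate notebook cells to stay linked.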

link_selections implementation

The link_selections operation will input a holoviews object and map it into a new object where supported elements have been overlaid with DynamicMaps producing the selection overlays. These DynamicMaps will input the SelectionManager Param stream and will use it to compute the subsets.

Object with supported type

HoloViews objects with a .type property that is a supported element type will be overlaid with a .select selection on the object. This will handle simple elements, HoloMaps, GridSpaces, and DynamicMaps that return a supported element type.

If a DynamicMap is encountered that has not been initialized (.type is None), the initialize_dynamic function will be called on it to make this type information available.

DynamicMaps with unsupported type

When processing a DynamicMap that returns an unsupported element, link_selections will recursively walk through the inputs to the DynamicMap's callback, looking for an object with a supported element type. If one is discovered, then the selections will be performed at that point in the pipeline, and the selected element(s) will be passed through the rest of the pipeline before being overlaid.

This approach will enable link_selections to handle DynamicMaps created by the rasterize and datashade operations.
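The recursive walk can be sketched with a toy pipeline. Stage, evaluate, and count_by_half are hypothetical names for illustration only; count_by_half is a crude stand-in for a rasterize-like step, and the sketch does not use HoloViews or datashader.

```python
SUPPORTED_KINDS = {"Points"}


class Stage:
    """Hypothetical pipeline node: an output kind, a callback, and its inputs."""

    def __init__(self, kind, compute, inputs=()):
        self.kind = kind
        self.compute = compute
        self.inputs = tuple(inputs)


def evaluate(stage, predicate=None):
    if predicate is not None and stage.kind in SUPPORTED_KINDS:
        # First supported stage found: compute it unselected, then select here.
        rows = evaluate(stage)
        return [row for row in rows if predicate(row)]
    # Unsupported kind: keep walking upstream, then rerun this stage's callback
    # on whatever (possibly selected) inputs come back.
    values = [evaluate(source, predicate) for source in stage.inputs]
    return stage.compute(*values)


def load_points():
    return [
        {"x": 0.1, "label": "a"},
        {"x": 0.2, "label": "b"},
        {"x": 0.7, "label": "a"},
        {"x": 0.9, "label": "a"},
    ]


def count_by_half(rows):
    # Crude "rasterize": count points left and right of x = 0.5.
    left = sum(1 for row in rows if row["x"] < 0.5)
    return [left, len(rows) - left]


points = Stage("Points", load_points)
raster = Stage("Image", count_by_half, inputs=[points])

all_counts = evaluate(raster)
selected_counts = evaluate(raster, lambda row: row["label"] == "a")
```

The selection is applied to the points, not to the raster, so the downstream aggregation reflects only the selected rows.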

Overlay vs colorscale

For many element types, overlaying a subset of the original element in a different color is a good way to display selections. But some element types are better suited for displaying all of the colors for all selections in a single element. Scatter3D is one example. Other examples that aren't supported in HoloViews yet are the Plotly Parallel Coordinates and Parallel Categories plot types. For all of these cases, the best way to represent the various selections is to internally use a discrete colorscale.

To support this use-case, the selection manager should provide a method to compute an array of the selection index for each data point in an element. So somewhere we'll need to store the information about which selection method is best for each element.
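A minimal sketch of such a selection-index computation follows. The selection_index function is hypothetical, and two conventions in it are assumptions rather than part of the proposal: index 0 means "not selected", and when a row matches several selections the later one wins.

```python
def selection_index(rows, predicates):
    """Return, per row, 0 if unselected or the 1-based index of its selection."""
    indices = []
    for row in rows:
        index = 0
        for position, predicate in enumerate(predicates, start=1):
            if predicate(row):
                index = position  # assumption: later selections take precedence
        indices.append(index)
    return indices


rows = [{"value": 1}, {"value": 5}, {"value": 8}, {"value": 12}]
predicates = [
    lambda row: row["value"] < 3,
    lambda row: row["value"] >= 8,
]

indices = selection_index(rows, predicates)

# A discrete colorscale: entry 0 is the unselected color.
colorscale = ["lightgray", "red", "blue"]
point_colors = [colorscale[i] for i in indices]
```

An element that draws everything in one trace, such as a 3D scatter, could then color its points from this array instead of stacking one overlay per selection.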

@jbednar @jlstevens @philippjfr

jbednar commented 5 years ago

This all sounds fabulous! My own preference/suggestion is to support selecting isolated chunks of Curve/Area/Spread and to support selecting atomic Path/Path3D/Polygon items (not chunks of them), but I think you should be the one to make the call for how continuous elements are handled, once we've weighed in.

It would also be great to get support for Plotly Parallel Coordinate and Categories plots in HoloViews, with or without a corresponding (less capable) Bokeh or Matplotlib version.

jonmmease commented 5 years ago

Ok, I added some more detail about the elements that are not in the "initially supported" list.

jbednar commented 5 years ago

Thanks!

> Can we find a useful way to color subsets so that you can perceive continuous intensity and categorical color independently? If we turn each selection color into a light to dark colorscale then this would probably make sense.

Often this is just alpha; Bokeh offers muted_alpha for this purpose. I'm not sure if that's compatible with the approach here; with alpha you need the original plot to be changed, not the selected one, because the original plot is normally at full opacity already. Obviously there are plenty of cases where alpha is already being used or would give results that are ambiguous given the page background, but it seems like in general alpha could show a selection on an image in enough cases that it could be a default behavior.

philippjfr commented 4 years ago

Thanks for all your work on this. I believe this vision has now been realized. Any additional fixes, ideas and suggestions should be in new issues.


holoviz / holoviews