plotly / plotly.js

Open-source JavaScript charting library behind Plotly and Dash
https://plotly.com/javascript/
MIT License

[WIP] Crossfilter discussion #1316

Closed monfera closed 5 years ago

monfera commented 7 years ago

Reactive, crossfiltered data visualization

Plotly originally focused on generating visualizations, and interactivity has increased over time. By now, Plotly has acquired rich layout, style and data update facilities, even animations. Data transformations such as declarative input grouping and filtering have also been added.

As there are growing expectations for fluid, efficient, yet still declarative interactions such as crossfiltering, we are starting a discussion with the purpose of shaping an API in line with Plotly conventions, current practices and future expectations.

Crossfilters are behaviors that let the user subset a multivariate dataset via direct manipulation across multiple views on that dataset. It is also known as linked brushing or linked filtering. The set of views included in one crossfilter is called coordinated views by the crossfilter.js doc, or sometimes linked views. There is no clear-cut boundary for the functional scope and features of crossfilters.

The archetypal crossfilter example by Mike Bostock, author of crossfilter.js, showing multidimensional filtering and aggregation on a quarter million records, also updating a sample table: [image: crossfilterjs]

This text is just to get the ball rolling. There is prior art surrounding the Plotly toolchain and its dependencies such as D3. Since these tools are in active use and well documented, this description doesn't detail them, except to list them and highlight some of their properties.

Also, there is a fantastic discussion on the topic by Carson Sievert, including many of the crossfilter concepts. Due to the richness of that material, this writeup can be a bit sparser on the crossfilter behaviors and more detailed on implementation concerns.

It's still useful to start with one way of thinking about interactivity, as crossfiltering is a particular instance of it. Also, crossfilter cores such as crossfilter.js usually peer-depend on change propagation or reactivity. Section 1 may be skipped to jump directly into the crossfilter-specific part.

1. Reactivity

What runtime changes may occur to a visualization?

Not all types of visualizations require sophisticated updates. For example, a command-line tool such as the typical use of ggplot2 is technically a single-step execution, even if the dataviz maker repeatedly invokes it with various projections, aesthetics and data. These are common things that need data flow:

Browser standards may cover some of the above items. For example, a CSS media query might provide print layout; the SVG `<title>` tag provides basic tooltip hover; CSS supports transitions and animations for HTML and DOM elements. Often, these have limitations: CSS transitions and animations do not work for Canvas and WebGL (and in IE11, even SVG is poorly supported); the tooltip is very basic; and sometimes browsers have bugs, making CSS-based layout changes hard or impossible (for example, non-scaling-stroke is buggy in some browser versions, and CSS translations can run into numerical issues).

Therefore, while following the standards is important for accessibility and progressive enhancement, they do not in general substitute for JavaScript execution for dataviz recalculation and rerender.

Why do runtime changes need some data flow concept?

Various terms exist for the need for a data flow concept. Perhaps the most often used term is "reactivity", not to be confused with react, a library that solves some rendering aspects of a reactive UI. The term "responsive" is sometimes used, although it's often meant in a regrettably limiting sense, such as redrawing on a window resize. There are also technical names such as streams and observables. Below, we'll stick to the generic term "data flow". Related concepts like promises, the publish/subscribe pattern and the observer pattern all try to solve some aspect of the data flow problem.

Some visualizations may not really need one

For example, a very simple D3 or react based visualization may just rely on these respective libraries for the initial rendering and update (rerendering). Both D3 and react have been designed to allow idempotent rendering, such that the user may have a simple concept of 'data in, view out' - and these libraries handle the rest. Even in this case, there's some data flow concept, hidden beneath the library but expressed through the API. In the case of D3 there are selections, data binding and the General Update Pattern, involving most DOM-specific API calls such as selection.data().enter(), selection.attr(), selection.transition(). D3 also provides common interactions such as brushing and dragging, as well as simple event dispatch and HTTP request handling. In react, the basic idea is that a pure function maps some data object to a DOM fragment; its underlying mechanism is DOM diffing via the virtual DOM, and it provides methods for component lifecycle events such as insertion or removal of a node.

In any case, D3 views are often embedded in some framework that provides data flow functions, and react, or lighter-weight alternatives such as inferno and preact, is often accompanied by data-centric tools such as MobX or redux.

Also, some use cases simply involve a one-off rendering, for example, outputting a static visualization, with no or basic interactivity features.

Some visualizations do need a data flow concept

A lot can be done just by using the simplest approach with D3 or react, so why go further?

A reminder is that Web standards are often quite limited (browser version limitations, IE feature lagging, no Canvas/WebGL animation support via CSS, more complex dataviz, see above).

One reason is declarative, denotational semantics: letting users specify what the visualizations and interactions should result in, rather than how the desired effects are achieved (an operational notion, an implementation detail).

Some of the larger, more complex, ambitious data visualization libraries such as Plotly and Vega/Vega-Lite strive to be declarative, letting users tell what the dataviz should be - and this principle has merit even as an implementation concept. Current research is going into making not only visual output but also interactions declarative, which is sensible given how integral interactivity has become to data visualizations.

When a visualization gets complex, working with data flow declaratively helps developer understanding and system overview. Even the most basic view, a single line or area plot, involves many calculations that are best described as relations in a directed graph (annotations added to a vanilla Apple Numbers template): [image]

For another simple example, consider

Relationships get much more complex if there are lots of lines, projections and transitions. For example, an exploratory tool may allow the replacement of one axis with another, or even the transition from one plot type (e.g. scatterplot) to another (e.g. beehive plot). Then there may be animations, filter, pan, zoom, small multiple or trellised views, multipanel views and dashboards with diverse sets of visuals on them. Being declarative in the implementation means that new time-varying or reactive behaviors may be easier to compose from existing ones, reuse is easier (pure functions), and testing is easier as mocks aren't needed.

Another reason is efficiency, an operational concern which is important for fluidity and thus good user experience. An idealized computer would be able to calculate with infinite speed and no impact on battery life, and we'd have a way of just recalculating everything from direct inputs and the user's interaction history. This is in fact a bit like the model for the most basic react or D3 use, as well as a main concept of elm and redux time travel, and it works fine for a lot of use cases (we'll consider it a data flow model and come back to its pros and cons later).

But computers are not infinitely fast, so there is a host of reasons for why it's not sufficient in general:

In short, a basic reason for thinking about the data flow is that we want fluid user experience in a world of asynchronous actions, limited CPU and battery power. Janky interactions or avoidance of fluid interactions altogether underutilizes the computer medium and is a competitive disadvantage.

A simple example (follow link for writeup) of granular, incremental recalculations reflecting ongoing configuration of a live, real-time-updated view, e.g. changing bandline quantiles for outlier-vs-not shading: [image: fluid3] We also expect that morphing from one visual representation (projections, channels, aesthetics) to another will become more common, for dashboard building via direct manipulation as well as exploratory analysis. An early Plotly concept morphs from a parcoords panel to a scatterplot, preserving filtering: [image: parcoords-to-scatter]

Couldn't we solve the problem without some data flow concept? (informal data flow)

We'll categorize such solutions as data flow concepts :-) But here they go anyway:

  1. Function application memoization (caching). Functions that have a big impact in the profiler get cached, so the next time around, invocation is a simple lookup. The benefit is referential transparency and therefore easy testability: the workings are fully testable by supplying some input and making assertions about the output, and results don't depend on hidden state. Basic functional programming, with or without caching, is a kind of data flow concept, as data values are transformed by a directed acyclic graph (DAG) of data transformation functions. The main problem is a high risk of memory leaks, especially as current JS is hostile to solving them (no weak references; no explicit GC trigger; no object finalization; no tail-call optimization; ES5 only supports string-based maps/hashtables and ES2015 Maps are still slower, etc.)
  2. Incremental update. Some state object gets incrementally updated on each new piece of input, for example, newState.min = Math.min(previousState.min, input.newPrice). It's the redux model. It's great for single-layer, relatively simple actions, but isn't as suitable for the type of deeply cascading changes that characterize data visualization.
  3. Lean on D3 data binding. Data binding, especially with keyed selection.data() functions and carefully tailored enter-vs-update discrimination, is a powerful technique for SVG visualizations. For example, it's possible to enhance an initially raw dataset with expensive aggregate statistics, and run a recalculation only if needed (e.g. a new point is added), which requires that the key function incorporate the data array length or some surrogate (hash etc.). Limitations: large DOM trees may be slow; the design is more convoluted, rigid and less component oriented; data needs to be naturally hierarchical, or otherwise crosslinks are needed; and bugs are easily introduced when a recalculation isn't done though it should be, or the other way around. Canvas support is doable but somewhat convoluted.
  4. Lean on react lifecycle methods. The lifecycle methods make it possible to compute things just once. But model calculations are an anti-pattern in react; even the presence of lifecycle methods removes quite a bit from the react philosophy; and the issues mentioned for D3 above also apply.
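A minimal sketch of approach 1 in plain JavaScript (the names are illustrative, not from any particular library); the unbounded cache `Map` also illustrates the leak risk noted above:

```javascript
// Memoized wrapper around a pure function: results are keyed by a
// serialization of the arguments. The cache Map grows without bound,
// which is exactly the memory-leak hazard described in the text.
function memoize(fn) {
  const cache = new Map();
  return (...args) => {
    const key = JSON.stringify(args);
    if (!cache.has(key)) cache.set(key, fn(...args));
    return cache.get(key);
  };
}

// Example: an "expensive" aggregate over a data array.
let calls = 0;
const mean = memoize(xs => {
  calls += 1;
  return xs.reduce((a, b) => a + b, 0) / xs.length;
});

mean([1, 2, 3]); // computed; calls === 1
mean([1, 2, 3]); // cache hit; calls is still 1
```

Because `mean` is referentially transparent, it can be tested purely by asserting on inputs and outputs, as the text notes.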

Again, these approaches work, and can be very compact and natural to use, but they don't scale well to complex visualizations. Now on to some alternatives that are often used for larger projects:

  1. Manual update processing. It's common for a tool to start its life with a set of expectations such as single-pass rendering and then take on more and more dynamic functionality. Initially, there are some objects onto which the results of expensive or repeated calculations are hung. Then, upon adding update functionality, there are methods that take new input and some of the previous state and update various object properties. Usually, there are some means for change propagation, e.g. via the observer pattern or the pub/sub pattern (discussed separately). The drawback is that it's, in essence, a manual, informally specified way of doing caching, which is prone not only to overeager recalculations or, worse, missed ones (stale props), but also makes things hard to test, because state is a leaky abstraction, and once the state can be altered by other units and methods, there's a combinatorial explosion of what might go wrong. Adding a new feature or refactoring requires knowledge of much implementation detail; missing it may result in broken things even if test suites pass. Refactoring can also incur friction if the test suite boundaries relate to such state (implementation detail) rather than effect (e.g. resulting DOM contents or, better, visible output). The appeal of manual update processing is that the code looks traceable: there are no magic mechanisms that need to be learned, just plain JavaScript everywhere. It's easy to debug at a micro level, by putting in a console.log or a debugger statement. In contrast, sophisticated approaches require a good amount of learning and debugging practice (non-trivial costs).
  2. Model-View-Controller or derivative patterns (MVVM etc.). Though this sounds more authoritative than manual update processing, most problems are shared, even if some disciplines allude to a formal approach. Also, in data visualization, separating model from view or view from controller is not trivial. In the case of MVVM, the separation of model and viewModel is also a bit arbitrary. MV* also typically uses some data binding pattern, e.g. observer or pub/sub. There's also competition among the MV* zoo, and the definitions aren't clear enough to firmly know which is which.
  3. Pub/sub and observer patterns. A lot has been written about the disillusionment caused by these patterns. Most of these are centered around the fact that while components become appealingly decoupled at the source code level, they turn out to be semantically coupled in all sorts of ways.

The above three approaches have the common problem that they can lead to overreliance on tribal knowledge. There are no hard-and-fast rules or protocols for these approaches; they're grown organically (manual update processing) or are vague guidelines that leave the details up to debate and an endless stream of 'best practices' books. Often, the data flow code (if this separate aspect is kept as separate code) is developed in-house and lacks proper documentation.

I think the lack of reusability comes in object-oriented languages, not in functional languages. Because the problem with object-oriented languages is they've got all this implicit environment that they carry around with them. You wanted a banana but what you got was a gorilla holding the banana and the entire jungle.

Joe Armstrong

  1. Using a comprehensive framework such as Angular. With Angular 1, there is two-way data binding, and with Angular 2, RxJS is incorporated (discussed separately, as it's an established library in its own right). Both Angular versions are rather large, opinionated frameworks with idiosyncrasies, and neither is particularly efficient for dataviz. It's unclear which of the two Angulars will be more popular. Since react is not a comprehensive framework, its complementary (independent) data flow tools are discussed in their own right.

Data flow tool categories

The below list includes a few specific libraries, not meant to imply that Plotly should follow or use any of these specifically.

A. Object-centered approaches

Usually, operations are done to objects via method calls, and methods achieve effects by altering various objects. It is hard to establish causal links: during debugging, one often can't get to a root cause just by traversing up the call stack, since the failing calculation likely fails because some of its input object properties are wrong, but those properties were not set in a frame currently on the stack, rather in some different, unknown stack that preceded the current execution. In addition to familiarity with the API, a lot of implementation detail needs to be known by a contributor. Data is often exposed on objects, which commits the solution to particular representation structures, an operational rather than declarative concept. The flow of the data is implicit in the code and hard to form a mental image of.

  1. No formal approach: the code reflects a gradual evolution from an original code base that didn't stress interactive features to a code base that's expected to respond fluidly
  2. MVC pattern
  3. Plotly relayout / restyle - idempotent plot update

B. Special-purpose data flow tool: low-level, idempotent, data-driven renderers

Some view generators have built-in data propagation patterns, such as data binding, which are fairly powerful, yet not quite appropriate for complex functions such as a crossfilter. Also, these tools themselves don't scale well even to a moderate number of DOM elements when executing as frequently as the animation frame (60 FPS).

  1. Leaning on D3 data binding and frequent, on-event rerendering for dashboard-level data flow
  2. react component tree; often, lifecycle methods and stateful components
  3. react alternatives with smaller scope and minimal footprint (inferno, preact, react-lite...)
  4. regl, inspired by react, transforms specifications to efficiently generated and executed WebGL API calls (Plotly parcoords already uses regl.)

C. Special-purpose data flow tool: pipes

These tools usually facilitate one-off execution of a sequence of data transformations, sometimes including side-effecting processing steps or terminal nodes. Due to their one-off nature, they're often built to handle explicit (e.g. command line) execution, or individual input events, synchronously or via promises. The archetype is the unix pipe. Usually, branching is beyond the scope or very limited, so pipes are not as natural for handling diverse inputs that factor into various points in the series of transformations, or intermediary transformations that take data from and/or feed into multiple other transformations.

  1. Unix pipes
  2. magrittr in R (often with dplyr)
  3. Fluent-style method chaining in various libraries (d3, d3fc, RxJS)
  4. Promises
  5. Ramda.js compose / pipe
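As a sketch of the pipe idea in plain JavaScript (a hand-rolled `pipe` in the spirit of Ramda's, rather than any specific library):

```javascript
// pipe composes unary functions left to right: the JS analogue of a
// unix pipeline or magrittr's %>%. Names are illustrative.
const pipe = (...fns) => x => fns.reduce((acc, f) => f(acc), x);

const summarize = pipe(
  xs => xs.filter(p => p > 2),         // filter step
  xs => xs.map(p => p * 2),            // transform step
  xs => xs.reduce((a, b) => a + b, 0)  // aggregate (terminal) step
);

summarize([3, 9, 1, 7]); // 38
```

Note the linearity: each step has exactly one upstream and one downstream, which is precisely the branching limitation described above.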

D. Special-purpose data flow tool: crossfilters

Crossfilters usually want to efficiently and scalably solve the problem of multidimensional selection of individual data points, for fast querying of the resulting sets or their aggregations. They typically process filter range changes, and even new data points, incrementally. Usually, processing is done with reducer functions, efficient if the incremental change is of limited frequency, but not as efficient when the changes are big enough to warrant a tight, cache-aware numerical processing loop. They often do not aim to provide a mechanism for notification, whether related to their input (new data or interactions) or output (downstream changes to the resulting itemized and aggregate queries), so a crossfilter, on its own, isn't sufficient for crossfiltering; it needs to be embedded in a more general data propagation mechanism. Internally, crossfilters use interesting implementations for efficiently updating query sets, and are rather stateful so as to save computational costs when handling incremental changes with low latency.

  1. JS based:
    1. crossfilter.js
    2. vega-crossfilter
    3. scijs/cwise based (idea: turn reducer functions into an efficient loop body)
  2. WebGL based
    1. plotly vertex shader based mini-crossfilter as in the new Plotly parcoords
    2. regl-cwise based (idea: turn reducer functions into shader code and hierarchical aggregations)
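The add/remove reducer idea common to these implementations can be sketched in plain JavaScript; this is an illustrative toy, not the internals of crossfilter.js or the others:

```javascript
// Incremental group sum maintained by add/remove reducers: when a
// record enters the current filtered set the add reducer runs, and
// when it leaves the remove reducer runs, so the aggregate updates in
// O(changed records) rather than being recomputed from scratch.
function incrementalSum() {
  let total = 0;
  return {
    add: v => { total += v; },     // record entered the filtered set
    remove: v => { total -= v; },  // record left the filtered set
    value: () => total
  };
}

const sum = incrementalSum();
[4, 10, 6].forEach(sum.add);  // initial filtered set sums to 20
sum.remove(10);               // filter narrowed: only 4 and 6 remain
sum.add(5);                   // a new record entered the range
sum.value(); // 15
```

This is also why reducers lose to a tight loop for large changes: removing half the records one by one costs more than re-summing the survivors in a single cache-friendly pass.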

E. General data flow tool categories

These can be thought of as a spreadsheet, in that the developer doesn't state how a sum is calculated and updated: whenever some input changes, it propagates downstream in the directed acyclic graph that is the data flow structure. Proper FRP, a term coined by Conal Elliott, has a rigorous foundation, so we call the JS libraries FRP-inspired, as they center around operational concerns such as a data propagation graph, event emission, backpressure etc. While sound in principle, many of these libraries make it hard to debug userland code, because the stack is usually deep, verbose and nondescript, and even with blackboxing, it's hard to see what initial change cascaded down to the current stack, and what transformations took place. MobX puts more emphasis on letting the coder understand cause and effect relationships in the debugger.

  1. Object magic based
    1. MobX
  2. FRP inspired libraries
    1. RxJS
    2. Bacon
    3. Kefir
    4. Flyd
    5. most.js
    6. xstream
  3. Real FRP libraries. Motives and properties recapped here. No libraries are listed, as none currently exists for JS.
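The spreadsheet analogy can be sketched with a toy cell/derived-value pair in plain JavaScript (illustrative only; real libraries in this category add scheduling, glitch avoidance and backpressure):

```javascript
// A minimal spreadsheet-like cell: when an input changes, dependent
// values recompute automatically, propagating downstream in the DAG.
function cell(value) {
  const deps = [];
  return {
    get: () => value,
    set: v => { value = v; deps.forEach(d => d.update()); },
    subscribe: d => deps.push(d)
  };
}

function derived(inputs, fn) {
  const self = {
    value: undefined,
    update: () => { self.value = fn(...inputs.map(c => c.get())); },
    get: () => self.value
  };
  inputs.forEach(c => c.subscribe(self));
  self.update();
  return self;
}

const a = cell(1), b = cell(2);
const total = derived([a, b], (x, y) => x + y); // total.get() === 3
a.set(10);                                      // propagates: total.get() === 12
```

The developer declares *what* `total` is, never *when* to recompute it, which is the declarative property this category is built around.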

F. Reducer based

Redux is a predictable state container, a reducer based library. It handles singular changes, called actions, elegantly and in a functionally pure way, responsible for the predictability part. Each action is mapped into a transform of a (current) state to a next state; the state object itself is modeled as a large, inert JSON-like object, whose hierarchical structure can represent inputs or derived data. Since redux handles direct actions and doesn't in itself handle the rippling effects of such actions, it's combined with change propagation means for deeper dependency graphs.

  1. redux only
  2. redux-saga
  3. redux-observable
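A minimal sketch of the reducer model described above, reusing the running-minimum example from earlier (plain JavaScript, not redux itself, though `reducer` has the same `(state, action) => nextState` shape):

```javascript
// A redux-style reducer: pure function from (state, action) to the
// next state; the state is a plain, inert JSON-like object.
const initial = { min: Infinity, count: 0 };

function reducer(state, action) {
  switch (action.type) {
    case 'NEW_PRICE':
      return {
        min: Math.min(state.min, action.price),
        count: state.count + 1
      };
    default:
      return state;
  }
}

const actions = [
  { type: 'NEW_PRICE', price: 7 },
  { type: 'NEW_PRICE', price: 3 },
  { type: 'NEW_PRICE', price: 9 }
];
const state = actions.reduce(reducer, initial); // { min: 3, count: 3 }
```

Purity gives the predictability: replaying the same action log always yields the same state, but nothing here propagates the change onward, which is why redux is combined with the change propagation tools listed above for deeper dependency graphs.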

G. View and logic together

These tools bind some data propagation concept / tool with a view rendering mechanism such as DOM updates. They can be made to work on Canvas/WebGL, though in this case the benefit of being cycle-oriented is somewhat underutilized.

  1. Elm (transpiles to JS)
  2. Vue.js
  3. Cycle-like
    1. cycle.js
    2. motorcycle
    3. TSERS (few recent commits)
  4. dc.js

2. Crossfiltering

Crossfiltering is a major data visualization interaction type that lets the user slice and subset their data, most often by highlighting a range on an axis or an area on a plot. An archetypal implementation (for me, having used it first) is Bostock's crossfilter.js published in 2012.

Interactivity in data visualization is only limited by creativity and practicality. Yet, there are archetypal interactions that can be easily identified in literature and implementations alike, such as

The latter is often called crossfiltering on a multi-plot view, when the purpose of selecting elements or a range of elements is not primarily to get detailed, itemized info on them, but to control what is shown on the other subplots. This conveys to users the notion that they interact with a single dataset, filterable in any of the interactive subplots, all of which provide a particular view into that single dataset.

Crossfiltering is an important solution for what we can term the big problem of data visualization: the focusing problem. Crossfiltering lets the user start exploratory analysis by viewing the visualization based on the entirety of the data, or a pertinent set (e.g. last 30 days), and then focus on subsets of data, guided by their goals and by patterns in already rendered subsets. It is also usable in explanatory analytics such as interactive journalism or education: the reader or student may gain useful extra information using the same set of views, altering just the set of data in scope, e.g. selecting their city of residence or highlighting an interesting range of distance.

Common crossfiltering facilities - overview

Interactions:

Responses:

Crossfilter implementations

To inform crossfilter API design, it's useful to touch on current, actually available crossfiltering methods. Features are listed so that the common, and perhaps some rare, functions can serve as input to API design. Similarly, current limitations (subject to becoming obsolete) are mentioned not as criticism, but simply to gauge the extent to which each API has needed to cope with planned use cases.

Crossfilter.js

Crossfilter is an in-memory, incremental mapReduce implementation in JS, created by Mike Bostock, who also authored D3.

  1. Have a bag of opaque objects
    • you can add them in bulk
    • you can add further ones later
    • you can't remove them once added - the solution for removal, if needed, is to equate objects with transactions (e.g. instead of adding bank movements, add bank transactions, where a subsequent transaction can invalidate an earlier transaction)
  2. Have some dimensions (attributes or virtual fields on the objects)
  3. Have some aggregations determined by a dimension and add/remove reducers
  4. Can filter on arbitrary dimension
  5. Can get:
    • group aggregates in line with current filters
    • groupAll
    • group element counts
    • group top/bottom K elements
    • group constituent elements are basically top or bottom infinity

Possible gotcha: "a grouping intersects the crossfilter's current filters, except for the associated dimension's filter. Thus, group methods consider only records that satisfy every filter except this dimension's filter. So, if the crossfilter of payments is filtered by type and total, then group by total only observes the filter by type"
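The quoted rule can be worked through in plain JavaScript; the snippet below reimplements the semantics for illustration, it is not crossfilter.js code:

```javascript
// Payments filtered by type and total; per the quoted rule, a group on
// `total` observes every filter EXCEPT the total filter itself.
const payments = [
  { type: 'cash', total: 10 },
  { type: 'tab',  total: 10 },
  { type: 'tab',  total: 90 }
];

const filters = {
  type: p => p.type === 'tab',
  total: p => p.total < 50
};

// Count by total, skipping the filter on the group's own dimension:
function groupByTotal(data) {
  const others = Object.entries(filters)
    .filter(([dim]) => dim !== 'total')
    .map(([, f]) => f);
  const counts = {};
  for (const p of data) {
    if (others.every(f => f(p))) counts[p.total] = (counts[p.total] || 0) + 1;
  }
  return counts;
}

groupByTotal(payments); // { '10': 1, '90': 1 } - the total<50 filter is ignored
```

This behavior exists so that, say, a histogram over `total` still shows the full (type-filtered) distribution, with the brushed range highlighted against it.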

Key features

  • Very small (10 kB uncompressed, 4.4 kB compressed)
  • Very mature and stable
  • Fast for large datasets, e.g. 100k elements, if reducers are fast (though obviously not as fast as array looping)
  • Does one thing, and does it well
  • Small API surface

Limitation of scope

These are inherent either in the focused scope of this component (do one thing well), or in the JS language and runtimes (no weak maps, no object finalization etc.) so they're just observations rather than criticism.

Vega crossfiltering

Vega is an interesting, long-running project run by the Interactive Data Lab; its approaches demonstrate important research, and there's a level of rigorousness and compactness about the concepts. Vega implements a visualization grammar (see also Wilkinson's Grammar of Graphics, ggplot2), a declarative format for creating interactive visualizations.

Vega is based on reactive data flow, and has enabled the creation of crossfiltering, although not in a particularly declarative way. The award-winning research paper describes the addition of declarative graphics interactions.

Vega

Example: https://vega.github.io/vega-editor/?mode=vega&spec=crossfilter

Depends on vega-dataflow and vega-crossfilter.

Vega is a reactive library of broad, general data visualization scope. Uses its own reactive data flow means rather than depending on another lib. 342kB uncompressed.

While Vega supports crossfiltering in that reactive streams causing a crossfilter mechanism can be established, the creation is somewhat intricate, and isn't a concise, high level, declarative API.

Vega-dataflow

Dependency of vega-crossfilter and vega. Streams scalar and composite data.

https://github.com/vega/vega-dataflow

Relatively large, bundle is 88kB uncompressed.

Vega-crossfilter

https://github.com/vega/vega-crossfilter/blob/master/test/crossfilter-test.js

Uses vega-dataflow but doesn't use Bostock's crossfilter.js. Dependency of vega.

Vega-lite

Vega-lite is a translation layer between the Vega-Lite compact, higher level visualization grammar format and the powerful, more verbose Vega visualization grammar format.

As of January 16, 2017, there are no crossfilter or declarative (or any) interactions; declarative interactions are currently in feature branches and slated to arrive soon. If I'm not mistaken, even with declarative interactivity in Vega-Lite, it won't be as simple as identifying dimensions and subplots for a crossfiltering relationship. But at the expense of more verbosity, there will be more flexibility as well, permitting custom and hybrid interactions.

devDepends on vega.

Crosstalk (htmlwidgets)

Crosstalk is a protocol for linked brushing across multiple, possibly heterogeneous htmlwidgets. It uses shared state (SharedData) among the various htmlwidgets. An htmlwidget can be made compatible with crosstalk by following a well-documented protocol.

Limitations (as of writing; evergreen doc):

Bokeh crossfilter

Bokeh has a crossfilter, also referred to as linked brushing, that redraws subplots upon completion of the selection; the rectangular or lassoed area doesn't persist, therefore it cannot be interactively moved. This is a possible way of bypassing stringent latency requirements, and a useful option to consider for an initial Plotly implementation. [image: bokeh] Interestingly, the Bokeh examples seen have no explicit crossfilter specification beyond listing the interaction start buttons box_select and auto_select. According to the text, the only other criterion is that multiple plots use the same dataset (same identity). This has a lot of appeal by virtue of its simplicity, although Plotly, given its numerous connectors, serialized tree representation and granular data structures, probably can't follow this model. Yet, it shows that the API search space should include very terse or implied linking. Short of relying on dataset identity, the closest option would be simply to add a filtergroup attribute to all plots (see below).
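To make the idea concrete, a hypothetical `filtergroup` attribute (not an existing Plotly.js attribute) might link traces like this; the grouping helper is likewise illustrative:

```javascript
// Hypothetical `filtergroup` attribute on traces: any traces sharing a
// group name would be crossfiltered together. Sketch only.
const traces = [
  { type: 'scatter',   x: [1, 2], y: [3, 4], filtergroup: 'sales' },
  { type: 'histogram', x: [1, 2],            filtergroup: 'sales' },
  { type: 'scatter',   x: [9],    y: [9] }   // not linked
];

// Collect the linked sets a crossfilter engine would coordinate:
function linkedGroups(ts) {
  const groups = {};
  ts.forEach((t, i) => {
    if (t.filtergroup) (groups[t.filtergroup] = groups[t.filtergroup] || []).push(i);
  });
  return groups;
}

linkedGroups(traces); // { sales: [0, 1] }
```

The appeal is terseness close to Bokeh's implied linking, while staying compatible with Plotly's attribute-per-trace, serializable-JSON conventions.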

Upshot

Crossfilter API design thoughts for Plotly

Based on the above landscape and some motives below, as well as strong, preexisting Plotly API conventions that have been found useful by a wide base of users, we can start assembling thoughts on possible crossfilter API elements for Plotly.

For simplicity, the term Plotly means Plotly.js here; all the language API bindings and the Plotly Workspace would likely expose the crossfilter specifications to their respective users.

Existing interactive and related features in Plotly.js

Plotly already supports interactivity and data processing features that relate to crossfiltering:

Currently limiting features in Plotly

Understanding prior art

Desired functional features in the Plotly crossfilter

Flexible data subsetting in crossfiltering

Specification for

Diverse selection sets and filtering algebra

For compact, common representation, both enumerated values and contiguous ranges are ideally supported. We may consider

An initial implementation is already useful with one simple, single range based filter per dimension, as done for parcoords.
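Such a single-range-per-dimension filter can be sketched in a few lines of plain JavaScript (names illustrative; this mirrors the parcoords-style brushing model, not Plotly internals):

```javascript
// One [lo, hi] brush per dimension; a row passes if it lies inside
// every active brush. This is the simplest filtering algebra:
// an AND across dimensions, a single contiguous range within each.
function makeFilter() {
  const brushes = {}; // dimension name -> [lo, hi]
  return {
    brush: (dim, lo, hi) => { brushes[dim] = [lo, hi]; },
    clear: dim => { delete brushes[dim]; },
    passes: row => Object.entries(brushes).every(
      ([dim, [lo, hi]]) => row[dim] >= lo && row[dim] <= hi
    )
  };
}

const f = makeFilter();
f.brush('price', 5, 20);
f.brush('qty', 1, 3);
f.passes({ price: 10, qty: 2 }); // true
f.passes({ price: 30, qty: 2 }); // false (outside the price brush)
```

Supporting enumerated value sets or multiple ranges per dimension would generalize the per-dimension predicate without changing the AND-across-dimensions shape.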

Aggregations

Some crossfilters, e.g. R's crosstalk, may only (currently) support crossfiltering over atomic data. It is already useful, since it can yield linked brushing. Going beyond this, most crossfilters support the inclusion of groups or aggregates. Selecting a subset of the scatter points may lead to updated histograms similar to this dc.js example.

In addition to updating aggregates, it is desirable if projection ranges (brushed areas) or glyphs corresponding to aggregates, such as histogram bars or choropleth maps, are themselves subject to selection. For example, highlighting a range of bars on a histogram would highlight the source scatter points, and other aggregates would be updated based on this highlighted set of scatter points (link below does this too, relying on crossfilter.js in dc.js).

This reverse direction requires an explicit bijective relationship between an aggregate plot and the source data, otherwise the corresponding atomic data points can't be identified. I think Plotly doesn't yet handle this aspect, but again, aggregates, and especially the selection of aggregates, need not be part of an initial step. Plotly currently handles a few discrete types of aggregations, such as binning for histograms, so adding an inverse mapping doesn't seem burdensome. More challenging is that users do, or are led to, preaggregate data themselves to make their own aggregations, in effect using Plotly as a dumb, static view with the data processing steps residing outside Plotly. In this case, establishing links is impossible, unless we invent some heavyweight annotation for bijective mapping. Consequently, the Plotly API would need to move more into data handling territory, with datasets, dimensions and aggregation keys as first-class JSON structures; then individual plots or traces may refer to said datasets as their data, and to dimensions in their axes, as opposed to the current practice of supplying data directly to the traces.
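One way to keep that inverse mapping is to record, for each histogram bin, the indices of its source rows, so that selecting bars maps straight back to atomic points. A sketch (illustrative names, not a Plotly API):

```javascript
// Histogram binning that records bin -> source-row indices, so that a
// selection of bars can be inverted to the underlying data points.
function binWithIndex(values, width) {
  const bins = new Map(); // bin start -> { count, indices }
  values.forEach((v, i) => {
    const start = Math.floor(v / width) * width;
    if (!bins.has(start)) bins.set(start, { count: 0, indices: [] });
    const b = bins.get(start);
    b.count += 1;
    b.indices.push(i);
  });
  return bins;
}

const bins = binWithIndex([1, 4, 6, 12], 5);
// Selecting the [5, 10) bar maps back to source row 2 (value 6):
bins.get(5).indices; // [2]
```

The extra bookkeeping is cheap for binning-style aggregations, but it is exactly what's lost when users preaggregate outside the library.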

Many dashboards in the wild display solely aggregates (no items in sight). It's good to consider an API with at least eventual aggregation support in mind.

Some other dashboards, such as an implementation of Stephen Few's student dashboard in d3, feature itemized data selection with updating aggregates, where each item itself is composite, e.g. a student that serves as a foreign key into a per-student attendance time series table. If sorting is present (analogous to using Plotly.restyle with a different order for ordinal ticks), a previously contiguous selection range becomes fragmented (conversely, an ordering-then-brushing facility may avoid the complications of multiple selection sets), yet the aggregation itself doesn't change.

Familiarity

Lots of good work has gone into crossfilters in JS and other languages, via the libraries mentioned above and many not mentioned here. To make things easy for users, our design should recognize established, learned patterns. Since the concepts transfer fairly well across tools, while the actual behaviors, limitations, and the method and granularity of specification are diverse, it's best to follow the concepts in a way that's coherent with Plotly patterns, on the principle of least surprise to the users.

Time series data

It's often the case that crossfiltering is combined with, or applied to, time series data. This poses additional demands because of the number of data points and, especially, the DOM impact involved. Headroom in smooth rerendering performance may be achieved by hybrid charts where the single or few performance-critical layers are rendered with WebGL, e.g. via regl. There are additional use cases with time series data:

Animating filters

It's useful for animations to also work with crossfiltering, so that a filter declared on a single dimension can be animated, with the visual effects showing in all rendered plots that involve the filtered dimension.

Desired non-functional features

Serializability

Low latency

The lower the latency, the better; the ideal is 30-60 FPS. Anything worse than around 10-15 FPS eliminates the illusion of direct manipulation that often underpins crossfiltering, and users must wait for debounced, delayed recalculations, i.e. views go out of sync. It is therefore important to optimize data paths in some systematic manner, or settle for deferred view updates.

Low latency has many elements: efficient filtering code (e.g. crossfilter.js uses heavy bit fiddling; our parcoords crossfilter runs in the vertex shader); avoiding unnecessary recalculation, since a change may be very cheap to compute compared to an initial render; and touching the DOM sparingly, e.g. using d3's selection.data(values, key) to detect changes and relying on the DOM diffing of the General Update Pattern.
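To make the bit-fiddling point concrete, here is an illustrative toy (not the crossfilter.js implementation, nor a Plotly API): each dimension owns one bit in a per-row mask, and a row is visible iff its mask is 0. Refiltering one dimension only touches that dimension's bit, so the other filters need no recomputation.

```javascript
// Sketch of bitmask crossfiltering: one bit per dimension, per row.
function Crossfilter(n) {
  this.masks = new Uint8Array(n); // up to 8 dimensions in this sketch
  this.nextBit = 0;
}
Crossfilter.prototype.dimension = function(accessor) {
  var bit = 1 << this.nextBit++;
  var masks = this.masks;
  return {
    filter: function(pred) {
      for (var i = 0; i < masks.length; i++) {
        if (pred(accessor(i))) masks[i] &= ~bit; // row passes: clear the bit
        else masks[i] |= bit;                    // row fails: set the bit
      }
    }
  };
};
Crossfilter.prototype.visible = function() {
  var out = [];
  for (var i = 0; i < this.masks.length; i++) {
    if (this.masks[i] === 0) out.push(i);
  }
  return out;
};

var x = [1, 5, 9], y = [2, 4, 6];
var cf = new Crossfilter(3);
var dx = cf.dimension(function(i) { return x[i]; });
var dy = cf.dimension(function(i) { return y[i]; });
dx.filter(function(v) { return v > 2; }); // rows 1, 2 pass
dy.filter(function(v) { return v < 5; }); // rows 0, 1 pass
cf.visible(); // -> [1], the only row passing both filters
```

Refiltering `dx` here is a single pass over one typed array, which is what makes the incremental update cheap relative to rebuilding all views.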

Reusing existing Plotly facilities

A lot of existing Plotly facilities may be reused for crossfiltering, such as the existing lasso and rectangular selection tools.

A sample API for simple, atomic crossfiltering

Unlike Bokeh, Plotly can't currently rely on a single, shared data structure from which to deduce a default crossfiltering behavior. Also, the current axis keys (keys of the JSON object) can't serve to indicate dimensional unity, because of their preexisting role in screen-space layout (called domain in Plotly).

But there would be ways for retaining the current Plotly semantics and API, while introducing datasets as first class objects.

Establishing unity of data and dimensions can be done by modeling these as first class entities. It would yield a compact, scalable and high level representation.

What looks like this now, with repeated vectors for disparate plots or traces:

{
    "data": [
        {
            "filtergroup": "cf1",
            "x": [1, 3, 2],
            "y": [4, 5, 6],
            "type": "scatter"
        },
        {
            "filtergroup": "cf1",
            "x": [1, 3, 2],
            "y": [50, 60, 70],
            "xaxis": "x2",
            "yaxis": "y2",
            "type": "scatter"
        }
    ],
    "layout": {...}
}

may be, in order to preserve relations, represented as

{
    "datasets": {
        "iris": {
            "petalwidth": [1, 3, 2], // analogous to x, y vectors but shareable by name
            "sepalwidth": [4, 5, 6],
            "petallength": [50, 60, 70],
            "species": ["setosa", "setosa", "versicolor"]
        }
    },
    "data": [
        {
            "filtergroup": "myCrossfilterGroup1", // multiple crossfilters are possible
            "x": "iris.petalwidth", // just referencing the actual data
            "y": "iris.sepalwidth",
            "mode": "markers",
            "xaxis": "x",
            "yaxis": "y"
        },
        {
            "filtergroup": "myCrossfilterGroup1",
            "x": "iris.petalwidth",
            "y": "iris.petallength",
            "mode": "markers",
            "xaxis": "x2",
            "yaxis": "y2"
        }
    ],
    "layout": {...}
}
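The dereferencing step this spec implies could be as simple as the following sketch; `resolveData` and the `datasets` shape come from the draft above and are assumptions, not an existing Plotly function.

```javascript
// Sketch: resolve "dataset.dimension" string references in traces into
// the shared vectors held by the (proposed) top-level "datasets" object.
function resolveData(fig) {
  return fig.data.map(function(trace) {
    var out = {};
    Object.keys(trace).forEach(function(k) {
      var v = trace[k];
      var m = (typeof v === 'string') && v.match(/^(\w+)\.(\w+)$/);
      if (m && fig.datasets[m[1]] && fig.datasets[m[1]][m[2]]) {
        out[k] = fig.datasets[m[1]][m[2]]; // dereference the shared vector
      } else {
        out[k] = v; // leave ordinary attributes untouched
      }
    });
    return out;
  });
}

var fig = {
  datasets: {iris: {petalwidth: [1, 3, 2], sepalwidth: [4, 5, 6]}},
  data: [{x: 'iris.petalwidth', y: 'iris.sepalwidth', mode: 'markers'}]
};
resolveData(fig)[0].x; // -> [1, 3, 2]
```

Crucially, the resolved traces share the same array instances, which is what preserves the relation between views.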

In addition to retaining data relations, it would have other benefits:

Tools surrounding Plotly.js, such as the Workspace, already have analogous facilities, so it can be considered a natural absorption of useful features into Plotly.js.

API possibilities for grouping

Groups, in general, can be many things: nodes in a normalized relational star schema; values calculated on the fly, such as specific bins; or, in the simplest case, just another dimension (a denormalized representation). There's ample precedent for this last option in Plotly, such as the current transforms/groupBy specification, or the use of ordinal or nominal dimensions (e.g. overplotting points with semitransparent markers).

Therefore, groups might be specified, quite verbosely, as

{
    "datasets": {
        "iris": {
            "dimensions": {
                "petalWidth": [1, 3, 2], // analogous to x, y vectors but shareable by name
                "sepalWidth": [4, 5, 6],
                "petalLength": [50, 60, 70],
                "species": ["setosa", "versicolor", "versicolor"]
            }
        },
        "mySpeciesAggregate": {
            "dimensions": {
                "avgPetalLength": {
                    "sources": ["iris"],
                    "transforms": {
                        "groupBy": [{
                            "key": "iris.species", // or alternatively, a vector in place
                            "value": "petalLength",
                            "aggregates": {
                                "average": "mean" // assuming there's a Plotly-defined set of aggregations like in SQL
                            }
                        }]
                    }
                }
            }
        }
    },
    "data": [
        {
            "filtergroup": "myCrossfilterGroup1", // multiple crossfilters are possible
            "x": "iris.petalWidth", // just referencing the actual data
            "y": "iris.sepalWidth",
            "mode": "markers",
            "xaxis": "x",
            "yaxis": "y"
        },
        {
            "filtergroup": "myCrossfilterGroup1",
            "x": "mySpeciesAggregate.species",
            "y": "mySpeciesAggregate.avgPetalLength",
            "mode": "markers",
            "xaxis": "x2",
            "yaxis": "y2"
        }
    ],
    "layout": {...}
}
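For concreteness, the "groupBy" + "mean" aggregation in the spec above might compute something like the following (a sketch; `groupByMean` is illustrative, not a Plotly function, and key/value names mirror the draft JSON):

```javascript
// Sketch of a grouped mean: keys and values are parallel vectors, as in
// the "key"/"value" fields of the proposed groupBy transform.
function groupByMean(keys, values) {
  var sums = {}, counts = {};
  keys.forEach(function(k, i) {
    sums[k] = (sums[k] || 0) + values[i];
    counts[k] = (counts[k] || 0) + 1;
  });
  return Object.keys(sums).map(function(k) {
    return {key: k, mean: sums[k] / counts[k]};
  });
}

groupByMean(['setosa', 'versicolor', 'versicolor'], [50, 60, 70]);
// -> [{key: 'setosa', mean: 50}, {key: 'versicolor', mean: 65}]
```

Note that keeping `sums` and `counts` (rather than just the means) is what would allow incremental, reducer-style updates when the filtered set changes.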

Recognizing that this is a lot of text for mundane aggregations, presumably drawn from a list of Plotly-implemented aggregation functions, the notation should either be made much briefer, or the reward should be a lot more power, for example a plugin mechanism for custom, programmed aggregator and filter components, even if the API doesn't go as far as Vega, which encourages infix and functional algebraic expressions represented as strings.

An example for the former option - much briefer notation - could be a simple reference to the aggregation in the view:

{
    "datasets": {
        "iris": {
            "petalwidth": [1, 3, 2], // analogous to x, y vectors but shareable by name
            "sepalwidth": [4, 5, 6],
            "petallength": [50, 60, 70],
            "species": ["setosa", "setosa", "versicolor"]
        }
    },
    "data": [
        {
            "filtergroup": "myCrossfilterGroup1", // multiple crossfilters are possible
            "x": "iris.petalwidth", // just referencing the actual data
            "y": "iris.sepalwidth",
            "mode": "markers",
            "xaxis": "x",
            "yaxis": "y"
        },
        {
            "filtergroup": "myCrossfilterGroup1",
            "x": "iris.species",
            "y": "iris.petallength",
            "aggregation": "mean",
            "mode": "markers",
            "xaxis": "x2",
            "yaxis": "y2"
        }
    ],
    "layout": {...}
}

API for filter state; other API elements

Compared to representing data relations such as shared data and aggregations, the problem of representing and serializing filter states is quite trivial. It just falls into place once these larger problems are resolved. The crossfilter.js API doc contains sensible options, such as using [from, to] filter domains, or [elem1, elem2, ...] enumerations for specifying filter state. Inspired by this, Plotly may add

filtersets: [0, 2, 3, [7, 11], 15, [17, 20]]

though some questions remain, such as whether the ranges, denoted with arrays, are right-open or right-closed.
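Whichever way the open/closed question is decided, evaluating the notation is straightforward. The sketch below assumes closed ranges; `inFilterset` is a hypothetical helper, not part of any existing API.

```javascript
// Sketch: membership test for the mixed scalar/range "filtersets" notation,
// treating two-element arrays as closed [from, to] ranges.
function inFilterset(filterset, v) {
  return filterset.some(function(f) {
    return Array.isArray(f) ? (v >= f[0] && v <= f[1]) : v === f;
  });
}

var fs = [0, 2, 3, [7, 11], 15, [17, 20]];
inFilterset(fs, 9);  // true (inside [7, 11])
inFilterset(fs, 12); // false
```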

An alternative is to use the relations similar to the current filter transforms, building up the filtered set more verbosely but perhaps giving more flexibility:

transforms: [
    {
        type: 'filter',
        operation: '>',
        value: 0
    },
    {
        type: 'filter',
        operation: '<',
        value: 100
    }]

though there'll need to be more algebra such as specifying unions and intersections.
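That algebra might be sketched as predicate combinators over the compiled transforms. This is an illustration of the idea only, not the existing filter-transform implementation:

```javascript
// Sketch: compile filter transforms into predicates, with explicit
// union/intersection combinators supplying the missing set algebra.
var ops = {
  '>': function(t) { return function(v) { return v > t; }; },
  '<': function(t) { return function(v) { return v < t; }; }
};
function compile(transforms) {
  var preds = transforms.map(function(t) { return ops[t.operation](t.value); });
  return {
    intersection: function(v) { return preds.every(function(p) { return p(v); }); },
    union: function(v) { return preds.some(function(p) { return p(v); }); }
  };
}

var f = compile([
  {type: 'filter', operation: '>', value: 0},
  {type: 'filter', operation: '<', value: 100}
]);
f.intersection(50);  // true: 0 < 50 < 100
f.intersection(150); // false
```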

Draft conclusions

Adding crossfilter to Plotly is sought after, given the current level of interactivity, the user expectations toward data exploration and Plotly facilities, support for heterogeneous subplots/dashboards, and upcoming plots that need to use crossfiltering such as parcoords, small multiple charts, SPLOM and trellised plots.

A Plotly crossfilter would benefit from (depend on) concurrently introduced new concepts, such as

  1. a dataset concept, to retain relationship among disparate views to the same data, currently represented redundantly and with loss of association - this seems hard to avoid or work around
  2. declarative, bijective aggregations, even if they come from a list of predefined Plotly aggregations based on the most common usage (mean, median, IQR, count, domain, variance, bins, ...) - assuming aggregations are to be supported
  3. a data flow concept, for clarity, and to minimize unnecessary recalculations and rerendering - a useful first version may not need it, but a state of the art version likely would

As this list contains elements on which existing libraries have iterated for years - such as LINQ.js for specifying aggregates and other derived queries, not to mention host language features and common libraries in R, Python etc., such as the very compact dplyr API - the question is where the boundaries should be drawn: whether the API of an existing tool should be adopted, or whether it's possible to postpone the introduction of such concepts altogether.

Also, the listed changes may require some refactoring and API change (or addition) such as

micahstubbs commented 7 years ago

👍 a fantastic tour of the JavaScript data flow landscape, along with some very useful answers to the "Why use this extra abstraction at all?" question.

curran commented 7 years ago

I feel like the core of all this data flow goodness is topological sorting.

If only there were a way to declaratively specify functional reactive dependencies, and have a topological-sorting-based engine evaluate the parts of the dependency graph that change over time (max 60 FPS), this may solve the issue of arbitrary crossfilter interconnects between visualization components.

MobX is one nice implementation of the above concept.

The Redux pattern does not use topological sort. If you look deeply into solving the problem using RxJS, Bacon, or other FRP libraries, you'll find that the constructs they implement for this (usually called "when") also do not use topological sorting, so they fail on certain cases that come up in complex visualizations.

image The minimal data flow case not supported out-of-the-box by Redux or most FRP implementations. The value of "e" will be set twice, the first time with an inconsistent state.
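The glitch-free alternative can be sketched in a few lines: topologically sort the dependency graph (dependencies before dependents), then evaluate each node once, so "e" always sees consistent values of "b" and "c". This is an illustrative toy, not any particular library's API.

```javascript
// Sketch: evaluate a dependency graph in topological order.
// graph: {node: {deps: [...], fn: function(...depValues) {...}}}
function evaluate(graph) {
  var order = [], visited = {};
  function visit(n) {
    if (visited[n]) return;
    visited[n] = true;
    graph[n].deps.forEach(visit);
    order.push(n); // post-order: a node comes after all its dependencies
  }
  Object.keys(graph).forEach(visit);
  var values = {};
  order.forEach(function(n) {
    var args = graph[n].deps.map(function(d) { return values[d]; });
    values[n] = graph[n].fn.apply(null, args);
  });
  return values;
}

// The diamond from the figure: a -> b, a -> c, {b, c} -> e.
var values = evaluate({
  a: {deps: [], fn: function() { return 1; }},
  b: {deps: ['a'], fn: function(a) { return a + 1; }},
  c: {deps: ['a'], fn: function(a) { return a * 10; }},
  e: {deps: ['b', 'c'], fn: function(b, c) { return b + c; }}
});
// values.e === 12, computed exactly once, with b and c both up to date
```

A real engine would additionally track dirty nodes so only the affected subgraph is re-evaluated, but the ordering is the part that prevents the double-set of "e".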

A good place to start implementing the data flow concept in visualizations might be the D3 margin convention, responding to resize.

image Data flow diagram for the margin convention.

I have done some work in this area, though the libraries never saw wide usage:

I'm not sure why these never took off. They solve the core problems, but something must be hindering adoption. Perhaps the abstractions introduced are too heavy to add as a dependency. Perhaps it's because (unlike MobX) these libraries were not designed from the get-go to integrate with a strong component model like React.

In any case, thank you @monfera for your thoughts here. It's an interesting read for sure. I'm interested in solving the same problems - Grammar of Graphics + interactions + crossfilters + data flows.

curran commented 7 years ago

Here are some examples that relate to concepts discussed here, possibly interesting to look over:

curran commented 7 years ago

Also you mention dplyr - there is a nice JavaScript library inspired by dplyr called Plywood.

Datalib also has nice implementations for aggregation.

monfera commented 7 years ago

@curran thanks for your note and links. Yes, topological sorting is useful for synchronous updates. flyd purports to use topological sorting; kefir and most.js have also worked out fine in practice, even with quite involved use cases such as kinetic scrolling with all kinds of real-time data and configurability updates. While synchronous operations give better guarantees and throughput, certain things need to be asynchronous, e.g. incremental rendering, timed events such as transitions and animations, or the current promise-returning Plotly API calls.

A note is that we're investigating various things and there is no rushed decision toward anything (in-house, or FRP inspired, or even whether it'll be different from the current patterns) as plotly.js is an established, larger library that covers a lot of ground.

Since you mention your tools haven't quite taken off, some plausible reasons, unrelated to their technical qualities:

So it might be that one-off projects choose from one of the established libraries, and larger libraries (vega, highland.js) roll their own take that are tailored to their needs and patterns.

monfera commented 7 years ago

@curran wow you've got a fantastic paper on this! Very illustrative:

image

cpsievert commented 7 years ago

Thanks for the rich discussion @monfera, very informative!

Let me clarify/expand on some points related to crosstalk and the linked-views infrastructure within plotly (the R package). At its core, crosstalk's JS library just provides a "standard" way to set/get values and emit events when those values change (it's a bit like flyd). Crosstalk itself makes no assumptions about the Data/Model -> View update logic -- it's on the htmlwidget author to implement that part. For that reason, it doesn't really impose the limitations you mention -- in fact, plotly already has aggregates, various ways to trigger selections (i.e., more than just brushing), and other fancy stuff like persistent selection with a dynamic color palette, or even "nested" selections -- all built on the crosstalk model. However, it is true that these more advanced examples (beyond what I call transient 1-to-1 linking) are only guaranteed to work when linking 2+ plotly graphs. I have submitted a PR to leaflet to support persistent selection, for instance, but it's not clear whether we'll ever converge on a common set of "selection options" that every crosstalk-enabled widget should support.

For some more context, this diagram lays out the general idea of how plotly & leaflet are linked via crosstalk (which enables R users to do stuff like this without any knowledge of plotly.js or web technologies):

crosstalk

As for the actual implementation on the plotly end, a long time ago I decided that the update logic in plotly should favor abstraction over speed (i.e., the Data/Model -> View logic is handled at quite a high level via restyle()/addTraces()/deleteTraces()). One nice result of this approach is that dynamic aggregation of selections just works, and if plotly.js decides to add more "statistical" trace types (i.e., where some aggregation of the raw data is performed by plotly.js to produce the view), those should also just work.

Of course, in order to implement, I've also had to add my own JSON spec for defining links between traces. Right now, it's based on key/set attributes (plus, attributes I pass along from the highlight() function):

{
  x: [1, 2, ...], 
  y: [3, 4, ...], 
  key: ["a", "b", ...],
  set: "group1"
}

When a desired event is triggered, I subset every trace matching the relevant set (obtained via the event data), then call Plotly.addTraces() with the subsetted data. To link animated views, I also update frames in a similar way.
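As a sketch of that subsetting step (the x/y/key/set attribute names are taken from the spec in this comment; `subsetTrace` itself is hypothetical, not crosstalk or plotly code):

```javascript
// Sketch: given the keys selected in an event, build the subsetted trace
// that would then be handed to Plotly.addTraces().
function subsetTrace(trace, selectedKeys) {
  var keep = trace.key.map(function(k) { return selectedKeys.indexOf(k) !== -1; });
  function pick(arr) { return arr.filter(function(_, i) { return keep[i]; }); }
  return {x: pick(trace.x), y: pick(trace.y), key: pick(trace.key), set: trace.set};
}

var trace = {x: [1, 2, 3], y: [3, 4, 5], key: ['a', 'b', 'c'], set: 'group1'};
subsetTrace(trace, ['a', 'c']);
// -> {x: [1, 3], y: [3, 5], key: ['a', 'c'], set: 'group1'}
```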

This is getting a bit into the weeds, but I should also mention that I support different classes of key arrays. A nested key is actually a 2D array and enables "hierarchical" selection, like in this video; more generally, it provides a way to attach multiple values to a graphical mark (which may be desirable for, say, a bar chart). A simple key basically says "highlight this entire trace" (without searching/subsetting) and "emit" the entire key associated with any selected traces (which makes this sort of linking computationally feasible). For more details about these different key definitions, and why they're useful, see this slide deck.

monfera commented 7 years ago

@cpsievert thanks for your incredibly useful and detailed comments! The way you summarize it is evocative of something like a 'crossfilter protocol' where relationships are established, as usual with plotly, in a declarative manner, imposing relatively few constraints on the implementation, and allowing a good level of interoperability with components that do not originate from plotly.js.

I've already been thinking about using 'duck typing' inside the implementation. As an example, using crossfilter.js, or a similar latency-optimized in-memory database, would be desirable for some applications, but an unnecessary payload increase for others, where data quantity is low, where selections do not occur in a rapid, incremental manner, or where group aggregates do not lend themselves to reducer-based incremental updates. For a first version we may not need to support accelerators like this, as a lot can be achieved with an initial version that adds no sizeable new dependency.

Similarly, some applications would need to use asynchronous operations, e.g. fetching new payload (such as facet data, carto or temporal data, queried from a very large server database) from a server, computing something in a Web Worker, or phasing some dependent widgets in and out of the exploratory or dashboard UI without blocking user interaction. But it's not worth baking in a sizeable dependency (or any) as a lot of use cases just won't need it.

Extending on your thoughts and the general utility of not baking in implementation detail, it looks feasible to stick to established declaration patterns such as the way you already specify links, to help with interoperability and pooling of resources. So to summarize, these general goals appear interesting:

  1. Reuse of existing plotly.js facilities such as grouping, filtering, animation, selection, streaming etc. It's the 'just works' factor you refer to, and also, we can get so much functionality out of the current plotly.js code.
  2. Making interoperability with existing crosslinked components easy, via sharing concepts, possibly JSON structures and eventual automatic generation of wrappers. There's a huge swath of existing components out there that have already been wrapped successfully for crosslinked operation.
  3. Declarative API to not bake in implementation detail that would constrain future iterations
  4. Straightforward, no-nonsense initial implementation
  5. Adding minimally to the bundle size: sizeable dependencies should either be not included, or after careful thought, optionally
  6. Allowing custom components to be included: for example, the user may roll their own analytics aggregation, or their own D3 based or other widgets (the latter is related to goal 2)
  7. Low latency: a naive implementation even with just a few widgets and a small dataset can interact sluggishly, depending on the interaction type; some of our widgets, e.g. the WebGL backed ones (scattergl, parcoords, surface, scattermapbox) support larger datasets, or there may be a high number of (sub)plots on the screen. Freezing UI is also a usability concern. I recognize that some plotly.js components aren't yet optimized for low-latency, non-blocking data updates, but we should make such future optimizations possible, even if said component is part of a crossfilter ensemble.
  8. Potential reuse in future Plotly integrations, e.g. Dashboard or even R/Python bindings.

I'll run some experiments and circle back on this one. Thanks again for your detailed comments!

monfera commented 7 years ago

Awesome community notes from OpenVis talk about reactivity, including the announcement of d3.express by @bostock https://docs.google.com/document/d/14mO4HtAw8ewJwwYkS8gkBLDa5csatLtFai1KKXIU0c0/edit# (ht @curran)

chriddyp commented 7 years ago

@monfera - Another thing to keep in mind is whether crossfiltering could work across divs (across multiple graphs) instead of just in one context through subplots.

In contexts like plotly dashboards, dash (shiny for Python), shiny, and (eventually) our chart maker, it's easier for the user (and/or developer) to build interfaces that involve multiple plots as separate graphs in separate divs instead of a single plotly context with multiple graphs as subplots.

This isn't a hard requirement, just something to keep in mind, and it may make crossfiltering applicable to more applications.

monfera commented 7 years ago

@chriddyp awesome, thanks for bringing up component scope. Indeed, there should be no boundaries such as subplots and <div>s; I've treated this as a requirement without being aware of it.

Users may have outside widgets or views such as a data table, numerical aggregates (e.g. showing a currently filtered total counts/values), a custom SVG map etc. as well as outside controls for filtering, all of which would need to be linked into a spreadsheet-like directed acyclic graph (reactive data flow).

It'd be useful to leave open the integration with other reactive kit, e.g. vega, d3.express, the JS part of crosstalk, as well as making network requests (the entirety of data isn't always practical to preload for various reasons, I'll have an example).

Since Plotly is declarative, it should be possible to set filter/selection state, and request callbacks, analogous to the current events for receiving notification. Looking fwd to the next steps in discussion and prototyping/PoC, some of which is under way.

Probably the chart maker would also benefit from some hooks, because in the past, individual plots/traces got fairly independent data and various plots use disparate data formats containing only the specific data parts (e.g. for a scatterplot, just 2 dimensions out of a possibly multivariate dataset that's also source for a parcoords or another scatter).

In the crossfiltering concept, there is a notion of data as one or more separate 'repositories' from which plots/traces feed, with the mediation of the current filtering etc. control state. IOW the crossfilter would act as a data flow glue among the currently more independent parts.

monfera commented 7 years ago

Not two days after the announcement of d3.express, another major effort is announced, by @mathisonian: https://idyll-lang.github.io/

rreusser commented 7 years ago

(with experimental plotly component: https://idyll-lang.github.io/idyll-loader-component/examples/basic/build/ )

curran commented 7 years ago

This video Vega-Lite: A Grammar of Interactive Graphics by @arvind has a great overview of the interaction and multiple-view techniques available currently in vega-lite.

monfera commented 7 years ago

Linking @mbostock's writeup on d3.express: https://medium.com/@mbostock/a-better-way-to-code-2b1d2876a3a0 - has a hefty section on reactivity (thanks @cpsievert for the tip)

rreusser commented 7 years ago

Though now that I've had a chance to read about it, express is for exploration. idyll is really just aimed at presentation 😄

rflow commented 7 years ago

hi @monfera - i'm late to this (it took a nudge from @micahstubbs) but would love to discuss further!

i'm working on similar stuff for Riffyn and am going to be talking about it at PlotCon. if you're around in Oakland, t'would be great to grab a coffee and swap notes 😉

domoritz commented 7 years ago

As an update to the great summary in this issue description: Vega-Lite now has support for crossfiltering. See for a small example: https://vega.github.io/new-editor/?mode=vega-lite&spec=layered_crossfilter


monfera commented 7 years ago

Evolving this small spreadsheet engine toward use at Plotly

monfera commented 7 years ago

Interesting writeup that's also a jumpboard to academic papers on incremental computation esp. from a rendering viewpoint. It currently focuses on React and D3 and doesn't touch on functional reactive programming or observables. https://blogs.janestreet.com/incrementality-and-the-web/

monfera commented 7 years ago

An interesting topic with crossfiltering is the concept of the filtered set. A basic 'dashboard' (Coordinated Multiple Views) made with crossfilter.js has the concept of one filtered set only, e.g. as seen with Mike Bostock's canonical Flights crossfiltering. In many real life applications, users still want to see the entire dataset (e.g. grey markers) such that the currently retained selection is differentiated via increased salience (see example with magenta markers). Of course it's sometimes useful to show the full set and the crossfilter-retained set as separate facets but channels are not a concern for this comment.

So, there are often two sets, context and focus. Data scientists and analysts often ask for a third, even more modal set, which is a current selection (be it hover, click or multiple click activated).
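A hedged sketch of the resulting three-set classification (names are illustrative, chosen only to match the discussion):

```javascript
// Sketch: classify each row into one of the three sets discussed here,
// e.g. for styling (grey context, salient focus, highlighted selection).
function classify(n, focusIndices, selectionIndices) {
  var focus = {}, sel = {};
  focusIndices.forEach(function(i) { focus[i] = true; });
  selectionIndices.forEach(function(i) { sel[i] = true; });
  var out = [];
  for (var i = 0; i < n; i++) {
    out.push(sel[i] ? 'selection' : focus[i] ? 'focus' : 'context');
  }
  return out;
}

classify(4, [1, 2, 3], [2]);
// -> ['context', 'focus', 'selection', 'focus']
```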

A global filter which would constrain what's referred to above as the context is also common, for example, for limiting data transfer to the browser, or more interestingly, for adhering to whatever (temporal, spatial, other or combined) context is relevant for the user. So in summary:

This paper on Keshif uses three sets (besides discussing other topics):

jackparmer commented 7 years ago

For those interested, the current state-of-the-art of crossfilter and Plotly.js is documented in the open data public health initiative.

There is also a roadmap with planned next steps.

measles-crossfilter

nite commented 6 years ago

I've created a small dash app that implements Bostock's crossfilter.js example (http://square.github.io/crossfilter) here: https://gist.github.com/nite/aff146e2b161c19f6d553dc0a4ce3622 - not quite the same level of real-time, slick UI/UX as the original, but good enough for a PoC. It's currently hosted at https://crossfilter-dash.herokuapp.com; otherwise create a venv, pip install -r requirements.txt and run app.py.

etpinard commented 5 years ago

Closing, as this ticket won't lead to any work in this repo.

We've made plotly.js + crossfilter.js example repo -> https://github.com/plotly/plotly.js-crossfilter.js

Maybe we should transfer (still in beta :smirk: ) this ticket over to https://github.com/plotly/plotly.js-crossfilter.js/issues ?

Please let me know if anyone of you would like to continue this discussion.

nicolaskruchten commented 5 years ago

Such a great read!