jbednar opened 2 years ago
After some discussion, I think we agreed that a `first` and `last` index (per bin) aggregate makes sense, and that a `where` aggregator (e.g. 'show the ship with the highest tonnage value that contributed to this pixel') would be nice too.

The only other important point is that you need to know the count, because that tells you whether the 'first' or 'last' index is unique or just an arbitrary sample of the datapoints that contributed to the pixel value (ignoring the possibility that it is meaningful due to sorting, e.g. by time).
@philippjfr suggests implementing a `where` aggregator that returns some column's value given an aggregator that's applied to some other column, e.g. `where(max('value'), 'index')`. That way a user can define which samples are kept.
I believe this syntax could support an `n` argument, retaining the top `n` values (e.g. the `n` largest `value` datapoints encountered). It will be important to document clearly the conditions under which the result is just `n` arbitrary samples versus the top `n` along a well-defined measure. The default for plotting purposes would probably need to be arbitrary, since counts are plotted by default and counts don't establish any ordering between datapoints. In that case a single datapoint is probably the most reasonable default (one exemplar per pixel), i.e. a default like `where(..., 'index', n=1)`.
Some of this can already be done in Datashader, e.g.

```python
import datashader as ds
import pandas as pd

df = pd.DataFrame(dict(x=[0, 1, 0, 1, 0], y=[0, 0, 1, 1, 0], myindex=[4, 5, 6, 7, 8]))
canvas = ds.Canvas(3, 3)
agg = canvas.line(
    df, "x", "y",
    agg=ds.summary(count=ds.count(), first=ds.first("myindex"), last=ds.last("myindex")),
)
```
which produces
```
<xarray.Dataset>
Dimensions:  (x: 3, y: 3)
Coordinates:
  * x        (x) float64 0.1667 0.5 0.8333
  * y        (y) float64 0.1667 0.5 0.8333
Data variables:
    count    (y, x) uint32 2 1 1 0 2 0 1 1 1
    first    (y, x) float64 4.0 4.0 4.0 nan 5.0 nan 5.0 6.0 6.0
    last     (y, x) float64 7.0 4.0 4.0 nan 7.0 nan 5.0 6.0 6.0
```
and you can read individual variables using `agg['first']` or similar. Note that I have manually added the `myindex` column to the DataFrame, and that `ds.first` and `ds.last` always return floats.
Longer-term ideas like `where(max('value'), 'myindex')` require some infrastructure changes, because they need two reductions to interact on a per-pixel basis, which is not currently supported; all current reductions are independent. Eventually that could lead to `where(max_n('value', n=3), 'myindex')`. We would first need `max_n` as a standalone reduction that writes to a 3D array of shape `(ny, nx, n)`; this also needs infrastructure changes.
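To illustrate the data layout that such a standalone `max_n` reduction would need, here is a minimal NumPy sketch (hypothetical names `make_acc`/`insert_max_n`; this is not Datashader code) of a `(ny, nx, n)` accumulator holding, per pixel, the `n` largest values seen so far:

```python
# Hypothetical sketch of the (ny, nx, n) accumulator a standalone max_n
# reduction would maintain; per pixel, the n largest values seen so far,
# kept sorted in descending order. Not Datashader's implementation.
import numpy as np

def make_acc(ny, nx, n):
    # Empty slots are NaN, matching Datashader's "missing" convention.
    return np.full((ny, nx, n), np.nan)

def insert_max_n(acc, iy, ix, value):
    # Insert `value` into pixel (iy, ix)'s slots if it beats any of them.
    slots = acc[iy, ix]
    for i in range(slots.shape[0]):
        if np.isnan(slots[i]) or value > slots[i]:
            # Shift smaller values down one slot, then insert.
            slots[i + 1:] = slots[i:-1].copy()
            slots[i] = value
            return

acc = make_acc(1, 1, 3)
for v in [5.0, 1.0, 9.0, 7.0]:
    insert_max_n(acc, 0, 0, v)
# acc[0, 0] now holds [9.0, 7.0, 5.0]
```

A real implementation would vectorize this and handle Dask/GPU backends, but the per-pixel insertion logic is the essential new piece.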
I am hoping that the example above is sufficient to start implementing support for this in HoloViews. That should give me time to work on a refactor of the canvas/reduction code in Datashader to make adding the new reductions much easier.
Possible API for the `where` reduction: `where(selector: Reduction, lookup: str | None = None)` (although I have just made up the names `selector` and `lookup`, and they can easily change). If the user specifies a string name for `lookup`, it is the name of a column that must already be in the DataFrame, and that column's values are returned to the user based on the selector. If `lookup` is `None`, then Datashader uses the index of the row in the DataFrame instead.
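To make the proposed semantics concrete, here is a small pandas sketch (the function name `where_max` and all column names are made up for illustration; this is not Datashader's implementation) of what `where(max('value'), lookup)` would compute per bin:

```python
import numpy as np
import pandas as pd

def where_max(df, bin_col, value_col, lookup=None):
    # Per bin: find the row maximizing value_col (the "selector"), then
    # return that row's `lookup` column value, or its positional row
    # index in the DataFrame when lookup is None.
    best = df.groupby(bin_col)[value_col].idxmax()  # label of max row per bin
    if lookup is None:
        # Positional row index in the DataFrame (the implicit index case).
        return pd.Series(df.index.get_indexer(best), index=best.index)
    return df.loc[best, lookup].set_axis(best.index)

df = pd.DataFrame({
    "pixel":   [0, 0, 1, 1],
    "value":   [3.0, 7.0, 2.0, 5.0],
    "myindex": [10, 11, 12, 13],
})
# where(max('value'), 'myindex') analogue: pixel 0 -> 11, pixel 1 -> 13
print(where_max(df, "pixel", "value", "myindex").tolist())  # [11, 13]
```

Passing `lookup=None` here returns the row positions `[1, 3]` instead, matching the "use the row index" fallback described above.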
@Hoxbro , @jlstevens , @ianthomas23, @mattpap , thanks for your recent work making this closer to reality! Can you please chime in here with the remaining tasks involved? What I am aware of:
I think that is a good summary of what is needed.
For the Bokeh hover tool, my understanding was that the necessary changes would be fairly straightforward to implement but that some API changes/additions are also needed. @mattpap can correct me if I am wrong!
Datashader: what you have at the moment is support for `max`, `max_n`, `min` and `min_n` reductions on CPU, GPU and Dask, on their own and within a `where` reduction. Needed are:

- `first` and `last` need Dask and GPU support (this is the issue you were looking for: #1182).
- `first_n` and `last_n` need Dask and GPU support.
- Support for these within a `where` reduction.

In HoloViews I don't think there is built-in support for calling Bokeh's categorical colormapping or Datashader's `where` reduction yet, but this probably needs @Hoxbro to confirm?
> For the Bokeh hover tool, my understanding was that the necessary changes would be fairly straightforward to implement but that some API changes/additions are also needed. @mattpap can correct me if I am wrong!
If this is what we discussed last week, then it requires some changes to make referencing custom formatters more robust (and hopefully lets us deprecate `HoverTool.formatters`).
Ok, please open the appropriate issues and then link back here! Thanks.
I actually found a way to work around the limitations related to referencing custom formatters. Consider this example (based on Bokeh's `examples/plotting/customjs_hover.py`):
```python
from bokeh.models import CustomJSHover, HoverTool
from bokeh.plotting import figure, show

# range bounds supplied in web mercator coordinates
p = figure(
    x_range=(-2000000, 6000000), y_range=(-1000000, 7000000),
    x_axis_type="mercator", y_axis_type="mercator",
)
p.add_tile("CartoDB Positron")
p.circle(x=[0, 2000000, 4000000], y=[4000000, 2000000, 0], size=30)

formatter = CustomJSHover(code="""
    const projections = Bokeh.require("core/util/projections")
    const {x, y} = special_vars
    const coords = projections.wgs84_mercator.invert(x, y)
    const dim = format == "x" ? 0 : 1
    return coords[dim].toFixed(2)
""")

p.add_tools(HoverTool(
    tooltips=[
        ("lon", "$x{x}"),
        ("lat", "$y{y}"),
    ],
    formatters={
        "$x": formatter,
        "$y": formatter,
    },
))

show(p)
```
The contents of `{}` can be anything except empty, and `custom` has no intrinsic meaning (in fact it's not referenced in the implementation at all), so you can use it to enumerate possible implementations of a custom formatter. This translates nicely to the example @Hoxbro sent me. Note that I would consider this a bit of an abuse of the API.
Thank you @mattpap. Got it to work with your example.
I assume you would still want to make custom formatters more robust?
@jbednar I'd say we close this. We have other issues to actually leverage the new aggregates for inspection purposes in the other repos, and afaik `where` along with `<agg>_n` covers everything we need out of Datashader.
When Datashader renders a large dataset, a human being is usually able to see patterns and interesting datapoints that merit further investigation. Unfortunately, the rendered image does not provide any easy means of doing so, as the original datapoints have all been reduced to pixels (or more accurately, to scalar accumulated values in bins of a 2D histogram). To support investigation of interesting features, HoloViews implements a series of "inspect" operations that query the original dataset after a selection or hover event on the rasterized data. E.g. `inspect_points` in https://examples.pyviz.org/ship_traffic will query the original dataset to show hover and other information about the original datapoints being visualized. However, going back to the original dataset is quite slow, because it requires traversing either the entire dataset or (for a spatially indexed data structure) at least a chunk of it, which makes the interface unpleasant and awkward and rules out certain types of interactivity.

Datashader can collect multiple aggregations in a single pass through the data, so I suggest that we support an accumulation mode that gathers datapoint indexes rather than datapoints, so that hover and drilldown information can be supported instantaneously. Of course, arbitrarily many datapoints can be aggregated into a single pixel, while any practical aggregation can only accumulate a fixed number of indexes per pixel. Still, that's already how the `inspect_` operations work; they discard all but a configurable number of results, which is fine for linking to one or two examples per pixel, and allows single-datapoint precision with enough zooming in. By default I'd suggest accumulating the index of the minimum and the maximum value per pixel, but even just keeping the first or last datapoint for that pixel would be useful.

If we keep at least three datapoints per pixel (e.g. min, max, and one other) we'd be able to distinguish between complete and incomplete inspection data for that pixel (i.e. are these the only points? Yes, if there are 2 or fewer; unclear otherwise). It seems to me that we should be able to have a fully responsive, fully inspectable rendering of a dataset at low computational and memory cost using this method.
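As a rough illustration of this accumulation scheme (hypothetical code, not an implementation in Datashader or HoloViews; the function name `accumulate_indexes` is made up), one pass over already-binned datapoints can store up to a fixed number of row indexes per pixel alongside a count, so hover lookups are instant and the count reveals whether the stored indexes are exhaustive:

```python
import numpy as np

def accumulate_indexes(xbins, ybins, ny, nx, keep=3):
    # Single pass over datapoints already binned to pixel coordinates:
    # store each row's index in the first free slot of its pixel (up to
    # `keep` slots) and always increment that pixel's count.
    idx = np.full((ny, nx, keep), -1, dtype=np.int64)  # -1 means "empty"
    count = np.zeros((ny, nx), dtype=np.int64)
    for row, (ix, iy) in enumerate(zip(xbins, ybins)):
        c = count[iy, ix]
        if c < keep:
            idx[iy, ix, c] = row
        count[iy, ix] = c + 1
    return idx, count

idx, count = accumulate_indexes([0, 0, 0, 1], [0, 0, 0, 1], ny=2, nx=2, keep=2)
# Pixel (0, 0) saw 3 rows but stores only 2 indexes; count > keep tells
# the inspector that the stored exemplars are an incomplete sample.
```

Comparing `count` against `keep` per pixel is exactly the complete-vs-incomplete distinction described above, at a memory cost of only `keep + 1` integers per pixel.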