jbednar opened 2 years ago
After some discussion, I think we agreed that a `first` and `last` index (per bin) aggregate makes sense, and that a `where` aggregator (e.g. 'show the ship with the highest tonnage value that contributed to this pixel') would be nice too.

The only other important point is that you need to know the count, because that tells you whether the 'first' or 'last' index is unique or just an arbitrary sample of the datapoints that contributed to the pixel value (ignoring the possibility that it is meaningful due to sorting, e.g. by time).
@philippjfr suggests implementing a `where` aggregator that returns some column's value given an aggregator that's applied to some other column, e.g. `where(max('value'), 'index')`. That way a user can define which samples are kept.
I believe this syntax could support an `n` argument, retaining the top `n` values (e.g. the `n` largest `value` datapoints encountered). It will be important to document clearly the conditions under which the result is just `n` arbitrary samples versus the top `n` along a well-defined measure. The default for plotting purposes would probably need to be arbitrary, since counts are plotted by default and counts don't establish any ordering between datapoints. In that case a single datapoint is probably the most reasonable default (one exemplar per pixel), i.e. a default like `where(..., 'index', n=1)`.
Some of this can already be done in Datashader, e.g.

```python
import datashader as ds
import pandas as pd

df = pd.DataFrame(dict(x=[0, 1, 0, 1, 0], y=[0, 0, 1, 1, 0], myindex=[4, 5, 6, 7, 8]))
canvas = ds.Canvas(3, 3)
agg = canvas.line(
    df, "x", "y",
    agg=ds.summary(count=ds.count(), first=ds.first("myindex"), last=ds.last("myindex")),
)
```
which produces
```
<xarray.Dataset>
Dimensions:  (x: 3, y: 3)
Coordinates:
  * x        (x) float64 0.1667 0.5 0.8333
  * y        (y) float64 0.1667 0.5 0.8333
Data variables:
    count    (y, x) uint32 2 1 1 0 2 0 1 1 1
    first    (y, x) float64 4.0 4.0 4.0 nan 5.0 nan 5.0 6.0 6.0
    last     (y, x) float64 7.0 4.0 4.0 nan 7.0 nan 5.0 6.0 6.0
```
and you can read individual variables using `agg['first']` or similar. Note that I have manually added the `myindex` column to the DataFrame, and that `ds.first` and `ds.last` always return floats.
Longer-term ideas like `where(max('value'), 'myindex')` require some infrastructure changes, because they need two reductions to interact on a per-pixel basis, which is not currently supported; all current reductions are independent. Eventually that could lead to `where(max_n('value', n=3), 'myindex')`. We would first need `max_n` as a standalone reduction that writes to a 3D array of shape `(ny, nx, n)`; this also needs infrastructure changes.
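To illustrate the data layout that such a standalone `max_n` reduction would need, here is a minimal NumPy sketch (hypothetical names `make_acc`/`insert_max_n`; this is not Datashader code) of a `(ny, nx, n)` accumulator holding, per pixel, the `n` largest values seen so far:

```python
# Hypothetical sketch of the (ny, nx, n) accumulator a standalone max_n
# reduction would maintain; per pixel, the n largest values seen so far,
# kept sorted in descending order. Not Datashader's implementation.
import numpy as np

def make_acc(ny, nx, n):
    # Empty slots are NaN, matching Datashader's "missing" convention.
    return np.full((ny, nx, n), np.nan)

def insert_max_n(acc, iy, ix, value):
    # Insert `value` into pixel (iy, ix)'s slots if it beats any of them.
    slots = acc[iy, ix]
    for i in range(slots.shape[0]):
        if np.isnan(slots[i]) or value > slots[i]:
            # Shift smaller values down one slot, then insert.
            slots[i + 1:] = slots[i:-1].copy()
            slots[i] = value
            return

acc = make_acc(1, 1, 3)
for v in [5.0, 1.0, 9.0, 7.0]:
    insert_max_n(acc, 0, 0, v)
# acc[0, 0] now holds [9.0, 7.0, 5.0]
```

A real implementation would vectorize this and handle Dask/GPU backends, but the per-pixel insertion logic is the essential new piece.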
I am hoping that the example above is sufficient to start implementing support for this in HoloViews. That should give me time to work on a refactor of the canvas/reduction code in Datashader to make adding the new reductions much easier.
Possible API for the `where` reduction: `where(selector: Reduction, lookup: str | None = None)` (although I have just made up the names `selector` and `lookup`, and they can easily change). If the user specifies a string name for `lookup`, it is the name of a column that must already be in the DataFrame, and that column's values are returned to the user based on the selector. If `lookup` is `None`, then Datashader uses the index of the row in the DataFrame instead.
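To make the proposed semantics concrete, here is a small pandas sketch (the function name `where_max` and all column names are made up for illustration; this is not Datashader's implementation) of what `where(max('value'), lookup)` would compute per bin:

```python
import numpy as np
import pandas as pd

def where_max(df, bin_col, value_col, lookup=None):
    # Per bin: find the row maximizing value_col (the "selector"), then
    # return that row's `lookup` column value, or its positional row
    # index in the DataFrame when lookup is None.
    best = df.groupby(bin_col)[value_col].idxmax()  # label of max row per bin
    if lookup is None:
        # Positional row index in the DataFrame (the implicit index case).
        return pd.Series(df.index.get_indexer(best), index=best.index)
    return df.loc[best, lookup].set_axis(best.index)

df = pd.DataFrame({
    "pixel":   [0, 0, 1, 1],
    "value":   [3.0, 7.0, 2.0, 5.0],
    "myindex": [10, 11, 12, 13],
})
# where(max('value'), 'myindex') analogue: pixel 0 -> 11, pixel 1 -> 13
print(where_max(df, "pixel", "value", "myindex").tolist())  # [11, 13]
```

Passing `lookup=None` here returns the row positions `[1, 3]` instead, matching the "use the row index" fallback described above.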
@Hoxbro , @jlstevens , @ianthomas23, @mattpap , thanks for your recent work making this closer to reality! Can you please chime in here with the remaining tasks involved? What I am aware of:
I think that is a good summary of what is needed.
For the Bokeh hover tool, my understanding was that the necessary changes would be fairly straightforward to implement but that some API changes/additions are also needed. @mattpap can correct me if I am wrong!
Datashader: what you have at the moment is support for `max`, `max_n`, `min` and `min_n` reductions on CPU, GPU and Dask, on their own and within a `where` reduction. Needed are:

- `first` and `last` need Dask and GPU support (this is the issue you were looking for: #1182).
- `first_n` and `last_n` need Dask and GPU support.
- Support for these within a `where` reduction.

In HoloViews I don't think there is built-in support for calling Bokeh's categorical colormapping or Datashader's `where` reduction yet, but this probably needs @Hoxbro to confirm?
> For the Bokeh hover tool, my understanding was that the necessary changes would be fairly straightforward to implement but that some API changes/additions are also needed. @mattpap can correct me if I am wrong!
If this is what we discussed last week, then it requires some changes to make referencing custom formatters more robust (and hopefully lets us deprecate `HoverTool.formatters`).
Ok, please open the appropriate issues and then link back here! Thanks.
I actually found a way to work around the limitations related to referencing custom formatters. Consider this example (based on Bokeh's `examples/plotting/customjs_hover.py`):
```python
from bokeh.models import CustomJSHover, HoverTool
from bokeh.plotting import figure, show

# range bounds supplied in web mercator coordinates
p = figure(
    x_range=(-2000000, 6000000), y_range=(-1000000, 7000000),
    x_axis_type="mercator", y_axis_type="mercator",
)
p.add_tile("CartoDB Positron")
p.circle(x=[0, 2000000, 4000000], y=[4000000, 2000000, 0], size=30)

formatter = CustomJSHover(code="""
    const projections = Bokeh.require("core/util/projections")
    const {x, y} = special_vars
    const coords = projections.wgs84_mercator.invert(x, y)
    const dim = format == "x" ? 0 : 1
    return coords[dim].toFixed(2)
""")

p.add_tools(HoverTool(
    tooltips=[
        ("lon", "$x{x}"),
        ("lat", "$y{y}"),
    ],
    formatters={
        "$x": formatter,
        "$y": formatter,
    },
))

show(p)
```
The contents of `{}` can be anything except empty, and `custom` has no intrinsic meaning (in fact it's not referenced in the implementation at all), so you can use it to enumerate possible implementations of a custom formatter. This translates nicely to the example @Hoxbro sent me. Note that I would consider this a bit of an abuse of the API.
Thank you @mattpap. Got it to work with your example.
I assume you would still want to make custom formatters more robust?
@jbednar I'd say we close this. We have other issues to actually leverage the new aggregates for inspection purposes in the other repos, and afaik `where` along with `<agg>_n` covers everything we need out of Datashader.
When Datashader renders a large dataset, a human being is usually able to see patterns and interesting datapoints that merit further investigation. Unfortunately, the rendered image does not provide any easy means of doing so, as the original datapoints have all been reduced to pixels (or more accurately, to scalar accumulated values in bins of a 2D histogram). To support investigation of interesting features, HoloViews implements a series of "inspect" operations that query the original dataset after a selection or hover event on the rasterized data. E.g. `inspect_points` in https://examples.pyviz.org/ship_traffic will query the original dataset to show hover and other information about the original datapoints being visualized. However, going back to the original dataset is quite slow, because it requires traversing either the entire dataset or (for a spatially indexed data structure) at least a chunk of it, which makes the interface unpleasant and awkward and rules out certain types of interactivity.

Datashader can collect multiple aggregations in a single pass through the data, so I suggest that we support an accumulation mode that gathers datapoint indexes rather than datapoints, so that hover and drilldown information can be supported instantaneously. Of course, arbitrarily many datapoints can be aggregated into a single pixel, while any practical aggregation can only accumulate a fixed number of indexes per pixel. Still, that's already how the `inspect_` operations work; they discard all but a configurable number of results, which is fine for linking to one or two examples per pixel, and allows single-datapoint precision with enough zooming in. By default I'd suggest accumulating the index of the minimum and the maximum value per pixel, but even just keeping the first or last datapoint for that pixel would be useful.

If we keep at least three datapoints per pixel (e.g. min, max, and one other) we'd be able to distinguish between complete and incomplete inspection data for that pixel (i.e. are these the only points? Yes, if there are 2 or fewer; unclear otherwise). It seems to me that we should be able to have a fully responsive, fully inspectable rendering of a dataset at low computational and memory cost using this method.
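As a rough illustration of this accumulation scheme (hypothetical code, not an implementation in Datashader or HoloViews; the function name `accumulate_indexes` is made up), one pass over already-binned datapoints can store up to a fixed number of row indexes per pixel alongside a count, so hover lookups are instant and the count reveals whether the stored indexes are exhaustive:

```python
import numpy as np

def accumulate_indexes(xbins, ybins, ny, nx, keep=3):
    # Single pass over datapoints already binned to pixel coordinates:
    # store each row's index in the first free slot of its pixel (up to
    # `keep` slots) and always increment that pixel's count.
    idx = np.full((ny, nx, keep), -1, dtype=np.int64)  # -1 means "empty"
    count = np.zeros((ny, nx), dtype=np.int64)
    for row, (ix, iy) in enumerate(zip(xbins, ybins)):
        c = count[iy, ix]
        if c < keep:
            idx[iy, ix, c] = row
        count[iy, ix] = c + 1
    return idx, count

idx, count = accumulate_indexes([0, 0, 0, 1], [0, 0, 0, 1], ny=2, nx=2, keep=2)
# Pixel (0, 0) saw 3 rows but stores only 2 indexes; count > keep tells
# the inspector that the stored exemplars are an incomplete sample.
```

Comparing `count` against `keep` per pixel is exactly the complete-vs-incomplete distinction described above, at a memory cost of only `keep + 1` integers per pixel.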