holoviz / datashader

Quickly and accurately render even the largest data.
http://datashader.org
BSD 3-Clause "New" or "Revised" License

Using categorical coloring for separate aggregates #513

Open jbednar opened 6 years ago

jbednar commented 6 years ago

If we have a dataframe with points in it that each have a category assigned in some other column, we can generate a single image from it where each pixel's color is an average of the category colors, weighted by the counts for each category:

image
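For readers who want the arithmetic spelled out, here is a minimal NumPy sketch of a count-weighted color average. It is an illustration only, not datashader's actual implementation: the function name `weighted_color_average` is made up for this example, the counts and colors are toy values, and it ignores how shade() handles opacity for sparse pixels.

```python
import numpy as np

def weighted_color_average(counts, colors):
    """Blend category colors per pixel, weighted by per-category counts.

    counts: array of shape (H, W, C), one count per pixel and category.
    colors: array of shape (C, 3), one RGB color (0..255) per category.
    Returns an (H, W, 3) float array; pixels with no points come out as 0.
    """
    counts = np.asarray(counts, dtype=float)
    colors = np.asarray(colors, dtype=float)
    total = counts.sum(axis=-1, keepdims=True)  # (H, W, 1)
    # Normalize counts to weights, leaving empty pixels at zero.
    weights = np.divide(counts, total, out=np.zeros_like(counts), where=total > 0)
    return weights @ colors  # (H, W, C) @ (C, 3) -> (H, W, 3)

# Toy 2x2 image with two categories: red and blue.
counts = np.array([[[3, 1], [0, 0]],
                   [[0, 2], [5, 5]]])
colors = np.array([[255, 0, 0], [0, 0, 255]])
rgb = weighted_color_average(counts, colors)
```

A pixel with three red points and one blue point ends up three-quarters red, and a pixel with equal counts lands midway between the two colors.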

However, weighted-color-average plots are also useful in cases where no category field is available. Right now, if you wanted to use the category coloring to show NYC taxi pickups vs. dropoffs, you would have to create a new dataframe twice as long as the old one, with each row representing only a pickup or a dropoff (instead of a pickup/dropoff pair as it is now), and synthesize a new column indicating whether each point was a pickup or a dropoff. It would be helpful to provide at least an example, if not a utility, showing how to avoid doctoring the original dataset in this way. We should be able to make the same calculation directly from the separate aggregates by packing them into the xarray data structure that shade() expects for categorical data.
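To make the current workaround concrete, here is a rough pandas sketch of the reshaping described above. The helper name `wide_to_tall`, the `kind` column, and the toy coordinates are assumptions made for illustration; the column names follow the pickup/dropoff naming used in this thread.

```python
import pandas as pd

# Toy stand-in for the taxi data: one row per trip (pickup, dropoff pair).
df = pd.DataFrame({
    "pickup_x": [0.0, 1.0], "pickup_y": [0.0, 1.0],
    "dropoff_x": [2.0, 3.0], "dropoff_y": [2.0, 3.0],
})

def wide_to_tall(df):
    """Return a frame twice as long, one row per point, with a category column."""
    pickups = (df[["pickup_x", "pickup_y"]]
               .rename(columns={"pickup_x": "x", "pickup_y": "y"})
               .assign(kind="pickup"))
    dropoffs = (df[["dropoff_x", "dropoff_y"]]
                .rename(columns={"dropoff_x": "x", "dropoff_y": "y"})
                .assign(kind="dropoff"))
    tall = pd.concat([pickups, dropoffs], ignore_index=True)
    # Categorical aggregation needs a categorical dtype for the category column.
    tall["kind"] = tall["kind"].astype("category")
    return tall

tall = wide_to_tall(df)
```

This works, but it copies every coordinate and adds a synthesized column, which is exactly the overhead this issue would like to avoid.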

philippjfr commented 6 years ago

This issue is basically about working with wide rather than tall or tidy data. One suggestion I'd have is accepting lists for the Glyph x and y column references. Then you could express your dropoff/pickup example as:

canvas.points(df, ['pickup_x', 'dropoff_x'], ['pickup_y', 'dropoff_y'])

and express the kind of problem described in https://github.com/bokeh/datashader/pull/512 as:

canvas.line(df, 'x', ['col1', 'col2', 'col3', ...])

This would make it quite easy to work with wide data with a lot of columns you want to aggregate on. I'm sure internally this could also be made efficient.

jbednar commented 6 years ago

Right, the example above was about wide data, and I agree that your proposed syntax would make such an example more convenient to do.

In general, though, we should be able to use weighted-color-average plots for any suitable data, including separate aggregate arrays from arbitrary sources (e.g. different dataframes altogether) or from the same source but with arbitrary custom xarray operations on each one before rendering. So we'd still need a standalone example or utility showing how to combine the separate aggregates into the data structure expected by shade().
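A possible shape for such a utility, sketched with xarray only. Everything here is an assumption rather than a confirmed datashader recipe: the helper names `make_agg` and `stack_aggregates` and the `kind` dimension are invented for this example, the 2D aggregates are faked with `np.histogram2d` in place of Canvas output, and it assumes shade() treats any 3D DataArray whose last dimension carries category labels as categorical data.

```python
import numpy as np
import pandas as pd
import xarray as xr

def make_agg(xs, ys, bins=4, extent=(0.0, 1.0)):
    """Stand-in for a 2D count aggregate with dims (y, x)."""
    counts, yedges, xedges = np.histogram2d(ys, xs, bins=bins, range=[extent, extent])
    ycenters = (yedges[:-1] + yedges[1:]) / 2
    xcenters = (xedges[:-1] + xedges[1:]) / 2
    return xr.DataArray(counts, coords={"y": ycenters, "x": xcenters}, dims=("y", "x"))

def stack_aggregates(aggs, dim="kind"):
    """Pack same-shaped 2D aggregates into one (y, x, dim) array.

    aggs: dict mapping a category label to a 2D DataArray.
    """
    labels = pd.Index(list(aggs), name=dim)
    stacked = xr.concat(list(aggs.values()), dim=labels)  # new dim comes first
    first = next(iter(aggs.values()))
    return stacked.transpose(*first.dims, dim)  # move category dim last

rng = np.random.default_rng(0)
pickups = make_agg(rng.random(100), rng.random(100))
dropoffs = make_agg(rng.random(100), rng.random(100))
combined = stack_aggregates({"pickup": pickups, "dropoff": dropoffs})

# With datashader installed, the result could then be tried as (untested here):
# img = tf.shade(combined, color_key={"pickup": "red", "dropoff": "blue"})
```

Because the aggregates are combined only after aggregation, each one can come from a different dataframe or have arbitrary xarray operations applied first, as long as they share the same grid.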

jbednar commented 6 years ago

Not that such an example should be difficult; I expect it's a one-liner; I just don't have time today to create it, hence the issue. :-)