holoviz / datashader

Quickly and accurately render even the largest data.
http://datashader.org
BSD 3-Clause "New" or "Revised" License

Add new where reduction #1155

Closed ianthomas23 closed 1 year ago

ianthomas23 commented 1 year ago

This partially implements issue #1126, adding a new where reduction that accepts either a max or min reduction. Best illustrated via an example:

import datashader as ds
import numpy as np
import pandas as pd

x = np.arange(2)
df = pd.DataFrame(dict(
    y_from = [0.0, 1.0, 0.0, 1.0, 0.0],
    y_to   = [0.0, 1.0, 1.0, 0.0, 0.5],
    value  = [1.1, 3.3, 5.5, 2.2, 4.4],
    other  = [-55, -77, -99, -66, -88],
))

canvas = ds.Canvas(plot_height=3, plot_width=5)
agg = canvas.line(
    source=df, x=x, y=["y_from", "y_to"], axis=1,
    agg=ds.where(ds.max("value"), "other"),
)

print(agg)

which outputs

<xarray.DataArray (y: 3, x: 5)>
array([[-99., -88., -55., -66., -66.],
       [ nan, -99., -99., -88., -88.],
       [-77., -77., -77., -99., -99.]])
Coordinates:
  * x        (x) float64 0.1 0.3 0.5 0.7 0.9
  * y        (y) float64 0.1667 0.5 0.8333

You can think of this as applying the max('value') reduction as normal, but then returning the corresponding values from the 'other' column rather than from the 'value' column.
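The per-pixel behaviour described above can be sketched in plain NumPy. This is only an illustration of the semantics, not datashader's actual implementation; the function `where_max` and the `pixel_index` array (which pixel each row falls into) are hypothetical names made up for this example.

```python
import numpy as np

def where_max(pixel_index, value, other, n_pixels):
    """Illustrative only: for each pixel, return `other` from the row
    that has the largest `value` among the rows landing in that pixel."""
    best_value = np.full(n_pixels, np.nan)  # running max of `value` per pixel
    result = np.full(n_pixels, np.nan)      # corresponding `other`, NaN = no data
    for p, v, o in zip(pixel_index, value, other):
        if np.isnan(best_value[p]) or v > best_value[p]:
            best_value[p] = v
            result[p] = o
    return result

# Hypothetical assignment of the five rows to four pixels.
pixel_index = np.array([0, 0, 2, 2, 2])
value = np.array([1.1, 3.3, 5.5, 2.2, 4.4])
other = np.array([-55, -77, -99, -66, -88])

result = where_max(pixel_index, value, other, n_pixels=4)
print(result)
```

Here pixel 0 receives -77 (the row with value 3.3), pixel 2 receives -99 (the row with value 5.5), and pixels 1 and 3 stay NaN because no rows land in them.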

What it currently supports:

Note that first and last are not supported within a where because there is no advantage in doing so; you can just use first or last directly on their own.

Future improvements:

All of these are possible but fiddly to implement, so I would rather make partial functionality available for users to experiment with now and add these improvements over time.

Currently some combinations of lines and dask give different results depending on the number of dask partitions, but this has always been the situation and is no worse here.

jbednar commented 1 year ago

Thanks! Can you clarify the current status of types? I.e. can you return an integer aggregate when testing on a float condition?

ianthomas23 commented 1 year ago

where always returns a float64 aggregate with NaNs representing no data, just like the min, max, first, last, etc. reductions.
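As a rough NumPy illustration of what that means for an integer column (a sketch of the dtype behaviour only, not of datashader's internals; the array names are made up for this example):

```python
import numpy as np

# Hypothetical integer column selected by a where reduction.
other = np.array([-55, -77, -99], dtype=np.int64)

# Float64 aggregate pre-filled with NaN, meaning "no data".
agg = np.full(5, np.nan, dtype=np.float64)

# Pixels 0, 2 and 3 receive data; the integers are cast to float64.
agg[[0, 2, 3]] = other

# Integers above 2**53 cannot all be represented exactly as float64,
# which matters if the returned column holds very large integers.
big = 2**53 + 1
round_tripped = int(np.float64(big))
print(agg.dtype, round_tripped == big)
```

Small integers survive the cast unchanged, but the round trip of `2**53 + 1` comes back as `2**53`, so very large integer values would lose precision in a float64 aggregate.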

jbednar commented 1 year ago

Ok, I guess we'll need to deal with datatype issues when we support using the Pandas index as the "column" (actually just imputed values that act like a column, hence needing special support).

ianthomas23 commented 1 year ago

Rebased on top of main to pick up the CI fixes.

ianthomas23 commented 1 year ago

The reduction in coverage is mostly due to changes to the CUDA append functions; CUDA code is not run in GitHub Actions.

codecov[bot] commented 1 year ago

Codecov Report

Merging #1155 (b34ffd6) into main (645ae07) will increase coverage by 0.03%. The diff coverage is 83.68%.

@@            Coverage Diff             @@
##             main    #1155      +/-   ##
==========================================
+ Coverage   85.39%   85.43%   +0.03%     
==========================================
  Files          35       35              
  Lines        7819     7941     +122     
==========================================
+ Hits         6677     6784     +107     
- Misses       1142     1157      +15     

| Impacted Files | Coverage Δ |
|---|---|
| datashader/core.py | 88.05% <ø> (ø) |
| datashader/reductions.py | 86.94% <80.83%> (-0.29%) :arrow_down: |
| datashader/compiler.py | 95.62% <100.00%> (+0.53%) :arrow_up: |
| datashader/glyphs/line.py | 92.95% <0.00%> (+0.09%) :arrow_up: |

ianthomas23 commented 1 year ago

Pinging @jbednar. I'd like to merge this and add the extra functionality (such as use of a virtual integer row index) as separate PRs.