holoviz / datashader

Quickly and accurately render even the largest data.
http://datashader.org
BSD 3-Clause "New" or "Revised" License

Support by(max_n) and by(min_n) #1229

Closed ianthomas23 closed 1 year ago

ianthomas23 commented 1 year ago

Adds support for categorical max_n and min_n reductions, such as ds.by("cat", ds.max_n("value", n=3)), on CPU and GPU, both with and without dask. This is the first part of issue #1210; support for categorical first_n, last_n and where will follow.

Example:

import datashader as ds
import numpy as np
from numpy import nan
import pandas as pd

# Shared x coordinates: with axis=1 each row is one line from (0, y_from) to (1, y_to).
x = np.arange(2)
df = pd.DataFrame(dict(
    y_from = [0.0, 1.0, 0.0, 1.0, 0.0],
    y_to   = [0.0, 1.0, 1.0, 0.0, 0.5],
    value  = [1.1, 3.3, 5.5, 2.2, 4.4],
    cat    = ['a', 'b', 'a', 'b', 'a'],
))
df["cat"] = df["cat"].astype("category")

canvas = ds.Canvas(plot_height=2, plot_width=3)
agg = canvas.line(source=df, x=x, y=["y_from", "y_to"], axis=1,
                  agg=ds.by("cat", ds.max_n("value", n=3)))
print(agg)

which prints

<xarray.DataArray (y: 2, x: 3, cat: 2, n: 3)>
array([[[[5.5, 4.4, 1.1],
         [nan, nan, nan]],

        [[1.1, nan, nan],
         [2.2, nan, nan]],

        [[1.1, nan, nan],
         [2.2, nan, nan]]],

       [[[nan, nan, nan],
         [3.3, 2.2, nan]],

        [[5.5, 4.4, nan],
         [3.3, nan, nan]],

        [[5.5, 4.4, nan],
         [3.3, nan, nan]]]])
Coordinates:
  * x        (x) float64 0.1667 0.5 0.8333
  * y        (y) float64 0.25 0.75
  * cat      (cat) <U1 'a' 'b'
  * n        (n) int64 0 1 2
Attributes:
    x_range:  (0, 1)
    y_range:  (0.0, 1.0)

Note that the returned DataArray has shape (ny, nx, ncat, n), which I think is more logical than the alternative of (ny, nx, n, ncat).
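To illustrate what this layout means for indexing, here is a minimal NumPy-only sketch that copies the values printed above into a plain array (no datashader call is involved). The variable names are illustrative and not part of datashader.

```python
import numpy as np

nan = np.nan

# Values copied from the printed DataArray, shape (ny, nx, ncat, n) = (2, 3, 2, 3).
values = np.array([
    [[[5.5, 4.4, 1.1], [nan, nan, nan]],
     [[1.1, nan, nan], [2.2, nan, nan]],
     [[1.1, nan, nan], [2.2, nan, nan]]],
    [[[nan, nan, nan], [3.3, 2.2, nan]],
     [[5.5, 4.4, nan], [3.3, nan, nan]],
     [[5.5, 4.4, nan], [3.3, nan, nan]]],
])

# All n values for category 'a' (index 0 along the cat axis): shape (ny, nx, n).
cat_a = values[:, :, 0, :]

# In this output, index 0 along n holds the largest value per pixel and category.
largest_a = values[:, :, 0, 0]

# Number of non-NaN values found per pixel and category: shape (ny, nx, ncat).
counts = np.sum(~np.isnan(values), axis=-1)

print(cat_a.shape, largest_a.shape, counts.shape)
```

With the category axis before n, selecting a category leaves a (ny, nx, n) array that has the same layout as a non-categorical max_n result.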

In terms of implementation, functions like nanmax_n_in_place now always accept a 4D array, so there is a single implementation covering both 3D (max) and 4D (max_n) arrays for each of CPU and GPU. The combine function in max inserts an extra dimension of size 1, which changes the shape without copying any data.
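The idea can be sketched in plain NumPy as follows. This is not datashader's actual code: the function names merge_max_n_4d and insert_descending are hypothetical, the loops are plain Python rather than compiled CPU or GPU kernels, and the sketch assumes each row along n is sorted in descending order with NaN padding, as in the printed output above.

```python
import numpy as np

nan = np.nan


def insert_descending(row, value):
    """Insert value into a 1D row kept in descending order with NaN padding."""
    for i in range(len(row)):
        if np.isnan(row[i]) or value > row[i]:
            # Shift the tail right by one slot (dropping the last entry), then insert.
            row[i + 1:] = row[i:-1].copy()
            row[i] = value
            return


def merge_max_n_4d(ret, other):
    """Merge other into ret in place; both have shape (ny, nx, ncat, n)."""
    ny, nx, ncat, _ = ret.shape
    for y in range(ny):
        for x in range(nx):
            for c in range(ncat):
                for value in other[y, x, c]:
                    if np.isnan(value):
                        break  # NaN padding reached, no more values in this row
                    insert_descending(ret[y, x, c], value)


# 4D case (max_n with n=3): one pixel, one category.
ret4 = np.array([[[[5.5, 1.1, nan]]]])
other4 = np.array([[[[4.4, nan, nan]]]])
merge_max_n_4d(ret4, other4)

# 3D case (plain max): add a trailing axis of size 1, which creates a view, not a copy.
ret3 = np.array([[[1.0, nan]]])      # shape (ny, nx, ncat) = (1, 1, 2)
other3 = np.array([[[3.0, 2.0]]])
ret3_view = ret3[..., np.newaxis]    # shape (1, 1, 2, 1), shares memory with ret3
merge_max_n_4d(ret3_view, other3[..., np.newaxis])

print(ret4)
print(ret3)
```

Because the trailing axis is added as a view, the in-place update through the 4D function is visible in the original 3D array.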

codecov[bot] commented 1 year ago

Codecov Report

Merging #1229 (bfdcc66) into main (28c8581) will decrease coverage by 0.04%. The diff coverage is 67.79%.

@@            Coverage Diff             @@
##             main    #1229      +/-   ##
==========================================
- Coverage   83.62%   83.59%   -0.04%     
==========================================
  Files          35       35              
  Lines        8738     8751      +13     
==========================================
+ Hits         7307     7315       +8     
- Misses       1431     1436       +5     
Impacted Files                                  Coverage Δ
datashader/transfer_functions/_cuda_utils.py    20.63% <0.00%> (ø)
datashader/reductions.py                        79.02% <47.05%> (-0.22%)
datashader/compiler.py                          88.42% <100.00%> (+0.06%)
datashader/utils.py                             81.63% <100.00%> (+0.09%)


jbednar commented 1 year ago

We'll need some docs at the Datashader level when you're done with all this, of course.