holoviz / datashader

Quickly and accurately render even the largest data.
http://datashader.org
BSD 3-Clause "New" or "Revised" License

first_n, last_n, max_n and min_n reductions #1184

Closed ianthomas23 closed 1 year ago

ianthomas23 commented 1 year ago

This is further work on improved inspection reductions (issue #1126), adding first_n, last_n, max_n and min_n reductions. Each accepts a column name and a value for n, the number of results to return for each pixel. For example, max_n("value", n=3) returns a DataArray of shape (ny, nx, n) containing the 3 highest values of column "value" for each pixel.

Demo code using these within a where reduction to return corresponding row indexes or values from a different column:

import datashader as ds
import pandas as pd

df = pd.DataFrame(dict(
    x     = [ 0,  0,  1,  1,  0,  0,  2,  2],
    y     = [ 0,  0,  0,  0,  1,  1,  1,  1],
    value = [ 9,  8,  7,  6,  2,  3,  4,  5],
    other = [11, 12, 13, 14, 15, 16, 17, 18],
    #index    0   1   2   3   4   5   6   7
))

canvas = ds.Canvas(plot_height=2, plot_width=3)

reductions = [
    ("where first_n index", ds.where(ds.first_n("value", 3))),
    ("where first_n other", ds.where(ds.first_n("value", 3), "other")),
    ("where max_n index", ds.where(ds.max_n("value", 3))),
    ("where max_n other", ds.where(ds.max_n("value", 3), "other")),
]

for name, reduction in reductions:
    agg = canvas.points(df, 'x', 'y', agg=reduction)
    print(name, agg.data.dtype)
    print(agg.data)

which outputs

where first_n index int64
[[[ 0  1 -1]
  [ 2  3 -1]
  [-1 -1 -1]]

 [[ 4  5 -1]
  [-1 -1 -1]
  [ 6  7 -1]]]
where first_n other float64
[[[11. 12. nan]
  [13. 14. nan]
  [nan nan nan]]

 [[15. 16. nan]
  [nan nan nan]
  [17. 18. nan]]]
where max_n index int64
[[[ 0  1 -1]
  [ 2  3 -1]
  [-1 -1 -1]]

 [[ 5  4 -1]
  [-1 -1 -1]
  [ 7  6 -1]]]
where max_n other float64
[[[11. 12. nan]
  [13. 14. nan]
  [nan nan nan]]

 [[16. 15. nan]
  [nan nan nan]
  [18. 17. nan]]]

where, as usual, -1 means no row index and nan means no data to return.
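To make the where(max_n) output above easier to follow, here is a minimal pure-Python sketch of the per-pixel selection. It is not how datashader implements the reduction; the function name where_top_n is hypothetical, and the sketch assumes that x and y are already integer pixel coordinates and that ties are kept in row order.

```python
import math

def where_top_n(xs, ys, values, other, n, width, height):
    """Hypothetical reference: per pixel, find the rows holding the n largest
    `values`, and return their row indexes and their `other` values."""
    # Collect (value, row index) pairs for every pixel.
    pixels = [[[] for _ in range(width)] for _ in range(height)]
    for row, (x, y, v) in enumerate(zip(xs, ys, values)):
        pixels[y][x].append((v, row))

    index, picked = [], []
    for y in range(height):
        index_row, picked_row = [], []
        for x in range(width):
            # Largest values first; the stable sort keeps ties in row order
            # (an assumption of this sketch).
            ranked = sorted(pixels[y][x], key=lambda p: p[0], reverse=True)[:n]
            rows = [row for _, row in ranked]
            pad = n - len(rows)
            # Pad with -1 (no row index) and nan (no data), as in the output above.
            index_row.append(rows + [-1] * pad)
            picked_row.append([float(other[r]) for r in rows] + [math.nan] * pad)
        index.append(index_row)
        picked.append(picked_row)
    return index, picked

x     = [0, 0, 1, 1, 0, 0, 2, 2]
y     = [0, 0, 0, 0, 1, 1, 1, 1]
value = [9, 8, 7, 6, 2, 3, 4, 5]
other = [11, 12, 13, 14, 15, 16, 17, 18]

index, picked = where_top_n(x, y, value, other, n=3, width=3, height=2)
print(index[1])  # [[5, 4, -1], [-1, -1, -1], [7, 6, -1]]
```

The second row of pixels shows the difference from first_n: rows 4 and 5 come back as 5, 4 because row 5 has the larger value.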

This allows us to do some complicated combinations such as

ds.summary(
    count=ds.count(),
    min_n=ds.where(ds.min_n("value", n=3)),
    max_n=ds.where(ds.max_n("value", n=3)),
)

to return count plus min_n and max_n (or first_n and last_n) in a single datashader pass.

max_n and min_n work with dask but not CUDA (issue #1177 needs to be solved for that). first_n and last_n only work on the CPU and without dask, the same as first and last (#1177 and #1182 are needed to fix that).

With antialiased lines the results looked OK in some situations but not in others, so I am raising a NotImplementedError for all of these reductions when they are used with antialiasing, and I will separately consider what reasonable behaviour would be. This also applies to where(first_n) and so on.

There is one issue here that needs deciding. I've called the third dimension of the DataArray returned by such a reduction "n" to fit in with the names first_n, etc. You can put multiple whatever_n reductions in a single summary reduction as shown above. If they have the same n then it all works out as expected. But we need a policy on labelling the third dimension if the whatever_n have different n values. We could keep the first n as n, and if subsequent n values are different call them n1, n2, etc?

codecov[bot] commented 1 year ago

Codecov Report

Merging #1184 (b6ed7a5) into main (229cea3) will increase coverage by 0.09%. The diff coverage is 89.83%.

@@            Coverage Diff             @@
##             main    #1184      +/-   ##
==========================================
+ Coverage   85.39%   85.48%   +0.09%     
==========================================
  Files          35       35              
  Lines        8023     8232     +209     
==========================================
+ Hits         6851     7037     +186     
- Misses       1172     1195      +23     

Impacted Files              Coverage Δ
datashader/glyphs/line.py   92.84% <ø> (-0.12%) :arrow_down:
datashader/reductions.py    86.17% <86.97%> (+0.17%) :arrow_up:
datashader/compiler.py      95.74% <100.00%> (+0.12%) :arrow_up:
datashader/core.py          88.38% <100.00%> (+0.02%) :arrow_up:
datashader/utils.py         79.25% <100.00%> (+2.40%) :arrow_up:


ianthomas23 commented 1 year ago

After discussion, we've decided to allow multiple *_n reductions only if they all have the same n value. This allows us to keep the new coordinate label as n.
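
The same-n rule can be sketched as a small check. This is only an illustration of the policy, not datashader's code: the helper name common_n and the error message are hypothetical.

```python
def common_n(n_values):
    """Hypothetical check: return the n shared by all *_n reductions,
    or raise if they disagree. Returns None if there are no *_n reductions."""
    distinct = set(n_values)
    if len(distinct) > 1:
        raise ValueError(
            f"All *_n reductions must use the same n, got {sorted(distinct)}"
        )
    return distinct.pop() if distinct else None

print(common_n([3, 3]))  # 3

try:
    common_n([3, 5])
except ValueError as err:
    print(err)  # All *_n reductions must use the same n, got [3, 5]
```

With a single shared n, the third dimension of every *_n result in a summary can use the one coordinate label n.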