holoviz / datashader

Quickly and accurately render even the largest data.
http://datashader.org
BSD 3-Clause "New" or "Revised" License

Where reduction using dataframe row index #1164

Closed ianthomas23 closed 1 year ago

ianthomas23 commented 1 year ago

This builds on PR #1155, which should ideally be merged first so that this PR can then be rebased on top of it. I am submitting it early to run it through CI.

It allows the where reduction to be used without the lookup_column argument, in which case the returned agg contains the corresponding row indexes from the pandas/dask DataFrame. The returned agg is int64, with -1 representing missing values. Implementing the row index for pandas DataFrames is quite simple. For dask DataFrames the implementation is more complicated, because this information is not normally available and the index of the DataFrame cannot be relied upon in all scenarios.
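To illustrate why a row index needs extra bookkeeping for partitioned data, here is a minimal sketch of one possible approach: deriving a global row position from per-partition lengths. It is not necessarily what this PR does. The partition lengths and the `global_row_index` helper are hypothetical, and only NumPy is used.

```python
import numpy as np

# Hypothetical partition sizes (assumption for illustration only).
partition_lengths = np.array([3, 2, 3])

# Offset of each partition's first row within the whole DataFrame.
row_offsets = np.concatenate(([0], np.cumsum(partition_lengths)[:-1]))

def global_row_index(partition, local_row):
    """Map a row position within a partition to a position in the whole DataFrame."""
    return int(row_offsets[partition] + local_row)

print(row_offsets)              # offsets of partitions 0, 1 and 2
print(global_row_index(2, 1))   # second row of the third partition
```

A position computed this way does not depend on the DataFrame's own index, so it is not affected by indexes that are non-unique or that restart in each partition.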

Demo code:

import datashader as ds
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    x     = [ 0,  0,  1,  1,  0,  0,  2,  2],
    y     = [ 0,  0,  0,  0,  1,  1,  1,  1],
    value = [ 9,  8,  7,  6,  2,  3,  4,  5],
    other = [11, 12, 13, 14, 15, 16, 17, 18],
    #index    0   1   2   3   4   5   6   7
))

canvas = ds.Canvas(plot_height=2, plot_width=3)

reductions = [
    ("where first index", ds.where(ds.first("value"))),
    ("where last index", ds.where(ds.last("value"))),
    ("where max index", ds.where(ds.max("value"))),
    ("where max other", ds.where(ds.max("value"), "other")),
    ("where min index", ds.where(ds.min("value"))),
    ("where min other", ds.where(ds.min("value"), "other")),
]

for name, reduction in reductions:
    agg = canvas.points(df, 'x', 'y', agg=reduction)
    print(name, agg.data.dtype)
    print(agg.data)

which outputs

where first index int64
[[ 0  2 -1]
 [ 4 -1  6]]
where last index int64
[[ 1  3 -1]
 [ 5 -1  7]]
where max index int64
[[ 0  2 -1]
 [ 5 -1  7]]
where max other float64
[[11. 13. nan]
 [16. nan 18.]]
where min index int64
[[ 1  3 -1]
 [ 4 -1  6]]
where min other float64
[[12. 14. nan]
 [15. nan 17.]]
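
The row-index aggregates printed above can be used to look up any column afterwards. The following is a minimal sketch using only NumPy, with the "where max index" result and the `other` column copied from the demo. The `lookup` helper is hypothetical and not part of datashader.

```python
import numpy as np

# Row-index aggregate copied from the "where max index" output above.
index_agg = np.array([[0, 2, -1],
                      [5, -1, 7]], dtype=np.int64)

# The "other" column from the demo DataFrame.
other = np.array([11, 12, 13, 14, 15, 16, 17, 18])

def lookup(index_agg, column):
    """Look up column values by row index, giving NaN where the index is -1."""
    valid = index_agg >= 0
    out = np.full(index_agg.shape, np.nan)
    out[valid] = column[index_agg[valid]]
    return out

print(lookup(index_agg, other))
```

For this data the result agrees with the "where max other" output, which is what one would expect if the row index and the lookup column refer to the same selected row.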

The selector reductions that where supports in this way are first, last, max and min. For dask DataFrames only max and min are supported so far, because first and last do not have a dask implementation.
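
As a rough reference for what the four selectors return, here is a plain-Python sketch that reproduces the index outputs of the demo. It is not datashader's implementation. It assumes that x and y are already integer pixel coordinates and that ties keep the earliest row (the demo data contains no ties). The `where_row_index` function is hypothetical.

```python
import numpy as np

x     = [0, 0, 1, 1, 0, 0, 2, 2]
y     = [0, 0, 0, 0, 1, 1, 1, 1]
value = [9, 8, 7, 6, 2, 3, 4, 5]

def where_row_index(selector, x, y, value, shape=(2, 3)):
    """Row index chosen by `selector` for each pixel, or -1 if the pixel is empty."""
    if selector not in ("first", "last", "max", "min"):
        raise ValueError(f"unsupported selector: {selector}")
    agg = np.full(shape, -1, dtype=np.int64)
    for row, (xi, yi, v) in enumerate(zip(x, y, value)):
        current = agg[yi, xi]
        if (current == -1
                or selector == "last"
                or (selector == "max" and v > value[current])
                or (selector == "min" and v < value[current])):
            agg[yi, xi] = row
        # For "first", a row already stored in the pixel is kept.
    return agg

for selector in ("first", "last", "max", "min"):
    print(selector)
    print(where_row_index(selector, x, y, value))
```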

ianthomas23 commented 1 year ago

This is ready for review. Once it is merged, we will be in a position to start work on HoloViews to use this functionality for improved inspection.

ianthomas23 commented 1 year ago

> Looks great! Can you provide a section for the user guide showing how to use this in an example?

Yes, I'll do that in a separate PR.