holoviz / datashader

Quickly and accurately render even the largest data.
http://datashader.org
BSD 3-Clause "New" or "Revised" License
3.3k stars 365 forks source link

Dask and CUDA support for first and last reductions #1182

Closed ianthomas23 closed 1 year ago

ianthomas23 commented 1 year ago

first and last reductions are not implemented for dask or CUDA. This is because both work in parallel and we have no control over the order of running, hence up until now we have had no way of determining what is first and last at the point when we need to combine data from different dask partitions or different CUDA threads.

However, there is a way forward by reusing the virtual row index of the recently added where reduction (#1164). This allows each row of a DataFrame to know its row index, so if this information carries through the pipeline until the point at which we, for example, combine the results from separate dask partitions then first and last can be calculated correctly.

In pseudocode, consider that

ds.first("some_column")

is equivalent to

ds.where(ds.min(virtual_row_index), "some_column")

It isn't quite as simple as that as we need to use the where reduction the other way round from normal, i.e. using the virtual row index to determine what values of "some_column" to return when in fact where so far can only use the values of "some_column" to determine what virtual row index to return. But the existing machinery can be generalised for reuse here.

This will need #1177 implemented first for CUDA support.

ianthomas23 commented 1 year ago

Dask support is provided in PR #1214.