Dask support for first, last, first_n and last_n reductions

ianthomas23 commented 1 year ago

This fixes the Dask parts of issues #1182 and #1207. CUDA support will follow in a separate PR, this is already too big.

Full list of example reductions this supports on CPU with and without Dask:

first('value') - returns single value per pixel
last('value') - ditto
first_n('value', n=3) - returns n=3 values per pixel
last_n('value', n=3) - ditto
where(first('value')) - returns single row index per pixel
where(last('value')) - ditto
where(first_n('value', n=3)) - returns n=3 row indexes per pixel
where(last_n('value', n=3)) - ditto
where(first('value'), 'other') - returns single other per pixel
where(last('value'), 'other') - ditto
where(first_n('value', n=3), 'other') - returns n=3 others per pixel
where(last_n('value', n=3), 'other') - ditto

Summary of changes:

Some reduction behaviour is now dependent on whether supplied data is partitioned (dask or CUDA) or not, so extra arguments cuda and partitioned are passed through the class hierarchy.
I have factored out some common first and last code into a private base class _first_or_last to simplify, and similar applies to _first_n_or_last_n.
The fundamental addition is that first('value') using Dask is implemented as where(_min_row_index(), 'value'). This takes advantage of the existing where machinery and a new class _min_row_index which is very similar to a min reduction except that it accepts a virtual row index column, not a real column supplied by the user. I could have extended the min class to cover this, but I thought it better to keep the new functionality in a new private class. It is possible to instantiate the private class of course, although no user code should do this, and I have done this in the test suite. There are equivalent new private classes for last, first_n and last_n.

I do need to recheck that trying to use CUDA with these reductions raises informative error messages not crashes.

codecov[bot] commented 1 year ago

Codecov Report

Merging #1214 (25bd257) into main (8092f4d) will increase coverage by 0.31%. The diff coverage is 95.09%.

@@            Coverage Diff             @@
##             main    #1214      +/-   ##
==========================================
+ Coverage   84.52%   84.83%   +0.31%     
==========================================
  Files          35       35              
  Lines        8369     8561     +192     
==========================================
+ Hits         7074     7263     +189     
- Misses       1295     1298       +3

Impacted Files	Coverage Δ
datashader/data_libraries/pandas.py	`100.00% <ø> (ø)`
datashader/reductions.py	`84.68% <93.15%> (+1.56%)`	:arrow_up:
datashader/compiler.py	`91.07% <100.00%> (+0.50%)`	:arrow_up:
datashader/data_libraries/dask.py	`95.23% <100.00%> (+0.07%)`	:arrow_up:
datashader/data_libraries/dask_xarray.py	`98.95% <100.00%> (ø)`
datashader/utils.py	`81.86% <100.00%> (+2.61%)`	:arrow_up:

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

ianthomas23 commented 1 year ago

I may or may not have glazed over when reviewing certain bits. :-)

It hadn't occurred to me to be entertaining. I can try that approach, but it may come at the expense of code accuracy :grin:

ianthomas23 commented 1 year ago

I've rebased it on main to pick up the fast CUDA mutex PR, and added a couple of error messages. There are no functional code changes so I will merge then CI is green.

holoviz / datashader

Dask support for first, last, first_n and last_n reductions #1214

Codecov Report