holoviz / datashader

Quickly and accurately render even the largest data.
http://datashader.org
BSD 3-Clause "New" or "Revised" License
3.26k stars 363 forks source link

Dask support for first, last, first_n and last_n reductions #1214

Closed ianthomas23 closed 1 year ago

ianthomas23 commented 1 year ago

This fixes the Dask parts of issues #1182 and #1207. CUDA support will follow in a separate PR, this is already too big.

Full list of example reductions this supports on CPU with and without Dask:

Summary of changes:

  1. Some reduction behaviour is now dependent on whether supplied data is partitioned (dask or CUDA) or not, so extra arguments cuda and partitioned are passed through the class hierarchy.

  2. I have factored out some common first and last code into a private base class _first_or_last to simplify, and similar applies to _first_n_or_last_n.

  3. The fundamental addition is that first('value') using Dask is implemented as where(_min_row_index(), 'value'). This takes advantage of the existing where machinery and a new class _min_row_index which is very similar to a min reduction except that it accepts a virtual row index column, not a real column supplied by the user. I could have extended the min class to cover this, but I thought it better to keep the new functionality in a new private class. It is possible to instantiate the private class of course, although no user code should do this, and I have done this in the test suite. There are equivalent new private classes for last, first_n and last_n.

I do need to recheck that trying to use CUDA with these reductions raises informative error messages not crashes.

codecov[bot] commented 1 year ago

Codecov Report

Merging #1214 (25bd257) into main (8092f4d) will increase coverage by 0.31%. The diff coverage is 95.09%.

@@            Coverage Diff             @@
##             main    #1214      +/-   ##
==========================================
+ Coverage   84.52%   84.83%   +0.31%     
==========================================
  Files          35       35              
  Lines        8369     8561     +192     
==========================================
+ Hits         7074     7263     +189     
- Misses       1295     1298       +3     
Impacted Files Coverage Δ
datashader/data_libraries/pandas.py 100.00% <ø> (ø)
datashader/reductions.py 84.68% <93.15%> (+1.56%) :arrow_up:
datashader/compiler.py 91.07% <100.00%> (+0.50%) :arrow_up:
datashader/data_libraries/dask.py 95.23% <100.00%> (+0.07%) :arrow_up:
datashader/data_libraries/dask_xarray.py 98.95% <100.00%> (ø)
datashader/utils.py 81.86% <100.00%> (+2.61%) :arrow_up:

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

ianthomas23 commented 1 year ago

I may or may not have glazed over when reviewing certain bits. :-)

It hadn't occurred to me to be entertaining. I can try that approach, but it may come at the expense of code accuracy :grin:

ianthomas23 commented 1 year ago

I've rebased it on main to pick up the fast CUDA mutex PR, and added a couple of error messages. There are no functional code changes so I will merge then CI is green.