Only check dask.DataFrame dtypes of columns actually used

holoviz / datashader

Quickly and accurately render even the largest data.

BSD 3-Clause "New" or "Revised" License


Fixes #1235.

In our dask DataFrame workflows we predict the dtype that will be returned, and previously we tried to calculate one that suited all columns of the DataFrame. This fix restricts the calculation to the columns that we actually use.
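As a rough illustration of why the set of columns matters, the sketch below predicts one common dtype with NumPy's `np.result_type`, first over every column and then over only the columns in use. The schema, the column names, and the `predict_dtype` helper are hypothetical; this is not datashader's actual prediction code.

```python
import numpy as np

# Hypothetical schema: column name -> dtype.
# Only "x", "y" and "value" are assumed to be used by the aggregation.
schema = {
    "x": np.dtype("float64"),
    "y": np.dtype("float64"),
    "value": np.dtype("int32"),
    "label": np.dtype("object"),  # unused column with an unrelated dtype
}


def predict_dtype(schema, columns=None):
    """Return one dtype able to hold every selected column (illustrative helper)."""
    names = list(schema) if columns is None else list(columns)
    return np.result_type(*(schema[name] for name in names))


# Prediction over all columns versus only the columns in use.
all_columns_dtype = predict_dtype(schema)
used_columns_dtype = predict_dtype(schema, ["x", "y", "value"])
print(all_columns_dtype, used_columns_dtype)
```

In this toy schema the unused `label` column is enough to change the prediction, which is the kind of influence the fix removes.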

In terms of implementation, the columns used are already identified in the compile_components function, so we just need to return them to all callers. The dask workflow now uses only those columns.
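A minimal sketch of that pattern, under stated assumptions: a compile step that already knows which columns it needs returns them alongside its other results, and the caller narrows its data to those columns. `compile_sketch`, its return shape, and the dict-based table are hypothetical stand-ins, not the real `compile_components` signature.

```python
def compile_sketch(x_name, y_name, agg_columns):
    """Hypothetical compile step: returns a callable plus the columns it needs."""
    # Deduplicate while preserving first-seen order.
    column_names = list(dict.fromkeys([x_name, y_name, *agg_columns]))

    def combine(row):
        # Placeholder work: pick out only the needed values from a row mapping.
        return tuple(row[name] for name in column_names)

    return combine, column_names


def select_columns(table, column_names):
    """Caller side: keep only the columns the compile step reported."""
    return {name: table[name] for name in column_names}


# Toy table with one column ("label") that the aggregation never touches.
table = {"x": [0.0, 1.0], "y": [2.0, 3.0], "value": [1, 2], "label": ["a", "b"]}
combine, used = compile_sketch("x", "y", ["value", "value"])
narrowed = select_columns(table, used)
```

Returning the column list from the one place that already computes it avoids each caller re-deriving it separately.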

I have been really conservative here. With up-to-date dependencies the predicted dtype doesn't matter at all: I can put in anything here and datashader works as expected. But given that this code does some potentially risky things with dask internals, I do not want to change it any more than necessary.

Codecov Report

Merging #1236 (441fed4) into main (9f5b411) will increase coverage by 0.00%. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1236   +/-   ##
=======================================
  Coverage   83.52%   83.52%           
=======================================
  Files          35       35           
  Lines        8777     8778    +1     
=======================================
+ Hits         7331     7332    +1     
  Misses       1446     1446

| Impacted Files | Coverage Δ | |
|---|---|---|
| datashader/compiler.py | `88.60% <100.00%> (+0.05%)` | :arrow_up: |
| datashader/data_libraries/dask.py | `92.85% <100.00%> (-2.39%)` | :arrow_down: |
| datashader/data_libraries/dask_xarray.py | `98.95% <100.00%> (ø)` | |
| datashader/data_libraries/pandas.py | `100.00% <100.00%> (ø)` | |

... and 1 file with indirect coverage changes

