More options for automatic sizing of chunks

alimanfoo commented 2 months ago

I'm finding that the native chunks in the zarr genotype data are too small, which means that dask computations struggle with too many tasks. Increasing the size of chunks for genotype arrays helps with larger computations like SNP allele counts and biallelic diplotypes, which are required for PCA, NJT and other analytical functions.

This PR adds some new convenience values for the chunks parameter which activate automatic chunk size selection but only for arrays with more than one dimension. This is necessary because automatic size selection for one-dimensional arrays can lead to high memory usage, particularly when applying a site filter.
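As a rough illustration of the idea (not the code in this PR), the rule "auto-size only when the array has more than one dimension" could be sketched as below. The helper name `open_with_nd_auto` and the `native_chunks` argument are hypothetical, and small NumPy arrays stand in for the zarr data.

```python
import numpy as np
import dask.array as da


def open_with_nd_auto(x, native_chunks):
    """Hypothetical helper: let dask pick chunk sizes only for arrays
    with more than one dimension; keep native chunks for 1-D arrays."""
    chunks = "auto" if x.ndim > 1 else native_chunks
    return da.from_array(x, chunks=chunks)


# 1-D array (e.g. per-site data): native chunking is preserved.
sites = open_with_nd_auto(np.zeros(1000, dtype="i1"), native_chunks=(100,))

# 3-D array (e.g. genotype calls): dask chooses the chunk shape itself.
genotypes = open_with_nd_auto(
    np.zeros((1000, 50, 2), dtype="i1"), native_chunks=(10, 50, 2)
)

print(sites.chunks)
print(genotypes.chunks)
```

With arrays this small, the multi-dimensional case collapses to a single chunk, while the one-dimensional array keeps its ten native chunks.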

The default value for the chunks parameter has also changed to "ndauto0", which I find gives better performance on distributed clusters and makes no difference to performance either way on Colab.

Note that when using "auto" or any of the "ndauto..." values, the target chunk size is 128MiB by default, but this can be changed, e.g.:

import dask
dask.config.set({"array.chunk-size": "256MiB"})
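To see the effect of this setting, here is a small sketch using a lazily created dask array. The array shape is made up for illustration and is not taken from any real dataset. It compares how many chunks dask's "auto" mode produces under two different target sizes.

```python
import dask
import dask.array as da

# Illustrative shape only: (variants, samples, ploidy) stored as int8.
shape = (100_000, 3_000, 2)

# Smaller target chunk size means more, smaller chunks.
with dask.config.set({"array.chunk-size": "16MiB"}):
    small = da.zeros(shape, dtype="i1", chunks="auto")

# Larger target chunk size means fewer, larger chunks, so fewer tasks.
with dask.config.set({"array.chunk-size": "256MiB"}):
    large = da.zeros(shape, dtype="i1", chunks="auto")

print(small.npartitions, large.npartitions)
```

Nothing is computed here; the arrays are lazy, so only the chunk layout is determined.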

codecov[bot] commented 2 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 95.56%. Comparing base (80820f3) to head (6a70fc7). Report is 14 commits behind head on master.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #615      +/-   ##
==========================================
- Coverage   95.69%   95.56%   -0.13%
==========================================
  Files          39       39
  Lines        3853     3860       +7
==========================================
+ Hits         3687     3689       +2
- Misses        166      171       +5
```


review-notebook-app[bot] commented 2 months ago

Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.
