Further work to improve the default configuration of dask chunk sizes, and to add options for controlling or changing chunk sizes where needed. These changes are based on experience running large computations over genotype data on a distributed cluster and observing performance and memory usage within the cluster.
For most functions, revert to using "native" as the default value for the chunks parameter, so that dask chunks match the native zarr chunks. This is the best choice for functions likely to access data for only a limited genome region, or for general functions where it is not known in advance how much data will be accessed.
For functions where users are likely to scan genotype data for whole contigs or chromosomes, use a larger default chunk size targeting ~300MiB, which is roughly 10x the native zarr chunk size.
Users can override the defaults by providing a value for the chunks parameter. If this is given as a target size in memory, e.g., "300MiB", it is parsed and used to increase the size of dask chunks, but only for arrays with more than one dimension. One-dimensional arrays keep their existing chunks because, in experiments, increasing their chunk size led to high memory usage.
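To illustrate the behaviour described above, here is a minimal sketch of how a chunks value might be resolved. It is not the code in this PR: `parse_size` and `resolve_chunks` are hypothetical helpers, the native chunk shapes in the usage lines are made-up examples, and growing only the first dimension by a whole multiple of the native chunk is an assumption made for illustration.

```python
import math
import re

# Hypothetical helpers for illustration only; not the project's actual code.
_UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30}


def parse_size(text):
    """Parse a string such as "300MiB" into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMG]iB|B)\s*", text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    value, unit = match.groups()
    return int(float(value) * _UNITS[unit])


def resolve_chunks(chunks, native_chunks, itemsize):
    """Return the dask chunk shape to use for one array.

    chunks        -- "native" or a target size in memory such as "300MiB"
    native_chunks -- chunk shape of the underlying zarr array
    itemsize      -- bytes per array element
    """
    if chunks == "native":
        return tuple(native_chunks)
    if len(native_chunks) < 2:
        # One-dimensional arrays keep their native chunks.
        return tuple(native_chunks)
    target = parse_size(chunks)
    native_bytes = math.prod(native_chunks) * itemsize
    # Assumption: grow along the first dimension only, by a whole number
    # of native chunks, so dask chunks stay aligned with zarr chunks.
    factor = max(1, target // native_bytes)
    return (native_chunks[0] * factor,) + tuple(native_chunks[1:])


# Made-up native chunk shapes, for demonstration.
genotype_chunks = resolve_chunks("300MiB", (300_000, 50, 2), itemsize=1)
position_chunks = resolve_chunks("300MiB", (524_288,), itemsize=4)
```

With these example shapes a native genotype chunk occupies 30,000,000 bytes, so the "300MiB" target scales the first dimension by a factor of 10, while the one-dimensional array is left at its native chunking.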
Also includes some fixes to the GCS bucket configuration so that the new single-region buckets are used.