Closed tomwhite closed 2 months ago
One issue that this work has highlighted is whether we allow chunking in core dimensions, ploidy in this case. With Dask we do allow ploidy to be chunked, but this only works because we rely on Dask's map_blocks to transparently concatenate chunks along chunked core dimensions. The Dask documentation at https://docs.dask.org/en/stable/generated/dask.array.map_blocks.html actually warns against relying on this, and says
Due to memory-size-constraints, it is often not advisable to use
drop_axis
on an axis that is chunked.
So I think we should explicitly disallow this case, and fail if the input dataset is chunked in the ploidy dimension. There is no reason to chunk in this dimension, and our VCF converters in sgkit and bio2zarr never do.
I bring it up in this PR since Cubed will never concatenate chunks in this way, which exposed the issue.
So I think we should explicitly disallow this case, and fail if the input dataset is chunked in the ploidy dimension. There is no reason to chunk in this dimension, and our VCF converters in sgkit and bio2zarr never do.
Agreed - there's no good reason to allow this. 2D chunks are more than enough complication here!
Added a check for chunking in the ploidy dimension.
Sorry for the radio silence - I was on leave. This approach looks good to me. There are a couple of cases where we use da.gufunc
instead of da.map_blocks
. But those should be trivial to replace with da.apply_gufunc
if we want to match xarray/cubed.
Sorry for the radio silence - I was on leave. This approach looks good to me. There are a couple of cases where we use
da.gufunc
instead ofda.map_blocks
. But those should be trivial to replace withda.apply_gufunc
if we want to match xarray/cubed.
Thanks @timothymillar! That sounds good to me. My plan is to get the aggregation functions needed for QC working under Cubed next - sample_stats
, variant_stats
, hardy_weinberg_test
.
See #908
This takes an alternative approach to the one in #1249, by using
map_blocks
on both Dask and Cubed, rather than Xarray'sapply_ufunc
. This avoids the possible issue of the Dask graph changing (originally from #871) that was seen in #1249 by keeping the Dask code path the same.I think this is the more pragmatic path forward as it allows us to experiment with Cubed by making minimal changes to Dask. (It also doesn't preclude using
apply_ufunc
in the future.)Would love to get your thoughts @jeromekelleher and @timothymillar.