Run test_count_call_alleles on Cubed (alternative approach)

sgkit-dev / sgkit

Scalable genetics toolkit

https://sgkit-dev.github.io/sgkit

Apache License 2.0

235 stars 32 forks source link

Run test_count_call_alleles on Cubed (alternative approach) #1254

Closed tomwhite closed 2 months ago

tomwhite commented 2 months ago

See #908

This takes an alternative approach to the one in #1249, by using map_blocks on both Dask and Cubed, rather than Xarray's apply_ufunc. This avoids the possible issue of the Dask graph changing (originally from #871) that was seen in #1249 by keeping the Dask code path the same.

I think this is the more pragmatic path forward as it allows us to experiment with Cubed by making minimal changes to Dask. (It also doesn't preclude using apply_ufunc in the future.)

Would love to get your thoughts @jeromekelleher and @timothymillar.

tomwhite commented 2 months ago

One issue that this work has highlighted is whether we allow chunking in core dimensions, ploidy in this case. With Dask we do allow ploidy to be chunked, but this only works because we rely on Dask's map_blocks to transparently concatenate chunks along chunked core dimensions. The Dask documentation at https://docs.dask.org/en/stable/generated/dask.array.map_blocks.html actually warns against relying on this, and says

Due to memory-size-constraints, it is often not advisable to use drop_axis on an axis that is chunked.

So I think we should explicitly disallow this case, and fail if the input dataset is chunked in the ploidy dimension. There is no reason to chunk in this dimension, and our VCF converters in sgkit and bio2zarr never do.

I bring it up in this PR since Cubed will never concatenate chunks in this way, which exposed the issue.

jeromekelleher commented 2 months ago

So I think we should explicitly disallow this case, and fail if the input dataset is chunked in the ploidy dimension. There is no reason to chunk in this dimension, and our VCF converters in sgkit and bio2zarr never do.

Agreed - there's no good reason to allow this. 2D chunks are more than enough complication here!

tomwhite commented 2 months ago

Added a check for chunking in the ploidy dimension.

timothymillar commented 2 months ago

Sorry for the radio silence - I was on leave. This approach looks good to me. There are a couple of cases where we use da.gufunc instead of da.map_blocks. But those should be trivial to replace with da.apply_gufunc if we want to match xarray/cubed.

tomwhite commented 2 months ago

Sorry for the radio silence - I was on leave. This approach looks good to me. There are a couple of cases where we use da.gufunc instead of da.map_blocks. But those should be trivial to replace with da.apply_gufunc if we want to match xarray/cubed.

Thanks @timothymillar! That sounds good to me. My plan is to get the aggregation functions needed for QC working under Cubed next - sample_stats, variant_stats, hardy_weinberg_test.