tomwhite opened 2 years ago
I've managed to get some basic aggregation tests in test_aggregation.py passing with the changes here: https://github.com/tomwhite/sgkit/commit/83ff40011b1c985cfca086d3fdf70edb371b3689. This is not to be merged; it's just a demonstration at the moment. Most of the changes are due to the array API being stricter about types, so the code needs some explicit casts.
They rely on some changes in xarray too: https://github.com/pydata/xarray/pull/7067.
Also, this example shows that Cubed works with Numba (locally at least), which answers @hammer's question here: https://github.com/pystatgen/sgkit/issues/885#issuecomment-1209288596.
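To make the stricter typing mentioned above concrete, here is a minimal sketch (not taken from the linked commit; the variable names are made up) of the kind of explicit cast the array API standard requires, using Cubed's array API namespace:

```python
# Hedged illustration of the array API standard's stricter typing rules;
# the variables here are invented, not from the sgkit commit.
import cubed.array_api as xp

counts = xp.asarray([2, 0, 1], dtype=xp.uint64)
freq = xp.asarray([0.5, 0.0, 0.25], dtype=xp.float64)

# counts * freq  # mixing integer and floating-point dtypes is left
#                # unspecified by the standard, so strict implementations raise
weighted = xp.astype(counts, xp.float64) * freq  # explicit cast instead
```

With plain NumPy the mixed-dtype product would silently upcast, which is why porting code to the array API surfaces casts like this.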
Since I opened this issue almost two years ago, Xarray has added a chunk manager abstraction (https://docs.xarray.dev/en/stable/internals/chunked-arrays.html), which makes it much easier to switch from Dask to Cubed as the backend computation engine without changing the code that expresses the computation. The nice thing about this approach is that we can use Dask, Cubed, or any other distributed array engine that Xarray might support in the future (such as Arkouda).
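As a rough illustration of what switching the backend looks like, here is a minimal sketch assuming the cubed-xarray package is installed (it registers a "cubed" chunk manager entry point); the dataset path and Spec values are placeholders, and the exact keywords depend on the Xarray version:

```python
# Minimal sketch of selecting Cubed via Xarray's chunk manager abstraction.
# Assumes cubed-xarray is installed; "example.zarr" and the Spec are placeholders.
import cubed
import xarray as xr

spec = cubed.Spec(work_dir="tmp", allowed_mem="2GB")  # placeholder resource limits

ds = xr.open_dataset(
    "example.zarr",                  # hypothetical sgkit-style dataset
    engine="zarr",
    chunks={},                       # open lazily as chunked arrays
    chunked_array_type="cubed",      # use Cubed instead of Dask
    from_array_kwargs={"spec": spec},
)

# Downstream code is unchanged: the computation builds a Cubed plan
# instead of a Dask graph, and .compute() executes it.
result = ds["call_genotype"].sum().compute()
```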
I've started to explore what this might look like in https://github.com/tomwhite/sgkit/tree/xarray-apply-ufunc, but the two main ideas are:

1. Switching from dask.array.map_blocks to xarray.apply_ufunc for applying functions in parallel. The code in the branch does this for count_call_alleles. As you can see in this commit (https://github.com/sgkit-dev/sgkit/commit/383398214da8a9184fbe8ff9c874a726f68ba72f), another minor benefit of using xarray.apply_ufunc is that we can use named dimensions like ploidy and alleles rather than dimension indexes like 2. (A sketch of this is shown after the list.)
2. This commit (https://github.com/sgkit-dev/sgkit/commit/da8657e994bfad2214eeeebb3c50f4e46d86577a) shows the new pytest command-line option to run on cubed: --use-cubed.
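For the first idea, here is a hedged sketch (not sgkit's actual implementation, which uses a Numba gufunc) of wrapping a per-call allele-counting kernel with xarray.apply_ufunc; the function names are illustrative, while the call_genotype variable and the ploidy/alleles dimensions are sgkit's:

```python
# Hedged sketch of count_call_alleles via xarray.apply_ufunc; the kernel and
# function names here are invented for illustration.
import numpy as np
import xarray as xr


def _count_alleles(genotypes: np.ndarray, n_alleles: int) -> np.ndarray:
    # genotypes: (..., ploidy) integer allele calls; -1 marks missing and is not counted.
    counts = np.zeros(genotypes.shape[:-1] + (n_alleles,), dtype=np.uint8)
    for a in range(n_alleles):
        counts[..., a] = (genotypes == a).sum(axis=-1)
    return counts


def count_call_alleles_sketch(ds: xr.Dataset, n_alleles: int) -> xr.DataArray:
    return xr.apply_ufunc(
        _count_alleles,
        ds["call_genotype"],
        kwargs={"n_alleles": n_alleles},
        # Named dimensions instead of positional axis numbers like 2:
        input_core_dims=[["ploidy"]],
        output_core_dims=[["alleles"]],
        dask="parallelized",  # applied per chunk; works with the chunk manager backend
        output_dtypes=[np.uint8],
        dask_gufunc_kwargs={"output_sizes": {"alleles": n_alleles}},
    )
```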
I would be interested in any thoughts on this direction @jeromekelleher, @hammer, @timothymillar, @benjeffery, @ravwojdyla, @eric-czech.
I'd like to set up a CI workflow that adds --use-cubed and runs just the tests for count_call_alleles to start with, before expanding to cover more of sgkit's aggregation functions.
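For reference, a minimal sketch of how a --use-cubed option could be wired up in a conftest.py; the fixture name and the way the backend string is consumed are assumptions, not sgkit's actual implementation:

```python
# Minimal sketch of a --use-cubed pytest option; the fixture is hypothetical.
import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--use-cubed",
        action="store_true",
        default=False,
        help="Run chunked-array computations with Cubed instead of Dask.",
    )


@pytest.fixture(autouse=True)
def chunked_array_backend(request):
    # Tests (or an xarray option) could read this to choose the chunk manager.
    return "cubed" if request.config.getoption("--use-cubed") else "dask"
```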
Here's a successful run for the count_call_alleles tests on Cubed: https://github.com/tomwhite/sgkit/actions/runs/10455603818/job/28950946965
This sounds like an excellent approach +1
This is an umbrella issue to track the work needed to run sgkit on Cubed.
This is possible because Cubed exposes the Python array API standard as well as common Dask functions and methods like map_blocks and Array.compute. Also, there is ongoing work to integrate Cubed in Xarray, as part of exploring alternative parallel execution frameworks in Xarray.
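To show the two interfaces mentioned here, a minimal sketch of Cubed's array API namespace alongside its Dask-style map_blocks and compute; the Spec value is a placeholder and exact signatures may vary between Cubed versions:

```python
# Hedged sketch of Cubed's array API namespace plus Dask-style map_blocks/compute.
import numpy as np

import cubed
import cubed.array_api as xp

spec = cubed.Spec(allowed_mem="100MB")  # placeholder memory budget

a = xp.asarray([[1, 2, 3], [4, 5, 6], [7, 8, 9]], chunks=(2, 2), spec=spec)
b = xp.asarray([[1, 1, 1], [1, 1, 1], [1, 1, 1]], chunks=(2, 2), spec=spec)

# Standard array API call...
c = xp.add(a, b)

# ...and a Dask-style map_blocks over the chunks.
d = cubed.map_blocks(lambda block: block * 2, c, dtype=np.int64)

print(d.compute())
```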