PyData prototype backend dispatching

eric-czech commented 4 years ago

Two immediately necessary uses for this are:

Dispatching to IO backends
Dispatching to array backends

The second is far more complicated and a separate framework may not be necessary for the first, but it would be great to support both the same way.

On array dispatch, I don't think dispatching on argument types alone is enough. We will likely have many functions that take multiple array arguments, and when those are a mix of dask/numpy/sparse arrays, a better solution is probably to have the user declare which backend API should be preferred and then special-case coercion where necessary.

At a minimum, I think we should keep CuPy, Dask, and NumPy backends in mind, since Alistair's skallel v2 prototype already shows how different the implementations of genetics algorithms will be across them. Each backend will definitely need to use API-specific functionality, but a lot of operations will also be dispatchable purely through the numpy API. A good question to answer is whether literally using numpy is better for the latter, or whether unumpy makes more sense. The backend dispatching model in unumpy seems like a good fit, but I don't know if aligning to it long-term is worth the extra dependencies. I think it will depend on how much non-API-specific code we actually need.

eric-czech commented 4 years ago

cf. this thread on dispatch in Xarray: https://github.com/pydata/xarray/issues/1938

hammer commented 4 years ago

Some interesting discussion is also happening at https://github.com/pydata/xarray/issues/3213#issuecomment-615772303 with regard to scipy.sparse and pydata/sparse, which may be "backends" to consider as well.

eric-czech commented 4 years ago

Some more notes/questions:

Do file readers load data as chunked arrays or not?
- I think it's useful to draw a distinction between dispatching to a "platform" (for lack of a better generic term) for array computation and dispatching to individual array backends (dask is both).
Do file readers load chunks using a specific backend?
- It would not be unreasonable to expect users to run da.map_blocks(backend_module.asarray) after reads, but some readers may be much more efficient if they load chunks directly into the target backend rather than loading them as some default duck array type (probably numpy) and converting afterwards.
Should our genetics methods assume that the same target array backend can be used for all n-ary numpy functions?
- If not, that would be an argument against using unumpy/uarray.
- For example, given a numpy array of call data and a sparse mask array for missing calls, should an element-wise multiplication of the two be sparse.COO or np.ndarray (or possibly masked)? If the sparsity is high enough, it should be the former. Assuming a user has specified this by setting the SparseBackend, and that some step in the method produces dense results, attempting to stack them will fail:
```
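import numpy as np
import uarray as ua
import unumpy as unp
import unumpy.sparse_backend as SparseBackend

# Route unumpy calls to the sparse backend within this block
with ua.set_backend(SparseBackend):
    # Both inputs are dense NumPy arrays, not sparse arrays
    unp.stack([np.array([1]), np.ones([1])])
# ValueError: All arrays must be instances of SparseArray.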
```
- An alternative would be to treat something like our "CuPyBackend" as more of a loose contract: the bulk of the work will be done with CuPy, and how the rest is implemented is more or less up to us, choosing array backends as we see fit from whatever is installed. I think this is more realistic given the scope of what our more complicated methods will encompass.
Xarray and xgcm use a "duck_array_ops" module that is basically a switch like getattr(dask.array if dask_installed else numpy, numpy_function)(*args, **kwargs) for handling chunked vs. unchunked dispatching
- For Xarray in particular, this also includes special cases for inconsistencies between dask and numpy, as well as implementations of some operations that fall outside the scope of the __array_ufunc__ and __array_function__ protocols.
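As a rough illustration of the switch described above, here is a minimal sketch that mirrors the one-line `getattr(...)` pattern. It is not Xarray's or xgcm's actual code: `first_available` and `make_op` are hypothetical helper names, and the default module order ("dask.array" first, then "numpy") is an assumption taken from the description.

```python
import importlib


def first_available(*module_names):
    """Return the first module in module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError(f"none of {module_names} could be imported")


def make_op(func_name, module_names=("dask.array", "numpy")):
    """Build a function that forwards to func_name on the first available module."""

    def op(*args, **kwargs):
        # The module is resolved at call time, so defining an op never
        # requires either library to be installed.
        module = first_available(*module_names)
        return getattr(module, func_name)(*args, **kwargs)

    op.__name__ = func_name
    return op


# Hypothetical module-level operations in the spirit of duck_array_ops.
concatenate = make_op("concatenate")
stack = make_op("stack")
```

A switch like this only chooses between chunked and unchunked implementations based on what is installed. It does not look at argument types, so the special cases for dask/numpy inconsistencies mentioned above would still have to be written by hand on top of it.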

related-sciences / gwas-analysis

PyData prototype backend dispatching #24