ivirshup opened 2 years ago
Related: zarr PR for reading in data to a specific array type. The particular use case being GPU arrays https://github.com/zarr-developers/zarr-python/pull/934.
A more flexible (and definitely more error-prone) way to do this would be to allow users to pass a dispatch function to be used during IO. Here's what that function might look like:
```python
from anndata.io import read_dispatched, IOSpec
import dask.array as da


def layers_as_dask(read_func, f, k: str, spec: IOSpec):
    """
    Params
    ------
    read_func
        Default function for reading f[k] based on its IOSpec
    f
        File to read from
    k
        Key
    spec
        IOSpec for f[k]
    """
    match (k.split("/"), spec):
        case (["layers", _], IOSpec("array", _)):
            result = da.from_zarr(f[k])
        case _:
            result = read_func(f[k])
    return result


adata = read_dispatched("adata.zarr", dispatcher=layers_as_dask)
```
I am not actually 100% sure I've used the pattern matching correctly here.
But the idea is that you can stick some extra logic between disk and memory. I think this is much too flexible, but it would be a very quick path to allowing this. Ideally pattern matching would be first class, so we could require its use.
@ivirshup I was looking at this a bit. This would only work for files that follow the key-value paradigm? What would/should the scope of this be, do you think? Also, do you think string matching to handle the `read_func` default (i.e. if the file name ends in `zarr` or `h5ad`) works?
> This would only work for files that follow the key-value paradigm?
Not sure I'm completely understanding, but I will try to answer:
The idea here is that every time `read_elem` would be called, we get to use a callback. The API of the function is flexible, because I haven't actually gotten this to work yet.
> Also do you think string matching to handle the `read_func` default (i.e. if the file name ends in `zarr` or `h5ad`) works?
I would actually probably say you need to pass a `zarr.Group` or `h5py.Group` to `read_dispatched`, and not worry about dispatching based on strings. This gives you much more flexibility in the creation of `zarr.Group`s anyways.
Btw, a corresponding write method would also be very useful. Big use case right now is for controlling chunking.
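A write-side counterpart could mirror the same callback shape. The sketch below is purely hypothetical: the name `write_dispatched` and its callback signature are assumptions patterned on the read proposal, a plain dict stands in for the file, and "writing" just records `(value, dataset_kwargs)`, so the chunking is recorded rather than performed:

```python
def default_write(f, k, elem, dataset_kwargs=None):
    # Stand-in for the default element writer: record value and kwargs.
    f[k] = (elem, dict(dataset_kwargs or {}))


def write_dispatched(f, elems, dispatcher):
    # Call the dispatcher once per element, letting it adjust how each
    # element is written before delegating to the default writer.
    for k, elem in elems.items():
        dispatcher(default_write, f, k, elem, {})


def chunk_layers(write_func, f, k, elem, dataset_kwargs):
    # Force specific chunking for anything written under "layers/".
    if k.startswith("layers/"):
        dataset_kwargs = {**dataset_kwargs, "chunks": (100, 100)}
    write_func(f, k, elem, dataset_kwargs=dataset_kwargs)


out = {}
write_dispatched(out, {"X": [1, 2], "layers/counts": [3, 4]}, chunk_layers)
```

With a real `zarr.Group`, the `chunks` entry in `dataset_kwargs` would control on-disk chunking.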
# AnnData selectors

Note: I will be updating this to be easier to read, just wanted it up to share.

## Use case: Modified IO
I would like to be able to specify read-time transforms on specific elements of an AnnData object. These transforms would include:

- `polars.DataFrame`, `arrow.Table`, `dask.dataframe.DataFrame`
- `sparse`, `jax.sparse`
- `dask`
- `delayed` elements

How do we specify these transforms?
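One entirely hypothetical way to specify such transforms is a registry keyed on a location glob plus encoding type; `TRANSFORMS` and `apply_transforms` are illustrative names only, and the converters here just tag values rather than building real polars/dask objects:

```python
import fnmatch

# Map (location glob, encoding type) -> converter callable.
TRANSFORMS = {
    ("layers/*", "array"): lambda v: ("lazy", v),    # e.g. wrap in dask
    ("obs", "dataframe"): lambda v: ("polars", v),   # e.g. polars.DataFrame
}


def apply_transforms(key, encoding_type, value):
    # Apply the first transform whose location glob and type both match;
    # otherwise return the value unchanged.
    for (pattern, enc), fn in TRANSFORMS.items():
        if enc == encoding_type and fnmatch.fnmatch(key, pattern):
            return fn(value)
    return value
```

Elements that match no registry entry would fall through to the default in-memory representation.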
## Specification

I think we would want to select elements by their type and location. These would be two separate fields.

`IOSpec` mapping of types.

## Examples
### Query by location

### Query by type

### Query by both
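To make the three query styles concrete, here is a purely hypothetical selector helper; `select`, its `loc`/`type` parameters, and the glob syntax are all assumptions, not anndata API:

```python
import fnmatch


def select(elements, loc=None, type=None):
    # Filter a {key: encoding_type} mapping by a location glob and/or an
    # encoding type; omitted criteria match everything.
    out = []
    for k, enc in elements.items():
        if loc is not None and not fnmatch.fnmatch(k, loc):
            continue
        if type is not None and enc != type:
            continue
        out.append(k)
    return out


elements = {"X": "array", "obs": "dataframe", "layers/counts": "array"}
by_loc = select(elements, loc="layers/*")                 # query by location
by_type = select(elements, type="array")                  # query by type
by_both = select(elements, loc="layers/*", type="array")  # query by both
```

Keeping location and type as separate fields means either can be omitted, matching the two-field specification above.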
### Read objects lazily as dask arrays

Usage:
## Alternatives

### Only delayed

Only delayed ops, but this would have to include a way to access data in its backed format (which does not exist for dataframes, and may not exist for future types).
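A tiny sketch of what "only delayed" could mean in practice: every read returns a zero-argument thunk, deferring the actual IO until the value is requested. Plain callables stand in for `dask.delayed` here so the example runs without dask, and `delayed_read` is a hypothetical name:

```python
def delayed_read(store, k):
    # Return a thunk; a real implementation would defer the disk read.
    return lambda: store[k]


store = {"X": [1, 2, 3]}
lazy = delayed_read(store, "X")
value = lazy()  # materializes only when called
```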
### Exclude

Read time modifiers

(will be adding to this)