scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

AnnData IO modifiers #659

Open ivirshup opened 2 years ago

ivirshup commented 2 years ago

AnnData selectors

Note: I will be updating this to be easier to read; I just wanted to get it up to share.

Use case: Modified IO

I would like to be able to specify read-time transforms on specific elements of an AnnData object. These transforms would include operations like excluding elements from the read, converting between in-memory types, and reading lazily (see the examples below).

How do we specify these transforms?

Specification

I think we would want to select elements by their type and location. These would be two separate fields.

Examples

Query by location

{
    "select": {"location": "obsm/*"},
    "modifier": Exclude()
}

Query by type

{
    "select": {"type": pd.DataFrame},
    "modifier": AsPolars(),
}

Query by both

{
    "select": {"type": "array", "location": ["layers", "obsm"]}
    "modifier": Exclude()
}

Read objects lazily as dask arrays

{
    "select": {"location": "*"},
    "modifer": AsDask()
}
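
A rough sketch of how such selectors might be applied internally. The `Modifier` base class, `matches`, and `apply_modifiers` helpers here are hypothetical, purely to illustrate the matching logic:

```python
from fnmatch import fnmatch

class Modifier:
    """Hypothetical base class; subclasses transform an element at read time."""

    def __call__(self, elem):
        raise NotImplementedError

def matches(select: dict, location: str, elem) -> bool:
    """Check whether an element satisfies a selector's location/type fields."""
    if "location" in select:
        patterns = select["location"]
        if isinstance(patterns, str):
            patterns = [patterns]
        # Glob-style matching; whether bare prefixes like "layers" should
        # also match their children would need to be decided.
        if not any(fnmatch(location, p) for p in patterns):
            return False
    if "type" in select:
        # Only handles class-based selectors (e.g. pd.DataFrame); string
        # encoding types like "array" would dispatch on the IOSpec instead.
        if isinstance(select["type"], type) and not isinstance(elem, select["type"]):
            return False
    return True

def apply_modifiers(modifiers: list[dict], location: str, elem):
    """Apply every matching modifier to an element, in order."""
    for m in modifiers:
        if matches(m["select"], location, elem):
            elem = m["modifier"](elem)
    return elem
```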

Usage:

adata = ad.read_h5ad(
    "path/to/file.h5ad",
    modifiers=[
        {
            "select": {"location": ["X", "obs", "obsm/X_umap"]},
            "modifier": Include(),
        },
        {"select": {"location": "X"}, "modifier": AsCSC()},
    ],
)

Alternatives

Only delayed

Only delayed ops; but this would have to include a way to access data in its backed format (which does not exist for dataframes, and may not exist for future types).

Exclude

adata = ad.read_as_dask("path/to/file.h5ad")
adata = adata.select_elements(["X", "obs", "obsm/X_umap"])
adata.to_memory()

Read time modifiers

adata = ad.read_backed("path/to/file.h5ad")
adata.X = ad.io.utils.read_dense_as_csr(adata.X)
adata.to_memory()

(will be adding to this)

ivirshup commented 2 years ago

Related: a zarr PR for reading data in as a specific array type, the particular use case being GPU arrays: https://github.com/zarr-developers/zarr-python/pull/934.
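
For reference, that PR exposes a `meta_array` argument so chunks can be materialized as a non-numpy array type. A minimal sketch, assuming a zarr-python version that includes that PR and a working CUDA/CuPy setup:

```python
import cupy
import zarr

# meta_array tells zarr which array type to allocate chunks into,
# so reads land directly in GPU memory (zarr-developers/zarr-python#934).
z = zarr.open("data.zarr", mode="r", meta_array=cupy.empty(()))
gpu_data = z[:]  # a cupy.ndarray rather than a numpy.ndarray
```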

ivirshup commented 2 years ago

A more flexible (and definitely more error-prone) way to do this would be to allow users to pass a dispatch function to be used during IO. Here's what that function might look like:

from anndata.io import read_dispatched, IOSpec
import dask.array as da

def layers_as_dask(read_func, f, k: str, spec: IOSpec):
    """
    Parameters
    ----------
    read_func
        Default function for reading f[k] based on its IOSpec
    f
        File to read from
    k
        Key
    spec
        IOSpec for f[k]
    """
    match (k.split("/"), spec):
        # Anything stored under layers/ that's encoded as a dense array
        # gets read lazily as a dask array:
        case (["layers", _], IOSpec("array", _)):
            result = da.from_zarr(f[k])
        # Everything else falls through to the default reader:
        case _:
            result = read_func(f[k])
    return result

adata = read_dispatched("adata.zarr", dispatcher=layers_as_dask)

I am not actually 100% sure I've used the pattern matching correctly here.

But the idea is that you can stick some extra logic between disk and memory. I think this is much too flexible, but it would be a very quick path to allowing this. Ideally pattern matching would be first class so we could require that it's used.

ilan-gold commented 1 year ago

@ivirshup I was looking at this a bit. Would this only work for files that follow the key-value paradigm? What do you think the scope of this should be? Also, do you think string matching to handle the read_func default (i.e. if the file name ends in zarr or h5ad) works?

ivirshup commented 1 year ago

This would only work for files that follow the key-value paradigm?

Not sure I'm completely understanding, but I will try to answer:

The idea here is that every time read_elem would be called, we get to use a callback. The API of the function is flexible, because I haven't actually gotten this to work yet.
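
For instance, sticking with the proposed (non-final) signature from above, a pass-through callback that just reports each element as it's read might look like:

```python
def logging_callback(read_func, f, k: str, spec):
    # Invoked once per element read_elem would handle; here we just
    # log the key and then delegate to the default reader.
    print(f"reading {k!r} ({spec})")
    return read_func(f[k])
```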

Also do you think string matching to handle the read_func default (i.e if the file name ends in zarr or h5ad) works?

I would actually probably say you need to pass a zarr.Group or h5py.Group to read_dispatched, and not worry about dispatching based on strings. This gives you much more flexibility in the creation of zarr.Groups anyway.
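
So usage would look something like this (assuming the read_dispatched and layers_as_dask sketches from above):

```python
import zarr

# Open the group yourself, then hand it to read_dispatched:
g = zarr.open_group("adata.zarr", mode="r")
adata = read_dispatched(g, dispatcher=layers_as_dask)
```

An h5py.Group could be passed the same way; the callback would just need an HDF5-appropriate reader in place of da.from_zarr.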

ivirshup commented 1 year ago

Btw, a corresponding write method would also be very useful. A big use case right now is controlling chunking.
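
For example, a write-side counterpart might look like this. `write_dispatched`, its dispatcher signature, and the `dataset_kwargs` plumbing are all assumptions here, mirroring the read sketch above:

```python
def chunked_writer(write_func, g, k: str, elem, dataset_kwargs: dict):
    # Force a specific chunking for everything written under layers/:
    if k.startswith("layers/"):
        dataset_kwargs = {**dataset_kwargs, "chunks": (1000, 1000)}
    # Everything else is written with the default settings:
    return write_func(g, k, elem, dataset_kwargs=dataset_kwargs)

write_dispatched(adata, "adata.zarr", dispatcher=chunked_writer)
```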