scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

AnnData IO modifiers #659

Open ivirshup opened 2 years ago

ivirshup commented 2 years ago

AnnData selectors

Note: I will be updating this to be easier to read; I just wanted to get it up to share.

Use case: Modified IO

I would like to be able to specify read-time transforms on specific elements of an AnnData object. These transforms would include operations like excluding elements from the read, converting between in-memory types, and reading lazily (see the examples below).

How do we specify these transforms?

Specification

I think we would want to select elements by their type and location. These would be two separate fields.

Examples

Query by location

{
    "select": {"location": "obsm/*"},
    "modifier": Exclude()
}

Query by type

{
    "select": {"type": pd.DataFrame},
    "modifier": AsPolars(),
}

Query by both

{
    "select": {"type": "array", "location": ["layers", "obsm"]}
    "modifier": Exclude()
}

Read objects lazily as dask arrays

{
    "select": {"location": "*"},
    "modifer": AsDask()
}
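
A rough sketch of how such selectors might be applied internally. The `Modifier` base class, `matches`, and `apply_modifiers` helpers here are hypothetical, purely to illustrate the matching logic:

```python
from fnmatch import fnmatch

class Modifier:
    """Hypothetical base class; subclasses transform an element at read time."""

    def __call__(self, elem):
        raise NotImplementedError

def matches(select: dict, location: str, elem) -> bool:
    """Check whether an element satisfies a selector's location/type fields."""
    if "location" in select:
        patterns = select["location"]
        if isinstance(patterns, str):
            patterns = [patterns]
        # Glob-style matching; whether bare prefixes like "layers" should
        # also match their children would need to be decided.
        if not any(fnmatch(location, p) for p in patterns):
            return False
    if "type" in select:
        # Only handles class-based selectors (e.g. pd.DataFrame); string
        # encoding types like "array" would dispatch on the IOSpec instead.
        if isinstance(select["type"], type) and not isinstance(elem, select["type"]):
            return False
    return True

def apply_modifiers(modifiers: list[dict], location: str, elem):
    """Apply every matching modifier to an element, in order."""
    for m in modifiers:
        if matches(m["select"], location, elem):
            elem = m["modifier"](elem)
    return elem
```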

Usage:

adata = ad.read_h5ad(
    "path/to/file.h5ad",
    modifiers=[
        {
            "select": {"location": ["X", "obs", "obsm/X_umap"]},
            "modifier": Include(),
        },
        {"select": {"location": "X"}, "modifier": AsCSC()},
    ],
)

Alternatives

Only delayed

Only delayed ops; but this would have to include a way to access data in its backed format (which does not exist for dataframes, and may not exist for future types).

Exclude

adata = ad.read_as_dask("path/to/file.h5ad")
adata = adata.select_elements(["X", "obs", "obsm/X_umap"])
adata.to_memory()

Read time modifiers

adata = ad.read_backed("path/to/file.h5ad")
adata.X = ad.io.utils.read_dense_as_csr(adata.X)
adata.to_memory()

(will be adding to this)

ivirshup commented 2 years ago

Related: a zarr PR for reading data in as a specific array type, the particular use case being GPU arrays: https://github.com/zarr-developers/zarr-python/pull/934.
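
For reference, that PR exposes a `meta_array` argument so chunks can be materialized as a non-numpy array type. A minimal sketch, assuming a zarr-python version that includes that PR and a working CUDA/CuPy setup:

```python
import cupy
import zarr

# meta_array tells zarr which array type to allocate chunks into,
# so reads land directly in GPU memory (zarr-developers/zarr-python#934).
z = zarr.open("data.zarr", mode="r", meta_array=cupy.empty(()))
gpu_data = z[:]  # a cupy.ndarray rather than a numpy.ndarray
```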

ivirshup commented 2 years ago

A more flexible (and definitely more error-prone) way to do this would be to allow users to pass a dispatch function to be used during IO. Here's what that function might look like:

from anndata.io import read_dispatched, IOSpec
import dask.array as da

def layers_as_dask(read_func, f, k: str, spec: IOSpec):
    """
    Parameters
    ----------
    read_func
        Default function for reading f[k] based on its IOSpec
    f
        File to read from
    k
        Key
    spec
        IOSpec for f[k]
    """
    match (k.split("/"), spec):
        # Anything stored under layers/ that's encoded as a dense array
        # gets read lazily as a dask array:
        case (["layers", _], IOSpec("array", _)):
            result = da.from_zarr(f[k])
        # Everything else falls through to the default reader:
        case _:
            result = read_func(f[k])
    return result

adata = read_dispatched("adata.zarr", dispatcher=layers_as_dask)

I am not actually 100% sure I've used the pattern matching correctly here.

But the idea is that you can stick some extra logic between disk and memory. I think this is much too flexible, but it would be a very quick path to allowing this. Ideally pattern matching would be first class so we could require that it's used.

ilan-gold commented 1 year ago

@ivirshup I was looking at this a bit. Would this only work for files that follow the key-value paradigm? What do you think the scope of this should be? Also, do you think string matching to handle the read_func default (i.e. if the file name ends in zarr or h5ad) works?

ivirshup commented 1 year ago

This would only work for files that follow the key-value paradigm?

Not sure I'm completely understanding, but I will try to answer:

The idea here is that every time read_elem would be called, we get to use a callback. The API of the function is flexible, because I haven't actually gotten this to work yet.
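
For instance, sticking with the proposed (non-final) signature from above, a pass-through callback that just reports each element as it's read might look like:

```python
def logging_callback(read_func, f, k: str, spec):
    # Invoked once per element read_elem would handle; here we just
    # log the key and then delegate to the default reader.
    print(f"reading {k!r} ({spec})")
    return read_func(f[k])
```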

Also do you think string matching to handle the read_func default (i.e if the file name ends in zarr or h5ad) works?

I would actually probably say you need to pass a zarr.Group or h5py.Group to read_dispatched, and not worry about dispatching based on strings. This gives you much more flexibility in the creation of zarr.Groups anyway.
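
So usage would look something like this (assuming the read_dispatched and layers_as_dask sketches from above):

```python
import zarr

# Open the group yourself, then hand it to read_dispatched:
g = zarr.open_group("adata.zarr", mode="r")
adata = read_dispatched(g, dispatcher=layers_as_dask)
```

An h5py.Group could be passed the same way; the callback would just need an HDF5-appropriate reader in place of da.from_zarr.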

ivirshup commented 1 year ago

Btw, a corresponding write method would also be very useful. A big use case right now is controlling chunking.
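
For example, a write-side counterpart might look like this. `write_dispatched`, its dispatcher signature, and the `dataset_kwargs` plumbing are all assumptions here, mirroring the read sketch above:

```python
def chunked_writer(write_func, g, k: str, elem, dataset_kwargs: dict):
    # Force a specific chunking for everything written under layers/:
    if k.startswith("layers/"):
        dataset_kwargs = {**dataset_kwargs, "chunks": (1000, 1000)}
    # Everything else is written with the default settings:
    return write_func(g, k, elem, dataset_kwargs=dataset_kwargs)

write_dispatched(adata, "adata.zarr", dispatcher=chunked_writer)
```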