scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License
563 stars 152 forks source link

Dask and Zarr not loading obsp and obsm from remote s3 #951

Open will-moore opened 1 year ago

will-moore commented 1 year ago

Hi,

I'm using @ilan-gold's nice sample code at https://anndata.readthedocs.io/en/latest/tutorials/notebooks/%7Bread,write%7D_dispatched.html to read remote anndata, which is working great when I'm serving data locally via http.

But when I'm using minio to serve the data, I'm not getting the obsp, obsm or uns parts of the AnnData object. See code sample below.

A UI view of the data is at https://deploy-preview-20--ome-ngff-validator.netlify.app/?source=https://minio-dev.openmicroscopy.org/idr/temp_table/test_segment.zarr/tables/regions_table/ (from https://github.com/ome/ome-ngff-validator/pull/20) where I'm using the extra "keys" in e.g. https://minio-dev.openmicroscopy.org/idr/temp_table/test_segment.zarr/tables/regions_table/obsm/.zattrs to load those groups:

{
    "encoding-type": "dict",
    "encoding-version": "0.1.0",
    "keys": [
        "X_scanorama",
        "X_umap",
        "spatial"
    ]
}

Is there any way I can use those 'keys' to load obsp and obsm data?

Thanks!

from typing import Callable, Union
import dask.array as da
import zarr
from anndata import AnnData
from anndata._io.specs import IOSpec
from anndata.compat import H5Array, H5Group, ZarrArray, ZarrGroup

# ** requires anndata==0.9.0.rc1
from anndata.experimental import read_dispatched, read_elem
from zarr.storage import FSStore
from ome_zarr.io import parse_url
StorageType = Union[H5Array, H5Group, ZarrArray, ZarrGroup]

def read_remote_anndata(store: FSStore, name: str) -> AnnData:
    table_group = zarr.group(store=store, path=name)

    def callback(
        func: Callable, elem_name: str, elem: StorageType, iospec: IOSpec
    ) -> AnnData:
        if iospec.encoding_type in (
            "dataframe",
            "csr_matrix",
            "csc_matrix",
            "awkward-array",
        ):
            # Preventing recursing inside of these types
            return read_elem(elem)
        elif iospec.encoding_type == "array":
            return da.from_zarr(elem)
        else:
            return func(elem)

    adata = read_dispatched(table_group, callback=callback)
    return adata

url = "https://minio-dev.openmicroscopy.org/idr/temp_table/test_segment.zarr/tables/"
name = "regions_table"
store = parse_url(url, mode="r").store
anndata_obj = read_remote_anndata(store, name)

print('anndata_obj', anndata_obj)

# anndata_obj AnnData object with n_obs × n_vars = 1045 × 36
#     obs: 'row_num', 'point', 'cell_id', 'X1', 'center_rowcoord', 'center_colcoord', 'cell_size', 'category', 'donor', 'Cluster', 'batch', 'library_id'
#     var: 'mean-0', 'std-0', 'mean-1', 'std-1', 'mean-2', 'std-2'

url = "http://localhost:8000/test_segment.zarr/tables/"
store = parse_url(url, mode="r").store
anndata_obj = read_remote_anndata(store, name)

print('anndata_obj', anndata_obj)

# anndata_obj AnnData object with n_obs × n_vars = 1045 × 36
#     obs: 'row_num', 'point', 'cell_id', 'X1', 'center_rowcoord', 'center_colcoord', 'cell_size', 'category', 'donor', 'Cluster', 'batch', 'library_id'
#     var: 'mean-0', 'std-0', 'mean-1', 'std-1', 'mean-2', 'std-2'
#     uns: 'Cluster_colors', 'batch_colors', 'neighbors', 'spatial', 'umap'
#     obsm: 'X_scanorama', 'X_umap', 'spatial'
#     obsp: 'connectivities', 'distances'
ilan-gold commented 1 year ago

@will-moore Is the store consolidated? I get a 200 on https://minio-dev.openmicroscopy.org/idr/temp_table/test_segment.zarr/tables/regions_table/.zgroup but not on https://minio-dev.openmicroscopy.org/idr/temp_table/test_segment.zarr/tables/regions_table/.zmetadata

ilan-gold commented 1 year ago

If not, I should probably make that more obvious as an essential step in the tutorial. @ivirshup That is actually something I've been meaning to mention - we should be able to resolve as @will-moore does the keys of obsm etc. without consolidating metadata at a slight penalty. Should we implement this?

will-moore commented 1 year ago

Ah, right - that's coming back to the discussion at https://github.com/kevinyamauchi/ome-ngff-tables-prototype/pull/12 where we preferred consolidated metadata not to be part of the NGFF spec. cc @joshmoore

ilan-gold commented 1 year ago

I mean, y'all are big users and the feature probably makes sense anyway so we probably should implement it. But the penalty is of course a bit higher since you need to open each .zarray for each column.

will-moore commented 1 year ago

By "implement it" you mean support for "keys"? Would this need to happen in anndata itself, or is there a way to add it in the example above?

This is code I'm adding to https://github.com/ome/ome-zarr-py/pull/256 so all NGFF python users will be able to use it.

ilan-gold commented 1 year ago

I am not sure @will-moore but "implement it" means using the trick you use in the browser to discover columns, whether that's in this example or in general. At the moment, we iterate over the keys of the incoming obsm object (a dict) to read it and consolidated metadata lets you have those keys.

will-moore commented 1 year ago

I seem to have got the loading of obsm working, using the custom keys (see below), but obsp is beyond what I can get my head around just now...

I don't know if I'm getting close, or if this is really a lot more complex and we should resign ourselves to using consolidate_metadata if we want this remote access to work?

from typing import Callable, Union
import json

import dask.array as da
import zarr
from anndata import AnnData
from anndata._io.specs import IOSpec
from anndata.compat import H5Array, H5Group, ZarrArray, ZarrGroup

# ** requires anndata==0.9.0.rc1
from anndata.experimental import read_dispatched, read_elem
from zarr.storage import FSStore

from ome_zarr.io import parse_url

StorageType = Union[H5Array, H5Group, ZarrArray, ZarrGroup]

# From https://anndata.readthedocs.io/en/latest/tutorials/notebooks/%7Bread,write%7D_dispatched.html

def read_remote_anndata(store: FSStore, name: str) -> AnnData:
    table_group = zarr.group(store=store, path=name)

    def callback(
        func: Callable, elem_name: str, elem: StorageType, iospec: IOSpec
    ) -> AnnData:
        print("el", elem_name, iospec.encoding_type, elem)
        if iospec.encoding_type in (
            "dataframe",
            "csr_matrix",
            "csc_matrix",
            "awkward-array",
        ):
            # Preventing recursing inside of these types
            return read_elem(elem)
        elif iospec.encoding_type == "array":
            return da.from_zarr(elem)
        elif iospec.encoding_type == "dict" and "obsm" in elem_name:    # or "obsp" in elem_name:
            # load .zattrs
            attrs = store.get(elem_name + "/.zattrs")
            attrs = json.loads(attrs)
            print("attrs", attrs)
            if "keys" in attrs:
                to_return = {}
                for key in attrs["keys"]:
                    print("URL", elem_name + "/" + key)
                    try:
                        arr = da.from_zarr(store, elem_name + "/" + key)
                        to_return[key] = arr
                    except:
                        print("Failed " + elem_name + "/" + key)
                        pass
                return to_return
        # not handled above, call func()
        rsp = func(elem)
        return rsp

    adata = read_dispatched(table_group, callback=callback)
    return adata

url = "https://minio-dev.openmicroscopy.org/idr/temp_table/test_segment.zarr/tables/"
name = "regions_table"
store = parse_url(url, mode="r").store
anndata_obj = read_remote_anndata(store, name)

print('anndata_obj', anndata_obj)
github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!

flying-sheep commented 1 year ago

@ivirshup what do you think about this?

@ivirshup That is actually something I've been meaning to mention - we should be able to resolve as @will-moore does the keys of obsm etc. without consolidating metadata at a slight penalty. Should we implement this?

ilan-gold commented 1 year ago

@flying-sheep I am doing this in the upcoming PR already #947

will-moore commented 1 year ago

Thanks all for following up on this 👍

flying-sheep commented 1 year ago

Should we mark that PR as Fixes #951 then?

ilan-gold commented 1 year ago

Ah, good call. I'll add it.

ilan-gold commented 9 months ago

Hi @will-moore as an alternative to https://github.com/scverse/anndata/pull/947 you can also try out https://github.com/scverse/anndata/tree/ig/xarray_compat (or https://github.com/scverse/anndata/pull/1247) which is a bit lighter weight, but similar with a read_backed method. This returns a normal AnnData object but with xarray.

will-moore commented 9 months ago

Thanks for the update @ilan-gold - Unfortunately I'm not looking to include this work in ome-zarr-py now since the spec proposal isn't going ahead just now https://github.com/ome/ngff/pull/64