scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License
Add path parameter to write_zarr method #1548

Open antoinegaston opened 4 days ago

antoinegaston commented 4 days ago

Please describe your wishes and possible alternatives to achieve the desired result.

This feature would allow writing the AnnData object to a specific path within a zarr store. It requires only slight changes:

In anndata/_io/zarr.py first

def write_zarr(
    store: MutableMapping | str | Path,
    adata: AnnData,
    path: str | None = None,
    chunks=None,
    **ds_kwargs,
) -> None:
    if isinstance(store, Path):
        store = str(store)
    adata.strings_to_categoricals()
    if adata.raw is not None:
        adata.strings_to_categoricals(adata.raw.var)
    # TODO: Use spec writing system for this
    f = zarr.open(store, mode="w")
    f.attrs.setdefault("encoding-type", "anndata")
    f.attrs.setdefault("encoding-version", "0.1.0")

    def callback(func, s, k, elem, dataset_kwargs, iospec):
        if chunks is not None and not isinstance(elem, sparse.spmatrix) and k == "/X":
            func(s, k, elem, dataset_kwargs=dict(chunks=chunks, **dataset_kwargs))
        else:
            func(s, k, elem, dataset_kwargs=dataset_kwargs)

    write_dispatched(f, f"/{path}", adata, callback=callback, dataset_kwargs=ds_kwargs)
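
One detail in the function above: with the default `path=None`, the f-string `f"/{path}"` evaluates to `"/None"` rather than to the root key. A minimal sketch of one way the key could be normalized, assuming a hypothetical helper named `group_key` (not part of anndata or zarr):

```python
from typing import Optional


def group_key(path: Optional[str]) -> str:
    """Return an absolute group key; None or an empty string means the root."""
    if not path:
        return "/"
    # Strip stray slashes so "a/b", "/a/b" and "a/b/" all map to "/a/b".
    return "/" + path.strip("/")


# The unguarded f-string turns the default into a literal "None" segment.
naive_default = f"/{None}"
```

Under that assumption, the final call would pass `group_key(path)` in place of `f"/{path}"`, so omitting `path` would keep the current behaviour of writing at the root.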

In anndata/_core/anndata.py:

class AnnData(metaclass=utils.DeprecationMixinMeta):
    ...
    def write_zarr(
        self,
        store: MutableMapping | PathLike,
        path: str | None = None,
        chunks: bool | int | tuple[int, ...] | None = None,
    ):
        """\
        Write a hierarchical Zarr array store.

        Parameters
        ----------
        store
            The filename, a :class:`~typing.MutableMapping`, or a Zarr storage class.
        path
            Path within the store at which to write the data.
        chunks
            Chunk shape.
        """
        from .._io import write_zarr

        write_zarr(store, self, path=path, chunks=chunks)

And finally adding a small test to test_readwrite.py:

def test_zarr_path(tmp_path):
    zarr_pth = Path(tmp_path) / "test.zarr"
    adata = gen_adata((100, 100), X_type=np.array)
    adata.write_zarr(zarr_pth, path="test")

    from_zarr = ad.read_zarr(zarr_pth / "test")
    assert_equal(from_zarr, adata)

ilan-gold commented 4 days ago

Hi @antoinegaston, could you elaborate a bit on your use case? From what I can see, this seems like quite an unsafe operation.

f = zarr.open(store, mode="w")
f.attrs.setdefault("encoding-type", "anndata")
f.attrs.setdefault("encoding-version", "0.1.0")

Here you open the store and encode the anndata version/type at the root.

write_dispatched(f, f"/{path}", adata, callback=callback, dataset_kwargs=ds_kwargs)

Then you write out to a different location? How would you read this back in? Just want to understand! Like, why not just pass in the store at the location you want it?

antoinegaston commented 4 days ago

Hello @ilan-gold, thank you for your comment. Indeed, I forgot to pass path to the path parameter of zarr.open:

f = zarr.open(store, mode="w", path=path)

To give you more context about the use case: we have a zarr store in which we store not only anndata but other things as well, so we wanted to be able to do so without having to create multiple stores targeting the different subpaths. We want to keep the flexibility to choose the kind of store, though, without having to multiply the number of parameters to pass to our processing function. The idea is just to pass the path parameter of zarr.open through the write_zarr method.

ilan-gold commented 4 days ago

@antoinegaston But I believe you can pass in a store of your own into write_zarr as things stand, no? So you could use fsspec to create a store at a location and then pass that in?

antoinegaston commented 4 days ago

Yes, you are right. The issue is that in our case we have a global store that is an ABSStore, and we create a root group in it, in which we create some other groups and where we want to write our anndata object as a group as well. The thing is that the store of all those groups is still the global one, and you cannot specify the path directly to write_zarr, as it's a path within remote storage. Tell me if it's unclear.

ilan-gold commented 4 days ago

Does something like this (not exactly, perhaps)

# Combine the original store path with the sub-path
new_path = f"{original_store.path}/{sub_path}"

# Open a new ABSStore at the sub-path
new_store = ABSStore(container_name=new_path)

not work?

A short example would be clarifying.

antoinegaston commented 4 days ago

It does the trick indeed, but it's not always an ABSStore; it depends on the type of the global parent store. It can be a DirectoryStore as well in some situations.

ilan-gold commented 4 days ago

# Combine the original store path with the sub-path
new_path = f"{original_store.path}/{sub_path}"

# Open a new store of the same type as the original at the sub-path
new_store = type(original_store)(container_name=new_path)

Or similar. I don't know; I don't think adding an argument here makes sense. I think the solution here would be to allow passing a zarr.Group, if that doesn't already work (which it very well might; I think zarr.open is idempotent).
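
For readers following the thread, the store-agnostic idea can also be sketched without depending on the parent store's type. The signature earlier in the thread accepts a `MutableMapping` as `store`, so a thin wrapper that prefixes every key would behave like a store rooted at the sub-path. The class name `PrefixedStore` is hypothetical and not part of zarr or anndata, and whether `write_zarr` accepts such a wrapper unchanged has not been verified here:

```python
from collections.abc import MutableMapping


class PrefixedStore(MutableMapping):
    """Hypothetical view of `parent` that keeps every key under `prefix`."""

    def __init__(self, parent, prefix):
        self._parent = parent
        self._prefix = prefix.strip("/") + "/"

    def _full(self, key):
        # Map a key of the sub-store onto the corresponding parent key.
        return self._prefix + key.lstrip("/")

    def __getitem__(self, key):
        return self._parent[self._full(key)]

    def __setitem__(self, key, value):
        self._parent[self._full(key)] = value

    def __delitem__(self, key):
        del self._parent[self._full(key)]

    def __iter__(self):
        # Expose only the parent keys under the prefix, with the prefix removed.
        n = len(self._prefix)
        return (k[n:] for k in self._parent if k.startswith(self._prefix))

    def __len__(self):
        return sum(1 for _ in self)


# A plain dict stands in for the shared parent store in this sketch.
parent = {"root/other/.zgroup": b"{}"}
sub = PrefixedStore(parent, "root/adata")
sub[".zgroup"] = b"{}"
sub["X/.zarray"] = b"{}"
```

Because the wrapper only relies on the mapping interface, the same code would apply whether the parent is a dict, a directory-backed store, or a remote one, which is the flexibility asked for above.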