scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

HDF5 Cloud Storage #634

Open jreadey opened 2 years ago

jreadey commented 2 years ago

Is there interest in storing data in the cloud? E.g. using AWS S3. With HDF5 this is problematic since h5py requires a posix-based filesystem. I maintain the h5pyd project (https://github.com/HDFGroup/h5pyd) which gets around this by providing a h5py compatible api to a sharded data store (similar to zarr). I think it should be possible to have h5ad support either h5py or h5pyd, but first wanted to gauge interest in this approach. Thanks!

ivirshup commented 2 years ago

Hey! We definitely have interest in cloud based storage, but so far have been largely eyeing Zarr and possibly arrow for this. This looks interesting. My initial thoughts compared to the other formats:

Strengths:

Weaknesses:

Needs investigation:

Could be of interest to @ryan-williams, @ilan-gold, @joshua-gould

ilan-gold commented 2 years ago

@ivirshup I haven't had time to dig into this on the AnnData side yet, but a good starting point here for remote data (without switching to a new hdf5 reader, which is what this issue seems to be about) would be to literally just try passing in a URL to a zarr-backed AnnData store to the AnnData constructor since (I think) zarr supports URLs natively. For example, this line makes me think that this should just work out of the box: https://github.com/theislab/anndata/blob/0ec97410702f61b29acef80b6377d1232699fa94/anndata/_io/zarr.py#L252

So for example, if you have an AnnData store you could do

adata.write_zarr('path/to/local/my_store.zarr')

Then in a shell (since I don't know offhand in python)

gsutil cp -r path/to/local/my_store.zarr gs://my_bucket/

And then

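import anndata as ad
import aiohttp, requests, zarr, fsspec
# The public test store mentioned below
my_google_url = "https://storage.googleapis.com/vitessce-demo-data/anndata-test/pbmc3k_processed.zarr"
adata_cloud = ad.read_zarr(my_google_url)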

This doesn't seem to work for my examples (I get an empty AnnData store from https://storage.googleapis.com/vitessce-demo-data/anndata-test/pbmc3k_processed.zarr, although there is no error), but this would be the idea if it worked. Maybe I missed something though. I'll do some more digging. Despite this apparently not working with remote zarr stores (I really think I am doing something wrong, or there is only a small change we need), we have read them in via custom loaders in the browser using zarr.js. For example, https://portal.hubmapconsortium.org/browse/dataset/ea4cfecb8495b36694d9a951510dc3c6 uses remote AnnData stores written to zarr for visualization.

Just wanted to chime in. The TL;DR is basically that zarr supports remote data in theory and my inability to get it working here probably has more to do with my inexperience with zarr than anything AnnData specific. I'll try to look into this more.
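
One plausible explanation for the empty result (an assumption, not confirmed anywhere in this thread) is that plain HTTP stores cannot list their contents, so zarr sees a group with no members unless the store carries consolidated metadata. A minimal sketch of that workaround follows. It assumes the zarr v2-style API, fsspec with HTTP support installed, and an anndata version whose read_zarr accepts an already-open zarr.Group; both helper names are hypothetical.

```python
def consolidate_local_store(path):
    """Run once on the local store, before uploading it."""
    import zarr  # imported lazily so the sketch loads without zarr installed

    # Writes one metadata document describing every group and array.
    zarr.consolidate_metadata(path)


def read_remote_anndata(url):
    """Open a consolidated remote store and read it as AnnData."""
    import anndata as ad
    import fsspec
    import zarr

    store = fsspec.get_mapper(url)  # key-value view over HTTP, GCS, S3, ...
    # Reads the consolidated metadata instead of listing the store.
    group = zarr.open_consolidated(store, mode="r")
    return ad.read_zarr(group)
```

If this is the cause, the store would need to be consolidated locally and uploaded again before read_remote_anndata could see its contents.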

jreadey commented 2 years ago

It seems like it's more or less drop in. I was a bit confused with the dispatching logic at first, but my naive attempt seems to work: https://github.com/HDFGroup/anndata/commit/1a4833fed8d9cdeec1d297e224fbeffc030dc304, at least for something like:

import numpy as np
import scipy as sp
import scipy.sparse
from anndata import AnnData, read_h5ad
A = np.random.rand(400, 30)
A[A<0.7]=0
A = sp.sparse.csr_matrix(A)
ad = AnnData(A)
ad.write('hdf5://home/john/anndata/sparse_dataset.h5ad', force_dense=True)

I use the "hdf5://" prefix on the filename to indicate that this is meant to be written to the server rather than a local HDF5 file. After this runs I can do: $ hsls -r /home/john/anndata/sparse_dense_dataset.h5ad (hsls is the server equivalent to h5ls) and get:

/ Group
/X Dataset {400, 30}
/obs Group
/obs/_index Dataset {400}
/var Group
/var/_index Dataset {30}

I haven't tried running any of the benchmarks, but it would likely be significantly slower than writing to local HDF5 files since there's extra latency in making off-box requests to the server. The benefit is that you can target AWS S3, Azure Blob, or other object storage systems. Since the server (HSDS) mediates access to the storage system, users don't need to have credentials to a cloud provider, just a username/password with the service. Also, if the client is running outside the cloud provider, there's less data movement since only actual read and write selections need to be transferred (rather than entire files).
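
To make the data-movement point concrete, here is a rough back-of-the-envelope sketch. It assumes the 400 x 30 dense float64 dataset from the example above and ignores chunking, compression, and HTTP overhead, all of which would change the real numbers. The helper bytes_for_selection is hypothetical.

```python
def bytes_for_selection(n_rows, n_cols, itemsize=8):
    """Uncompressed payload for a dense selection (8 bytes per float64)."""
    return n_rows * n_cols * itemsize


whole_dataset = bytes_for_selection(400, 30)  # every row of /X
ten_rows = bytes_for_selection(10, 30)        # a selection such as X[0:10, :]

# Bytes for the full dataset, for ten rows, and the ratio between them.
print(whole_dataset, ten_rows, whole_dataset // ten_rows)
```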

You do need to have the service running - it can be set up on Docker or Kubernetes. It's fairly easy to install and to scale up or down based on usage requirements.

Let me know if this seems interesting to anyone. If so I'd be happy to flesh out the h5pyd integration.

ivirshup commented 2 years ago

Sorry for the long response time here!

I would really like to see this functionality integrated upstream. Is the goal here for hsds to become something like another driver for hdf5?

Getting AnnData to be more cloud friendly is a current priority – so it's under active development. Until we have nailed down our use cases and usage patterns I think I would prefer to keep a smaller number of backends supported in this library, just so there's less to migrate and support while this evolves. Especially if the end goal for this particular backend is to be available through an API we already support.

jreadey commented 2 years ago

No problem! Thanks for getting back on this.

The goal of HSDS is to support the use of HDF in a cloud-native context. This means having a REST-based API, ability to run in distributed systems (e.g. Kubernetes), dynamically scale, and to work well with object storage backends (e.g. S3). Since it's based on the HDF5 data model, it's relatively simple for h5pyd to support most of h5py's api.

So it's something less than an entirely new backend for anndata - see the PR. We pull in both the h5py and h5pyd packages and then tweak the dispatch logic based on whether we are dealing with HDF5 files or an HSDS server.
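
A minimal sketch of that kind of dispatch, for illustration only. It is not the code from the linked commit; it assumes the "hdf5://" prefix convention described earlier in the thread, and is_hsds_path and open_h5 are hypothetical names.

```python
HSDS_PREFIX = "hdf5://"


def is_hsds_path(filename):
    """True if the name targets an HSDS server rather than a local file."""
    return str(filename).startswith(HSDS_PREFIX)


def open_h5(filename, mode="r", **kwargs):
    """Open `filename` with h5pyd for HSDS domains, else with h5py."""
    if is_hsds_path(filename):
        import h5pyd as backend  # needs a reachable, configured HSDS endpoint
    else:
        import h5py as backend
    # Both packages expose a File class with a compatible basic signature.
    return backend.File(str(filename), mode, **kwargs)
```

Because h5pyd mirrors most of h5py's API, code downstream of open_h5 can, in principle, stay unaware of which backend it received.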

BTW, I've been thinking about sparse data support in HSDS recently and judging from the example above, that's important for AnnData. It would be interesting to think about sparse-specific methods in h5pyd/HSDS. (Of course that would make the backend logic more complicated, but I don't think we'll see sparse data methods in h5py anytime soon)

djarecka commented 1 year ago

Hi, I was wondering if there is any update on this issue. I have a big file on S3 and I would love to be able to read it directly, as for example in h5py using ros3, see here. This is read-only access, but I believe it would help in many use cases.
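
For reference, the h5py ros3 route mentioned here looks roughly like the sketch below. It assumes an h5py build linked against an HDF5 library with the ros3 driver enabled, and a publicly readable object; open_with_ros3 is a hypothetical helper name.

```python
def open_with_ros3(url):
    """Open a remote HDF5 file read-only via HDF5's read-only S3 driver."""
    import h5py  # imported lazily; ros3 is only present in some builds

    if "ros3" not in h5py.registered_drivers():
        raise RuntimeError("this h5py/HDF5 build does not include ros3")
    return h5py.File(url, "r", driver="ros3")
```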

flying-sheep commented 5 months ago

@djarecka found the remfile library that provides a file-like object for reading a remote file over HTTP, optimized for use with h5py.

@Koncopd said in https://github.com/scverse/anndata/pull/1322#issuecomment-1905727154 that s3fs works well.

I think we should

  1. Check what API we want. Maybe we just accept URIs and allow configuring handlers for certain URI schemes?
  2. If we commit to directly supporting something, we should benchmark remfile and s3fs. The latter is more general, but why not support the more efficient option?

ivirshup commented 5 months ago

@djarecka

Hi, I was wondering if there is any update on this issue. I have a big file on S3 and I would love to be able to read it directly, as for example in h5py using ros3, see here

Right now, you can do:

import anndata as ad
import h5py
import remfile

ADATA_URI = "https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/expression_matrices/WMB-10Xv2/20230630/WMB-10Xv2-TH-log2.h5ad"

file_h5 = h5py.File(remfile.File(ADATA_URI), "r")

# Read the whole file
adata = ad.experimental.read_elem(file_h5)

# Read the file like "backed"
# This is specialized to X, but you could put the `SparseDataset` or even `h5py.Dataset` anywhere in the object
def read_w_sparse_dataset(group: "h5py.Group | zarr.Group") -> ad.AnnData:
    return ad.AnnData(
        X=ad.experimental.sparse_dataset(group["X"]),
        **{
            k: ad.experimental.read_elem(group[k]) if k in group else {}
            for k in ["layers", "obs", "var", "obsm", "varm", "uns", "obsp", "varp"]
        }
    )

adata = read_w_sparse_dataset(file_h5)
adata.X
# CSRDataset: backend hdf5, shape (131212, 32285), data_dtype float32

@flying-sheep

Maybe we just accept URIs and allow configuring handlers for certain URI schemes?

See also