scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Initialising anndata on-disk #794

Open jeskowagner opened 2 years ago

jeskowagner commented 2 years ago

Hi,

I am looking to create an AnnData file with pre-specified dimensions and dtypes of X, obs and var. Ultimately, the aim will be to iteratively populate these slots.

There is one issue with this at the moment (code example 1), and for one other aspect (code example 2) I would appreciate feedback on whether my implementation is the intended way to do this:

Code example 1:

import anndata as ad
import pandas as pd

# create some dummy obs and var
obs, var = pd.DataFrame([""], columns=["C"]), pd.DataFrame([""],columns=["D"])
adat = ad.AnnData(obs=obs,var=var, filename="test.h5ad", filemode="w")
adat2 = ad.read_h5ad("test.h5ad")

print(adat.shape, adat2.shape)
# (1, 1) (0, 0)

Code example 2:

import anndata as ad
import pandas as pd
import h5py

obs, var = pd.DataFrame([""], columns=["C"]), pd.DataFrame([""],columns=["D"])
adat = ad.AnnData(obs=obs, var=var, filename="test.h5ad", filemode="w")
adat.file.close()

# initialise X
f = h5py.File("test.h5ad", "a")
f.create_dataset(
    "X",
    shape=(100000, 10000),
    dtype="float32"
)
f.close()
adat = ad.read_h5ad("test.h5ad", backed="r+")
adat[0:5,0:5] = 1 # <- triggers large memory allocation
adat.file.close()

My library versions are:

anndata version: 0.8.0
h5py version:    3.6.0
hdf5 version:    1.12.1

And just to answer why I would want to do this in the first place: I am looking at a few thousand CSV files totalling around 300 GB on disk that need to be joined and converted to h5ad. The data is dense. The only alternative I see is to use dask for reading and concatenation; however, I believe anndata does not currently support creating AnnData objects from dask dataframes (please correct me if I'm wrong).

Appreciate any help! Thanks,

Jesko

jeskowagner commented 2 years ago

To address issue 1, I now use the following workaround:

import h5py
import pandas as pd
from scipy import sparse
from anndata.experimental import write_elem

obs, var = pd.DataFrame([""], columns=["C"]), pd.DataFrame([""], columns=["D"])
output_file = "test.h5ad"

with h5py.File(output_file, "a") as target:
    dummy_X = sparse.csr_matrix((obs.shape[0], var.shape[0]), dtype="float32")
    dummy_X.indptr = dummy_X.indptr.astype(
        "int64"
    )  # Guarding against overflow for very large datasets

    write_elem(target, "X", dummy_X)
    write_elem(target, "obs", obs)
    write_elem(target, "var", var)

That seems to survive a round-trip just fine, and can be read in using anndata.
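
As a quick sanity check (an illustrative snippet, not part of the original workaround), the file written above can be read back to confirm the round-trip:

import anndata as ad

adata = ad.read_h5ad(output_file)
print(adata)  # AnnData object with n_obs × n_vars = 1 × 1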

So now the more pressing issue is how to avoid memory overhead, as outlined in issue 2 above.

jeskowagner commented 2 years ago

I believe I have solved issue 2 as well. My misunderstanding came from how chunking works on h5py files and how it interfaces with anndata. My implementation now looks somewhat like this:

with h5py.File(output_file, "w") as target:
    target.create_dataset(
        "X",
        (obs.shape[0], var.shape[0]),
        dtype="float32",
        chunks=(1000, 1000),
    )
    write_elem(target, "obs", obs)
    write_elem(target, "var", var)

# read in created output file
adata = ad.read(output_file, backed="r+")

counter = 0
for f in files:
    cur_X = ... # read in a subset of X
    adata[counter : counter + cur_X.shape[0], :].X = cur_X # store that subset of X
    counter += cur_X.shape[0]

The memory overhead of this seems manageable in my tests.
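
For illustration, the chunk-loading placeholder above could be filled in roughly like this (the glob pattern and the pd.read_csv-based loader are assumptions about my CSV layout, not anndata API):

import glob

import anndata as ad
import pandas as pd

output_file = "test.h5ad"
files = sorted(glob.glob("csv_chunks/*.csv"))  # hypothetical location of the CSV pieces

adata = ad.read_h5ad(output_file, backed="r+")

counter = 0
for f in files:
    # each CSV is assumed to hold a block of rows with the same columns as var
    cur_X = pd.read_csv(f).to_numpy(dtype="float32")
    adata[counter : counter + cur_X.shape[0], :].X = cur_X
    counter += cur_X.shape[0]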

Really great how anndata interfaces with h5py to enable this low(ish)-level API behaviour! If write_elem could survive release 0.9 and not be removed (which I understand might happen due to its experimental status) that would be fantastic!

jeskowagner commented 2 years ago

I found that while this gets me 90% there, initialising X on-disk as shown in the last example above is not entirely the same as creating an AnnData object normally. Specifically, adata.X prints as <HDF5 dataset "X": shape (100, 100), type "<f4"> when the file is created in the above way, whereas the normal behaviour is to print the numpy array. I believe this gives AnnData some trouble when performing, for example, the following operation (also mentioned in the scanpy tutorial): adata.raw = adata. Hence, I would like to know if there is a way to turn X into a fully-fledged, AnnData-understood dataset.

jeskowagner commented 2 years ago

Not sure if it helps, but I find that calling adata2 = adata.to_memory() converts the HDF5 dataset into an np.array as I would expect. However, this is not a very feasible solution, as I expect my users not to have enough memory to hold X in memory. Is there any way I could trigger this conversion on-disk?

ivirshup commented 2 years ago

Sorry about the late response here! I was out of town all of July and am just catching up on issues now

It sounds like you've got this mostly figured out, but I'll respond to a few points here:

write_elem could survive release 0.9

It will. At worst, it will move from experimental and maybe change name to something like write_element but that probably won't happen for 0.9.

believe this gives AnnData some trouble when performing for example the following operation (also mentioned in the scanpy tutorial): adata.raw = adata

Could you be more specific about this? I'm not sure I understand exactly what's going wrong. A minimal example would be great.

Any way I could trigger this conversion on-disk?

AnnData should be fine with having a h5py.Dataset in X. So I'm unsure of what the conversion should be.

jeskowagner commented 2 years ago

"Sorry about the late response here!" No problem, thanks for getting back to me!

"It will." Fantastic, thank you!

For the problem description: see the example below; it is slightly long, but I hope it is clear.

Below are two examples of code that work when the dataset is created "normally" (through AnnData(X=X)) but produce different output when the Dataset is created on disk first. The first is, I think, a real error; the second is likely just about how the result is presented.

Reproducible example:

```python
import h5py
import numpy as np
import pandas as pd
import anndata as ad
from anndata.experimental import write_elem

output_file = "test.h5ad"
n_cols = 10
n_rows = 1000

obs = pd.DataFrame({"group": np.arange(n_rows)})
var = pd.DataFrame({"feature": np.arange(n_cols)})

# dummy function for chunked X
def get_X_chunk(chunksize=10):
    X = np.random.random((n_rows, n_cols))
    chunk_idx = 0
    for i in range(chunksize, len(X) + chunksize, chunksize):
        yield X[chunk_idx:min(i, len(X))]
        chunk_idx += chunksize

# initialize h5ad on file
with h5py.File(output_file, "w") as target:
    target.create_dataset(
        "X",
        (obs.shape[0], var.shape[0]),
        dtype="float32",
        chunks=(10, 10),
    )
    write_elem(target, "obs", obs)
    write_elem(target, "var", var)

# read in created output file
adata = ad.read(output_file, backed="r+")

# write chunks to file
counter = 0
for chunk in get_X_chunk():
    adata[counter : counter + chunk.shape[0], :].X = chunk  # store that subset of X
    counter += chunk.shape[0]

### Error mode 1:
adata.raw = adata
# AttributeError: Could not find dataset for raw X in file: test.h5ad.

### Error mode 2:
print(adata.X)
# <HDF5 dataset "X": shape (1000, 10), type "<f4">
# Compare with:
print(ad.AnnData(X=np.vstack([*get_X_chunk()])).X)
# -> array([[...]]
```

Hope this is more clear!

ivirshup commented 2 years ago

Ah, I think I'm seeing what's going on

For "Error mode 1:" It looks like the issue is setting raw on a backed object since it expects the on disk file to have raw already...

I personally would like to deprecate raw, so I'd need some convincing to prioritize fixing this. Since adata and raw would be the same shape here, could you store this in a layer?

For "Error mode 2", that result is expected to me. Essentially what's happening is that X is a h5py.Dataset.

jeskowagner commented 2 years ago

RE 1: I see your point. Yes, storing it in a layer is possible; it just deviates from the "standard procedures" that e.g. Scanpy establishes. But then again, so does creating this file in the first place.

RE 2: I think I might misunderstand - isn't X always a h5py.Dataset (on disk)?

ivirshup commented 2 years ago

isn't X always a h5py.Dataset (on disk)?

When it's a dense array, yes. When loaded into memory it's a numpy array.

In backed mode, it's not loaded into memory, and adata.X is just the h5py.Dataset
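
A tiny sketch illustrating the difference (not from the original discussion; the file name is arbitrary):

import numpy as np
import anndata as ad

adata = ad.AnnData(X=np.random.rand(10, 5).astype("float32"))
adata.write("demo.h5ad")

in_memory = ad.read_h5ad("demo.h5ad")           # X is loaded as a numpy array
backed = ad.read_h5ad("demo.h5ad", backed="r")  # X stays on disk

print(type(in_memory.X))  # <class 'numpy.ndarray'>
print(backed.X)           # <HDF5 dataset "X": shape (10, 5), type "<f4">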

jeskowagner commented 2 years ago

Wow, that makes a lot of sense. Sorry, that really should have been obvious to me. Thanks for your patience.

I have one last question: could you provide me some intuition on how to operate on h5py.Datasets correctly? E.g. the following does not produce the output I would expect:

import numpy as np
from anndata import AnnData, read_h5ad

def scaler(x: np.array):
    x -= x.mean()
    x /= x.std()

adata = AnnData(X=np.random.rand(10, 10))
adata.write("test.h5ad")

adata_copy = read_h5ad("test.h5ad",backed="r+")

np.apply_along_axis(scaler, 0, adata.X)
np.apply_along_axis(scaler, 0, adata_copy.X)

print(adata[0,0].X == adata_copy[0,0].X)
# -> False

ivirshup commented 2 years ago

I think this is about how np.apply_along_axis works.

I believe it's essentially doing func(X[slice]). With an ndarray, X[slice] is a view of the input array. With a hdf5 dataset, X[slice] reads an ndarray out of the hdf5 store – so it's acting like a copy.

I think these would work instead:

def scaler(X):
    for i in X.shape[1]:
        X[:, i] -= X[:, i].mean()
        X[:, i] /= X[:, i].std()

# This might be better cause you read fewer times
def scaler(X):
    for i in X.shape[1]:
        x = X[:, i]
        x -= x.mean()
        x /= x.std()
        X[:, i] = x

But this may not be the fastest, since you probably want something optimized for chunk based access.

jeskowagner commented 2 years ago

You're an absolute hero, thanks a million! With a tiny fix (range) this indeed passes this test:

import numpy as np
from anndata import AnnData, read_h5ad
from memory_profiler import profile

@profile
def scaler(X):
    for i in range(X.shape[1]):
        x = X[:, i]
        x -= x.mean()
        x /= x.std()
        X[:, i] = x

adata = AnnData(X=np.random.rand(100000, 1000))
adata.write("test.h5ad")

adata_copy = read_h5ad("test.h5ad",backed="r+")

scaler(adata.X)
scaler(adata_copy.X)

print(adata[0,0].X == adata_copy[0,0].X)
# True

As you say, it is much slower than it could/should be. But I guess to accelerate it one would need what we discussed over at Discourse. Or maybe some form of accessing the chunksize on disk and optimizing functions to that chunksize (e.g. adata.chunksize). This latter approach is likely more feasible, but would also benefit from some form of controlling chunking to disk through AnnData's write. Probably that's overthinking it, though(?)
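
As a rough sketch of that idea (assuming, as above, that adata.X of a dense backed AnnData is the underlying h5py.Dataset, whose on-disk chunk shape is exposed via .chunks):

from anndata import read_h5ad

adata = read_h5ad("test.h5ad", backed="r+")
X = adata.X                      # for a dense backed AnnData this is the h5py.Dataset
row_chunk, col_chunk = X.chunks  # on-disk chunk shape chosen at creation time

# Process columns in blocks matching the on-disk chunk width, so every chunk is
# read and written only once instead of once per column. Wider blocks trade
# memory for fewer reads; the block width could be capped if memory is tight.
for j in range(0, X.shape[1], col_chunk):
    block = X[:, j : j + col_chunk]   # read a block of whole columns into memory
    block -= block.mean(axis=0)
    block /= block.std(axis=0)
    X[:, j : j + col_chunk] = block   # write the scaled block back to disk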

Just to place this back into context - the advantage of this approach is to deal with very large datasets out of memory. While slower, this does empower users of single-cell packages to run their data on a local machine, which is often preferred in my experience.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!