jeskowagner opened this issue 2 years ago
To address issue 1, I now used the following workaround:
```python
import h5py
import pandas as pd
from scipy import sparse
from anndata.experimental import write_elem

obs, var = pd.DataFrame([""], columns=["C"]), pd.DataFrame([""], columns=["D"])

with h5py.File(output_file, "a") as target:
    dummy_X = sparse.csr_matrix((obs.shape[0], var.shape[0]), dtype="float32")
    # Guard against overflow for very large datasets
    dummy_X.indptr = dummy_X.indptr.astype("int64")
    write_elem(target, "X", dummy_X)
    write_elem(target, "obs", obs)
    write_elem(target, "var", var)
```
That seems to survive a round-trip just fine, and can be read in using anndata.
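For reference, reading it back is a one-liner (a minimal sketch, reusing the same `output_file` as above):

```python
# Minimal round-trip check (assumes output_file from the snippet above)
import anndata as ad

adata = ad.read_h5ad(output_file)
print(adata)  # AnnData object with n_obs × n_vars = 1 × 1, obs: 'C', var: 'D'
```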
So now the more pressing issue is how to avoid memory overhead, as outlined in issue 2 of my original post.
I believe I have solved issue 2 as well. My misunderstanding came from how chunking works on h5py files and how that interfaces with anndata. My implementation now looks somewhat like this:
```python
import anndata as ad
import h5py
from anndata.experimental import write_elem

with h5py.File(output_file, "w") as target:
    target.create_dataset(
        "X",
        (obs.shape[0], var.shape[0]),
        dtype="float32",
        chunks=(1000, 1000),
    )
    write_elem(target, "obs", obs)
    write_elem(target, "var", var)

# read in the created output file in backed mode
adata = ad.read(output_file, backed="r+")

counter = 0
for f in files:
    cur_X = ...  # read in a subset of X
    adata[counter : counter + cur_X.shape[0], :].X = cur_X  # store that subset of X
    counter += cur_X.shape[0]
```
The memory overhead of this seems manageable in my tests.
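For illustration, a hedged sketch of what the fill loop can look like when each chunk comes from a dense CSV (the glob pattern, index column, and dtype are assumptions for illustration, not part of the thread):

```python
# Sketch: populate the pre-allocated, backed X from a list of dense CSV chunks.
# The file pattern, index column, and dtype are assumptions; adjust to your data.
import glob

import anndata as ad
import pandas as pd

files = sorted(glob.glob("chunks/*.csv"))       # hypothetical chunk files
adata = ad.read_h5ad(output_file, backed="r+")  # output_file from above

counter = 0
for f in files:
    cur_X = pd.read_csv(f, index_col=0).to_numpy(dtype="float32")
    adata[counter : counter + cur_X.shape[0], :].X = cur_X
    counter += cur_X.shape[0]

adata.file.close()  # flush and close the backing HDF5 file
```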
Really great how anndata interfaces with h5py to enable this low(ish)-level API behaviour! If write_elem could survive release 0.9 and not be removed (which I understand might happen due to its experimental status), that would be fantastic!
I found that while this gets me 90% of the way there, initialising X on-disk as shown in the last example above is not entirely the same as creating an AnnData object normally. Specifically, adata.X will show up as <HDF5 dataset "X": shape (100, 100), type "<f4"> when created in the above way, whereas the normal behaviour is to print the numpy array. I believe this gives AnnData some trouble when performing, for example, the following operation (also mentioned in the scanpy tutorial): adata.raw = adata. Hence, I would like to know if there is a way to make X into a fully-fledged, AnnData-understood dataset.
Not sure if it helps, but I find that calling adata2 = adata.to_memory() will convert the HDF5 dataset into a np.array as I would expect. However, this is not a very feasible solution, as I expect my users not to have enough memory to store X in memory. Any way I could trigger this conversion on-disk?
Sorry about the late response here! I was out of town all of July and am just catching up on issues now.
It sounds like you've got this mostly figured out, but I'll respond to a few points here:
> write_elem could survive release 0.9

It will. At worst, it will move from experimental and maybe change name to something like write_element, but that probably won't happen for 0.9.
> I believe this gives AnnData some trouble when performing for example the following operation (also mentioned in the scanpy tutorial): adata.raw = adata

Could you be more specific about this? I'm not sure I understand exactly what's going wrong. A minimal example would be great.
> Any way I could trigger this conversion on-disk?

AnnData should be fine with having a h5py.Dataset in X. So I'm unsure of what the conversion should be.
> Sorry about the late response here!

No problem, thanks for getting back to me!

> It will.

Fantastic, thank you!
For the problem description: see the example below; it is slightly long, but I hope it is clear. There are two examples of code that work when creating the dataset "normally" (through AnnData(X=X)) but produce different output when the Dataset is created on-disk first. The first I think is a real error; the second is likely just in the presentation of the result. Hope this is clearer!
Ah, I think I'm seeing what's going on.

For "Error mode 1": it looks like the issue is setting raw on a backed object, since it expects the on-disk file to have raw already...
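For anyone else hitting this, a minimal sketch of the failing pattern (file name is hypothetical and the exact exception may differ):

```python
# Sketch: reproducing "Error mode 1" – assigning .raw on a backed AnnData
import numpy as np
import anndata as ad

adata = ad.AnnData(X=np.random.rand(10, 10))
adata.write("backed_raw_demo.h5ad")
backed = ad.read_h5ad("backed_raw_demo.h5ad", backed="r+")
backed.raw = backed  # errors: the backing file does not contain "raw" yet
```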
I personally would like to deprecate raw, so I'd need some convincing to prioritize fixing this. Since adata and raw would be the same shape here, could you store this in a layer?
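For illustration, a layer-based alternative could look like this (a minimal in-memory sketch; the layer name is arbitrary):

```python
# Sketch: keep an untouched copy of X in a same-shaped layer instead of .raw
import numpy as np
import anndata as ad

adata = ad.AnnData(X=np.random.rand(100, 50).astype("float32"))
adata.layers["counts"] = adata.X.copy()  # same shape as X, so a layer fits
# X can now be transformed while the layer keeps the original values
adata.X = (adata.X - adata.X.mean(axis=0)) / adata.X.std(axis=0)
```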
For "Error mode 2", that result is expected to me. Essentially what's happening is that X
is a h5py.Dataset
.
RE 1: I see your point. Yes, storing it in a layer is possible; it just deviates from the "standard procedures" that e.g. Scanpy establishes. But then again, so does creating this file in the first place.
RE 2: I think I might misunderstand - isn't X always a h5py.Dataset (on disk)?
> isn't X always a h5py.Dataset (on disk)?

When it's a dense array, yes. When loaded into memory, it's a numpy array. In backed mode, it's not loaded into memory, and adata.X is just the h5py.Dataset.
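A quick way to see the difference (a minimal sketch):

```python
# Sketch: X is a numpy array in memory, an h5py.Dataset in backed mode
import numpy as np
import anndata as ad

adata = ad.AnnData(X=np.random.rand(5, 3).astype("float32"))
adata.write("demo.h5ad")

print(type(ad.read_h5ad("demo.h5ad").X))              # numpy.ndarray
print(type(ad.read_h5ad("demo.h5ad", backed="r").X))  # h5py._hl.dataset.Dataset
```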
Wow, that makes a lot of sense. Sorry, that really should have been obvious to me. Thanks for your patience.
I have one last question: could you provide me some intuition on how to operate on h5py.Datasets correctly? E.g. the following does not produce the output I would expect:
```python
import numpy as np
from anndata import AnnData, read_h5ad


def scaler(x: np.ndarray):
    x -= x.mean()
    x /= x.std()


adata = AnnData(X=np.random.rand(10, 10))
adata.write("test.h5ad")
adata_copy = read_h5ad("test.h5ad", backed="r+")

np.apply_along_axis(scaler, 0, adata.X)
np.apply_along_axis(scaler, 0, adata_copy.X)

print(adata[0, 0].X == adata_copy[0, 0].X)
# -> False
```
I think this is about how np.apply_along_axis works. I believe it's essentially doing func(X[slice]). With an ndarray, X[slice] is a view of the input array. With a hdf5 dataset, X[slice] reads an ndarray out of the hdf5 store – so it's acting like a copy.
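To make the view-versus-copy difference concrete, a small sketch:

```python
# Sketch: slicing an ndarray gives a view; slicing an h5py.Dataset gives a copy
import h5py
import numpy as np

arr = np.zeros((4, 4))
col = arr[:, 0]        # view into arr
col += 1
print(arr[0, 0])       # 1.0 – arr itself changed

with h5py.File("view_vs_copy.h5", "w") as f:
    dset = f.create_dataset("X", data=np.zeros((4, 4)))
    col = dset[:, 0]   # reads a fresh ndarray out of the file (a copy)
    col += 1
    print(dset[0, 0])  # 0.0 – the data on disk is untouched
    dset[:, 0] = col   # writing back has to be explicit
    print(dset[0, 0])  # 1.0
```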
I think these would work instead:
```python
def scaler(X):
    for i in X.shape[1]:
        X[:, i] -= X[:, i].mean()
        X[:, i] /= X[:, i].std()


# This might be better cause you read fewer times
def scaler(X):
    for i in X.shape[1]:
        x = X[:, i]
        x -= x.mean()
        x /= x.std()
        X[:, i] = x
```
But this may not be the fastest, since you probably want something optimized for chunk based access.
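For illustration, a chunk-aware variant might process column blocks sized to the on-disk chunk shape (a sketch; X.chunks is None for contiguous datasets, so a fallback block size is used):

```python
# Sketch: column-wise scaling in blocks aligned with the on-disk chunk layout
def scaler_chunked(X, fallback_block=1000):
    chunks = getattr(X, "chunks", None)      # h5py.Dataset exposes .chunks
    block = chunks[1] if chunks else fallback_block
    for start in range(0, X.shape[1], block):
        x = X[:, start : start + block]      # read one block of columns
        x -= x.mean(axis=0)
        x /= x.std(axis=0)
        X[:, start : start + block] = x      # write the block back
```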
You're an absolute hero, thanks a million! With a tiny fix (range) this indeed passes this test:
```python
import numpy as np
from anndata import AnnData, read_h5ad
from memory_profiler import profile


@profile
def scaler(X):
    for i in range(X.shape[1]):
        x = X[:, i]
        x -= x.mean()
        x /= x.std()
        X[:, i] = x


adata = AnnData(X=np.random.rand(100000, 1000))
adata.write("test.h5ad")
adata_copy = read_h5ad("test.h5ad", backed="r+")

scaler(adata.X)
scaler(adata_copy.X)

print(adata[0, 0].X == adata_copy[0, 0].X)
# True
```
As you say, it is much slower than it could/should be. But I guess to accelerate it one would need what we discussed over at Discourse. Or maybe some form of accessing the chunk size on disk and optimizing functions to that chunk size (e.g. adata.chunksize). This latter approach is likely more feasible, but would also benefit from some form of controlling chunking to disk through AnnData's write. Probably that's overthinking it, though(?)
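For what it's worth, the chunk shape is already reachable from the backing h5py dataset in backed mode (adata.chunksize itself does not exist; this is a sketch):

```python
# Sketch: inspect the on-disk chunk layout of X via the backing h5py dataset
import anndata as ad

adata = ad.read_h5ad("test.h5ad", backed="r")
print(adata.X.chunks)  # e.g. (1000, 1000) if X was created with that chunking, else None
```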
Just to place this back into context: the advantage of this approach is to deal with very large datasets out of memory. While slower, this does empower users of single-cell packages to process their data on a local machine, which is often preferred in my experience.
Hi,
I am looking to create an AnnData file with pre-specified dimensions and dtypes of X, obs and var. Ultimately, the aim will be to iteratively populate these slots. There is one issue with this at the moment, and for one other aspect I would appreciate feedback on whether my implementation is appropriate. The problems with this at the moment:
1. Writing obs and var on disk.
2. X does not fit in memory, so I plan to create the dataset X manually using the ideas discussed here. For this I will create X and populate it with an empty dataset of the right dtype (see Code example 2). My question is whether this will set all attributes correctly, or whether I might miss anything by creating the dataset directly. Also, I think that when I manipulate a subset of X values the entire dataset gets loaded into memory (?), and so this strategy might not work. If you have any other suggestions I would appreciate help here.

Code example 1:
Code example 2:
My library versions are:
And just to answer why I would want to do this in the first place: I am looking at a few thousand csv files totalling around 300 GB on disk that need to be joined and converted to h5ad. The data is dense. The only alternative I see is to use dask for reading and concatenation; however, I believe anndata does not currently support creating AnnData objects from dask dataframes (please correct me if I'm wrong).
Appreciate any help! Thanks,
Jesko