scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Writing a h5py.Dataset loads the whole thing into memory #1623

Open ivirshup opened 2 weeks ago

ivirshup commented 2 weeks ago


Report

Code:

%load_ext memory_profiler

import h5py
from anndata.experimental import write_elem
import numpy as np

f = h5py.File("tmp.h5", "w")
X = np.ones((10_000, 10_000))

%memit write_elem(f, "X", X)
# peak memory: 940.14 MiB, increment: 0.00 MiB

%memit write_elem(f, "X2", f["X"])
# peak memory: 1702.89 MiB, increment: 762.75 MiB

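The second write, whose source is an h5py.Dataset rather than an in-memory array, nearly doubles peak memory. The 762.75 MiB increment matches the size of the full array (10,000 × 10,000 float64 values at 8 bytes each is about 763 MiB), so the whole dataset is being read into memory before it is written. We can move to a chunked approach to writing pretty easily, based on the solution suggested here: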

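# src_ds is an existing h5py.Dataset; the dtype override is optional
dst_ds = f.create_dataset_like('dst', src_ds, dtype=np.int64)

# iter_chunks() requires a chunked source; each chunk is a tuple of slices
for chunk in src_ds.iter_chunks():
    dst_ds[chunk] = src_ds[chunk]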
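A slightly fuller sketch of that idea, for illustration only. The helper name `copy_dataset_chunked` and the `rows_per_block` parameter are made up here and are not part of anndata or h5py. It assumes the source has at least one dimension, and it falls back to copying blocks of rows when the source is stored contiguously, since `iter_chunks()` raises a `TypeError` for datasets without a chunked layout.

```python
import os
import tempfile

import h5py
import numpy as np


def copy_dataset_chunked(src, dst_group, name, rows_per_block=1024):
    """Copy an h5py.Dataset into dst_group[name] one piece at a time.

    Hypothetical helper: only one chunk (or one block of rows) is held
    in memory at any point, instead of the whole array.
    """
    # Same shape, dtype, chunking and compression settings as the source.
    dst = dst_group.create_dataset_like(name, src)
    if src.chunks is not None:
        # Chunked layout: iter_chunks() yields a tuple of slices per chunk.
        for chunk in src.iter_chunks():
            dst[chunk] = src[chunk]
    else:
        # Contiguous layout: step over the first axis in fixed-size blocks.
        for start in range(0, src.shape[0], rows_per_block):
            stop = min(start + rows_per_block, src.shape[0])
            dst[start:stop] = src[start:stop]
    return dst


# Small demonstration on a temporary file.
with tempfile.TemporaryDirectory() as tmpdir:
    with h5py.File(os.path.join(tmpdir, "tmp.h5"), "w") as f:
        f.create_dataset("X", data=np.ones((1000, 100)), chunks=(100, 100))
        X2 = copy_dataset_chunked(f["X"], f, "X2")
        same = bool(np.array_equal(X2[...], f["X"][...]))
```

Peak memory then scales with the chunk or block size rather than with the dataset size. Whether this is fast enough for compressed or very small chunks has not been measured here.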

Versions

-----
IPython             8.26.0
anndata             0.11.0.dev168+g8cc5a18
h5py                3.11.0
numpy               1.26.4
session_info        1.0.0
-----
asciitree           NA
asttokens           NA
bottleneck          1.4.0
cloudpickle         3.0.0
cython_runtime      NA
dask                2024.8.1
dateutil            2.9.0.post0
decorator           5.1.1
executing           2.0.1
importlib_metadata  NA
jedi                0.19.1
jinja2              3.1.4
markupsafe          2.1.5
memory_profiler     0.61.0
msgpack             1.0.8
natsort             8.4.0
numcodecs           0.13.0
numexpr             2.10.1
packaging           24.1
pandas              2.2.1
parso               0.8.4
prompt_toolkit      3.0.47
psutil              5.9.8
pure_eval           0.2.2
pyarrow             15.0.2
pygments            2.18.0
pytz                2024.1
scipy               1.12.0
setuptools          70.3.0
six                 1.16.0
stack_data          0.6.3
tblib               3.0.0
tlz                 0.12.1
toolz               0.12.1
traitlets           5.14.3
typing_extensions   NA
wcwidth             0.2.13
yaml                6.0.1
zarr                2.18.2
zipp                NA
-----
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
Linux-6.8.0-1010-aws-x86_64-with-glibc2.39
-----
Session information updated at 2024-08-28 22:36

ivirshup commented 2 weeks ago

Some complications: