scverse / mudata

Multimodal Data (.h5mu) implementation for Python
https://mudata.rtfd.io
BSD 3-Clause "New" or "Revised" License

Loading MuData objects from disk is slow, but it's not the IO #4

Closed · ivirshup closed this 1 year ago

ivirshup commented 2 years ago

Describe the bug

Loading MuData objects from disk is slow, but it's not the IO

To Reproduce

Using mudatasets and snakeviz:

import mudatasets
import mudata

# download the example dataset and write it back out as a single .h5mu file
brain3k = mudatasets.load("brain3k_multiome")
brain3k.write_h5mu("brain3k.h5mu")

# profile the read in a separate notebook cell
%%snakeviz
mudata.read_h5mu("brain3k.h5mu")
[snakeviz profile of read_h5mu]

75% of the time is spent after the data is in memory.
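For a rough reproduction outside a notebook, the same hotspots show up with the standard-library profiler (a minimal sketch; snakeviz is only a viewer on top of this kind of output):

```python
import cProfile
import pstats

import mudata

# profile the read and print the 15 most expensive calls by cumulative time
cProfile.run('mudata.read_h5mu("brain3k.h5mu")', "read_h5mu.prof")
pstats.Stats("read_h5mu.prof").sort_stats("cumtime").print_stats(15)
```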

Expected behaviour

IO operations should be disk-bound.

Version info

```
-----
mudata              0.1.0
mudatasets          0.0.1
session_info        1.0.0
-----
anndata             0.7.9.dev5+g62089e2
anyio               NA
appnope             0.1.2
asciitree           NA
attr                21.2.0
babel               2.9.1
backcall            0.2.0
certifi             2021.10.08
charset_normalizer  2.0.8
cloudpickle         2.0.0
cython_runtime      NA
dask                2021.11.2
dateutil            2.8.2
debugpy             1.5.1
decorator           5.1.0
entrypoints         0.3
fasteners           NA
fsspec              2021.11.1
google              NA
h5py                3.6.0
idna                3.3
ipykernel           6.5.1
ipython_genutils    0.2.0
jedi                0.18.1
jinja2              3.0.3
json5               NA
jsonschema          4.2.1
jupyter_server      1.12.0
jupyterlab_server   2.8.2
llvmlite            0.37.0
markupsafe          2.0.1
mpl_toolkits        NA
msgpack             1.0.3
natsort             8.0.0
nbclassic           NA
nbformat            5.1.3
numba               0.54.1
numcodecs           0.9.1
numexpr             2.7.3
numpy               1.20.3
packaging           21.3
pandas              1.3.4
parso               0.8.2
pexpect             4.8.0
pickleshare         0.7.5
pkg_resources       NA
prometheus_client   NA
prompt_toolkit      3.0.22
psutil              5.8.0
ptyprocess          0.7.0
pvectorc            NA
pydev_ipython       NA
pydevconsole        NA
pydevd              2.6.0
pydevd_concurrency_analyser NA
pydevd_file_utils   NA
pydevd_plugins      NA
pydevd_tracing      NA
pyexpat             NA
pygments            2.10.0
pyrsistent          NA
pytz                2021.3
requests            2.26.0
scipy               1.7.3
send2trash          NA
setuptools_scm      NA
sitecustomize       NA
six                 1.16.0
sniffio             1.2.0
sparse              0.13.0
sphinxcontrib       NA
storemagic          NA
tblib               1.7.0
terminado           0.12.1
tlz                 0.11.2
toolz               0.11.2
tornado             6.1
tqdm                4.62.3
traitlets           5.1.1
typing_extensions   NA
urllib3             1.26.7
wcwidth             0.2.5
websocket           1.2.1
yaml                6.0
zarr                2.10.3
zmq                 22.3.0
-----
IPython             7.29.0
jupyter_client      7.1.0
jupyter_core        4.9.1
jupyterlab          3.2.4
notebook            6.4.6
-----
Python 3.9.9 (main, Nov 21 2021, 03:23:42) [Clang 13.0.0 (clang-1300.0.29.3)]
macOS-11.6.1-x86_64-i386-64bit
-----
Session information updated at 2021-12-08 22:09
```
gtca commented 2 years ago

Thanks! It seems to be due to the current .update() logic, which we plan to, well, update, to speed it up when obs/var names are unique (as they should be). We also recreate a few things on load that we don't store in the file, including metadata columns from the individual modalities, so it is expected to take somewhat longer than loading the two modalities individually.
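For context, the global-table rebuild amounts to something like the following (a rough sketch of the idea, not the actual mudata internals; collate_obs is a hypothetical helper):

```python
import pandas as pd

def collate_obs(modalities):
    """Hypothetical helper: collate per-modality .obs into one global table."""
    frames = [
        mod.obs.add_prefix(f"{name}:")  # prefix columns with the modality name
        for name, mod in modalities.items()
    ]
    # aligning on the index (an outer join across modalities) recovers a global
    # obs table even when the modalities cover different sets of cells; this
    # alignment is the part that gets expensive for millions of barcodes
    return pd.concat(frames, axis=1, join="outer")
```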

ivirshup commented 2 years ago

What's the plan for update? Is there an issue for it I could look at?

I must admit, I also don't totally remember what it's meant to do. Apart from recalculating sizes, what does it do?

gtca commented 2 years ago

To get back to this after some discussions: I've opened #16, and a draft implementation of a solution is in #17.

ivirshup commented 2 years ago

As an update here: reading the unfiltered data in the PBMC protein example is extremely slow due to MuData creation. It seems to take between 10x and 15x the IO time, even though obs_names should be exactly the same in this case.

[snakeviz profile of the unfiltered PBMC protein read]
gtca commented 2 years ago

@ivirshup, I see the issue, but have you compared with scanpy?

# read_10x_scanpy.py
import scanpy as sc
adata = sc.read_10x_mtx("muon-tutorials/data/pbmc5k_protein/filtered_feature_bc_matrix/")

$ time python read_10x_scanpy.py
# => 33.63s user 2.02s system 93% cpu 38.188 total

# read_10x_muon.py
import muon as mu
mdata = mu.read_10x_mtx("muon-tutorials/data/pbmc5k_protein/filtered_feature_bc_matrix/")

$ time python read_10x_muon.py
# => 34.56s user 2.52s system 89% cpu 41.559 total

Another way to show what I mean:

# read both modalities into one AnnData, then split by feature type
both = sc.read_10x_mtx("muon-tutorials/data/pbmc5k_protein/filtered_feature_bc_matrix/", gex_only=False)
prot = both[:, both.var.feature_types == "Antibody Capture"].copy()
rna = both[:, both.var.feature_types == "Gene Expression"].copy()

from mudata import MuData

%%timeit
mdata = MuData({"rna": rna, "prot": prot})
# => 91.1 ms ± 21.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
ivirshup commented 2 years ago

Yes. There didn't seem to be much overhead for the filtered dataset; it's the "raw" one that took a really long time.

%%time
adata = sc.read_10x_h5("/Users/isaac/data/5k_pbmc_protein_v3/raw_feature_bc_matrix.h5", gex_only=False)
# => CPU times: user 6.85 s, sys: 549 ms, total: 7.39 s
# => Wall time: 7.41 s

%%time
mdata = muon.read_10x_h5("/Users/isaac/data/5k_pbmc_protein_v3/raw_feature_bc_matrix.h5")
# => CPU times: user 3min 25s, sys: 22.9 s, total: 3min 48s
# => Wall time: 3min 57s

Also, I would generally recommend pointing people toward 10x's .h5 files, since reading those is much faster than reading the mtx text files.

ivirshup commented 2 years ago

I think the issue is the overhead of MuData construction when there are 6,794,880 barcodes.

gtca commented 2 years ago

@ivirshup Can you try #24?

%%time
adata = sc.read_10x_h5("5k_pbmc_protein_v3_raw_feature_bc_matrix.h5", gex_only=False)
# => CPU times: user 9.43 s, sys: 1.15 s, total: 10.6 s
# => Wall time: 10.8 s

%%time
mdata = mu.read_10x_h5("5k_pbmc_protein_v3_raw_feature_bc_matrix.h5")
# => CPU times: user 19.6 s, sys: 2.26 s, total: 21.8 s
# => Wall time: 22.6 s

mdata.shape
# => (6794880, 33570)
ivirshup commented 2 years ago

That seems better, though it still looks slower than I'd expect. To my mind it shouldn't be much slower than reading, followed by prot.obs_names.equals(rna.obs_names).
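The cost floor being argued for here, roughly sketched (assuming the mdata from the timings above; MuData keeps its modalities in mdata.mod):

```python
# once both modalities are in memory, confirming that they share barcodes
# is a single cheap index comparison; anything read_10x_h5 spends beyond
# the raw read plus this check is construction overhead
rna, prot = mdata.mod["rna"], mdata.mod["prot"]
%time prot.obs_names.equals(rna.obs_names)
```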

Any plans to simplify the _update_attr function?

gtca commented 2 years ago

@ivirshup I would agree, but we do use data frame joins to collate the data and to get the global obs_names/var_names. We could potentially take care of those steps ourselves, but that would increase the amount of code in _update_attr rather than reduce it.

A simple step would be to concatenate rather than join when we're sure the indices are the same. I believe that would address this use case, and also most practical use cases in current multimodal workflows when importing data. We would still need to run joins (or an equivalent operation) when the indices are not the same (e.g. when one of the modalities has been filtered).
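In pandas terms, the proposed fast path would look something like this (df_rna and df_prot stand in for the per-modality .obs tables; purely illustrative, not the mudata code):

```python
import pandas as pd

if df_rna.index.equals(df_prot.index):
    # identical indices: a plain column-wise concat, no row alignment needed
    global_obs = pd.concat([df_rna, df_prot], axis=1)
else:
    # indices differ (e.g. one modality was filtered): fall back to a join
    global_obs = df_rna.join(df_prot, how="outer")
```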

ivirshup commented 2 years ago

I think it could be reasonable to join, but I don't think the function is only doing a join. From the profiling, most of the excess time comes from various isin calls.

[snakeviz profile of _update_attr]

(the pink box is the concat)
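A rough illustration of why those calls dominate (synthetic barcodes standing in for the ~6.8M raw 10x ones; exact numbers will vary):

```python
import pandas as pd

n = 1_000_000
idx = pd.Index([f"BC{i}-1" for i in range(n)])
idx2 = idx.copy()

%time idx.isin(idx2).all()  # builds a hash table and probes every element
%time idx.equals(idx2)      # compares the underlying arrays wholesale: far cheaper
```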

As an aside, I think I would also like the option to not join the columns. For example, maybe I want to create a bundle of objects in a MuData, then normalize annotations, and then create the joint tables.
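Sketched out, continuing from the MuData construction above (today the joint tables are built eagerly on construction; deferring that is exactly the ask here):

```python
mdata = MuData({"rna": rna, "prot": prot})  # ideally this could skip the join

# normalize per-modality annotations first...
for mod in mdata.mod.values():
    mod.obs.columns = mod.obs.columns.str.lower()  # hypothetical cleanup step

# ...then rebuild the joint obs/var tables once, at the end
mdata.update()
```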

gtca commented 2 years ago

Actually, the join on millions of raw barcodes was not the issue in this case.

I've profiled it further and got rid of redundant index comparisons when the indices haven't changed. This is in #24.

%%time
adata = sc.read_10x_h5("5k_pbmc_protein_v3_raw_feature_bc_matrix.h5", gex_only=False)
# => CPU times: user 9.58 s, sys: 1.18 s, total: 10.8 s
# => Wall time: 11.3 s

%%time
mdata = mu.read_10x_h5("5k_pbmc_protein_v3_raw_feature_bc_matrix.h5")
# => CPU times: user 11.7 s, sys: 1.49 s, total: 13.2 s
# => Wall time: 13.7 s

Speaking of:

> Also, I would generally recommend pointing people toward 10x's .h5 files, since reading those is much faster than reading the mtx text files.

Is there a reason why read_10x_h5 is so much less feature-complete than read_10x_mtx, then? I'd expect at least to be able to read gene_ids, as they are available in the .h5 file.

ivirshup commented 2 years ago

> I'd expect at least to be able to read gene_ids, as they are available in the .h5 file.

Are there times when it doesn't do that?

[screenshot: gene_ids present in .var after read_10x_h5]
gtca commented 2 years ago

@ivirshup, sorry, that wasn't the best phrasing on my part... I meant reading them as var_names (var_names="gene_ids").
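In other words, scanpy's mtx reader already exposes this choice, and the ask is an equivalent option for the h5 reader:

```python
import scanpy as sc

# read_10x_mtx lets you index variables by Ensembl IDs instead of gene
# symbols; read_10x_h5 has no such parameter
adata = sc.read_10x_mtx(
    "filtered_feature_bc_matrix/",
    var_names="gene_ids",  # default is "gene_symbols"
)
```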