scverse / mudata

Multimodal Data (.h5mu) implementation for Python
https://mudata.rtfd.io
BSD 3-Clause "New" or "Revised" License

Make update faster in some cases (#16) #17

Closed gtca closed 2 years ago

gtca commented 2 years ago

Addresses #16.

When attr_names (typically var_names) are unique and do not intersect between modalities, this should greatly improve the speed of mdata.update().
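To make the condition above concrete, here is a minimal sketch of a check for "unique and not intersecting" names. It is not the code from this PR: `names_unique_and_disjoint` is a hypothetical helper, and the plain dict of lists stands in for the per-modality var_names.

```python
from itertools import chain


def names_unique_and_disjoint(names_by_mod):
    """Return True if every name occurs exactly once across all modalities.

    Hypothetical helper for illustration only; not part of mudata.
    """
    all_names = list(chain.from_iterable(names_by_mod.values()))
    # A set drops duplicates, so equal lengths mean there is no duplicate
    # within a modality and no name shared between modalities.
    return len(set(all_names)) == len(all_names)


# Names prefixed with the modality, as in the benchmark script: disjoint.
disjoint = {
    "mod1": ["mod1var_0", "mod1var_1"],
    "mod2": ["mod2var_0", "mod2var_1"],
}
# "gene_b" appears in both modalities: the names intersect.
overlapping = {
    "mod1": ["gene_a", "gene_b"],
    "mod2": ["gene_b", "gene_c"],
}

print(names_unique_and_disjoint(disjoint))
print(names_unique_and_disjoint(overlapping))
```

A single pass like this is cheap relative to a general merge of the name indices, which is presumably why such a case can be handled separately.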

gtca commented 2 years ago

This is how the speed-up looks for the attached script (MuData of size 100 x 600_000):

# master*
> python update-1.py
Creation time: 11.650s
Update time: 15.865s

# update-1*
> python update-1.py
Creation time: 0.886s
Update time: 0.931s

update-1.py

```py
# update-1.py

import time
import numpy as np
from anndata import AnnData
from mudata import MuData

n_mod = 3
mods = dict()
times = dict()

np.random.seed(100)

for i in range(n_mod):
    i1 = i + 1
    m = f"mod{i1}"
    mods[m] = AnnData(X=np.random.normal(size=10_000_000*i1).reshape(-1, 100_000*i1))
    mods[m].obs["mod"] = m
    mods[m].var["mod"] = m

for m, mod in mods.items():
    mod.var_names = [f"{m}var_{j}" for j in range(mod.n_vars)]

timer = time.process_time()
mdata = MuData(mods)
times["creation"] = time.process_time() - timer

timer = time.process_time()
mdata.update()
times["update"] = time.process_time() - timer

print(f"Creation time: {times['creation']:.3f}s")
print(f"Update time: {times['update']:.3f}s")
```

ilia-kats commented 2 years ago

Maybe add a test for partially intersecting obs_names? I believe that we hit a bug in that recently. Otherwise, LGTM.
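
As a rough illustration of what such a test could compare against, the sketch below builds the expected union of partially intersecting obs_names and a per-modality membership mask in plain Python. `union_with_membership` is a hypothetical reference helper, not a mudata function, and the first-seen ordering of the union is an assumption of this sketch rather than a claim about how MuData orders its observations.

```python
def union_with_membership(names_by_mod):
    """Return the ordered union of names and a boolean mask per modality.

    Hypothetical reference helper for a test; not part of mudata.
    """
    # dict.fromkeys keeps first-seen order while dropping repeated names.
    union = list(
        dict.fromkeys(n for names in names_by_mod.values() for n in names)
    )
    masks = {}
    for mod, names in names_by_mod.items():
        present = set(names)
        # True where the name from the union exists in this modality.
        masks[mod] = [n in present for n in union]
    return union, masks


# Two modalities sharing only cell_1 and cell_2 (partial intersection).
obs_names = {
    "rna": ["cell_0", "cell_1", "cell_2"],
    "atac": ["cell_1", "cell_2", "cell_3"],
}

union, masks = union_with_membership(obs_names)
print(union)
print(masks)
```

A test along these lines would build a MuData from modalities with these obs_names, call update(), and check that the resulting set of observation names matches `set(union)`.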