tanaylab / metacells-vignettes

Vignettes for the Python metacells package
https://tanaylab.github.io/metacells-vignettes/

Computing resources #2

Open MatthiasLienhard opened 6 days ago

MatthiasLienhard commented 6 days ago

I am using the vignette to explore the functionality of the metacells package. Running the cleaning step [6] `mc.ut.get_o_numpy(full, "x", sum=True)` already consumes > 50 GB of RAM and takes > 4 CPU hours. Is this expected? I did not get any warnings during installation or import of the package. If so, could you provide a more lightweight example? This feels like a waste of resources. Also, the notebook simply guesses the number of cores and uses all of them, which is a problem on HPC / shared resources; I suggest moving section 4.1.4 (on parallelization) to the top of the run.
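For example, a minimal sketch of what I mean by capping the core count up front. These are the standard OpenMP / BLAS thread-pool environment variables, not metacells-specific settings, and the count of 4 is just a placeholder for whatever the scheduler actually allocated:

```python
# Cap the number of cores used by the numerical stack.
# These must be set BEFORE importing numpy / metacells, otherwise the
# thread pools are already sized to the full machine.
import os

REQUESTED_CORES = "4"  # placeholder: what the HPC scheduler allocated

for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = REQUESTED_CORES

import numpy as np  # thread pools now sized to REQUESTED_CORES
```

Putting a cell like this at the very top of the notebook would make it safe to run on shared nodes without any further changes.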

orenbenkiki commented 6 days ago

Admittedly metacells can consume a lot of memory and CPU. For larger data sets (millions of cells) we use 100s of GBs... I wouldn't run metacells on a laptop for any realistic data set.

A lot of the memory usage is because of anndata not using memory mapping (because of "reasons"). We are working on a replacement for anndata which does use memory mapping and would greatly reduce memory consumption.
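To illustrate the mechanism (this is just a generic numpy sketch, not the actual replacement code): with memory mapping the OS pages data in from disk on demand, instead of holding a full in-memory copy the way anndata does.

```python
import os
import tempfile

import numpy as np

# Write a matrix to disk, then reopen it memory-mapped: pages are read
# from disk as they are touched, so RAM usage stays far below the full
# array size for large matrices.
path = os.path.join(tempfile.mkdtemp(), "x.npy")
np.save(path, np.arange(10_000, dtype=np.int64).reshape(100, 100))

mapped = np.load(path, mmap_mode="r")  # no full in-memory copy
col_sums = mapped.sum(axis=0)          # pages read only as needed
```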

That said, I agree 50GB is "a bit much" for a vignette. Not sure when I'll be able to create a new data set, though...

Reducing the amount of parallelization won't help with the memory consumption of the parts before the actual computation of the metacells. Still, noting this right at the start sounds like a good idea. Thanks!