scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io

Caching in Scanpy #947

Open Marius1311 opened 4 years ago

Marius1311 commented 4 years ago

Hi all! I wanted to make you aware of a caching extension for scanpy and scvelo that @michalk8 and I have developed, called scachepy, and to kick off a discussion about caching in scanpy. From my point of view, there are currently two main ways to cache your results in scanpy; please correct me if I'm wrong.

The idea of scachepy is to offer the possibility of caching all fields of an AnnData object associated with a certain function call, e.g. sc.pp.pca. It allows you to globally define a caching directory and a backend (default is pickle) that the cached objects are written to. In the case of PCA, this amounts to calling

import scachepy
c = scachepy.Cache(<directory>) 
c.pp.pca(adata)

where c.pp.pca wraps sc.pp.pca but accepts additional caching arguments such as force (see the sketch below). So in short, our aim with scachepy is to....
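
For illustration, a repeated run could then look roughly like this; the backend and force keyword names are taken from the description above and are assumptions, not a verbatim copy of the scachepy signature:

import scachepy

c = scachepy.Cache("./cache", backend="pickle")  # caching directory + backend (assumed keyword)
c.pp.pca(adata)              # first call: computes PCA via sc.pp.pca and writes the result to the cache
c.pp.pca(adata)              # later calls: load the cached PCA fields instead of recomputing
c.pp.pca(adata, force=True)  # force recomputation and refresh the cached result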

@michalk8 is the main developer and will be able to tell you much more about it. I would appreciate any input, and would love to discuss caching in scanpy/scvelo.

LuckyMD commented 4 years ago

I'm not sure I fully understand the point of caching. So you store the exact output of all the computations of a function so that it can be rerun exactly? How big do those objects become?

michalk8 commented 4 years ago

> I'm not sure I fully understand the point of caching. So you store the exact output of all the computations of a function so that it can be rerun exactly? How big do those objects become?

We've had problems in the past where running notebooks on different computers (different distros, or just on the server) or simply updating a library produced different results in terms of embedding/clustering...

The other benefit is that when analyzing the data in multiple stages (or multiple times), you'd otherwise have to either store the adata object after each stage and then load it for the next one, or run everything from scratch, which can take some time. Not to mention a forgotten parameter, which affects reproducibility. Caching makes this convenient: just rerun the notebook.
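
For comparison, the multi-stage workflow described above would look roughly like this without caching (a sketch using standard scanpy/AnnData I/O; the stages and file names are made up):

import scanpy as sc

# stage 1: preprocessing, then persist the full object to disk
adata = sc.read_h5ad("raw.h5ad")
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.pca(adata)
adata.write_h5ad("stage1.h5ad")

# stage 2, possibly days later or on another machine: reload everything first
adata = sc.read_h5ad("stage1.h5ad")
sc.pp.neighbors(adata)
sc.tl.umap(adata)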

We only store the attributes generated by each function, so the size depends on what you cache and on the dimensionality of the data. For ~8k cells, PCA takes up to 8 MB (if I remember correctly). Currently there's no compression scheme in place, but I have it on my todo list. The other thing would be to give the user more runtime control over what gets cached.
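
To make "only the attributes generated by each function" concrete: for sc.pp.pca this means persisting just the PCA-related fields rather than the whole AnnData. A simplified sketch of that idea (not scachepy's actual implementation) might look like:

import pickle

import scanpy as sc

def cached_pca(adata, path="pca.pickle", force=False):
    """Run sc.pp.pca, storing/loading only the fields it creates."""
    try:
        if force:
            raise FileNotFoundError
        with open(path, "rb") as f:
            fields = pickle.load(f)
        # restore only the PCA outputs into the AnnData object
        adata.obsm["X_pca"] = fields["X_pca"]
        adata.varm["PCs"] = fields["PCs"]
        adata.uns["pca"] = fields["pca"]
    except FileNotFoundError:
        sc.pp.pca(adata)
        fields = {
            "X_pca": adata.obsm["X_pca"],
            "PCs": adata.varm["PCs"],
            "pca": adata.uns["pca"],
        }
        with open(path, "wb") as f:
            pickle.dump(fields, f)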

flying-sheep commented 4 years ago

So basically this would be a transaction layer, right? Like subsequent lines in a Dockerfile:

  1. An AnnData with certain initial data starts with a hash computed from that data.
  2. Each interaction creates a new state with an associated hash. The difference between two states (and the only thing that has to be stored) is the set of properties that changed.
  3. If you rerun a script with modifications, all steps that didn't change just forward to the next state; all states after a change are deleted and those steps are re-executed.

Did I get this right?
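
A toy sketch of that idea (purely illustrative, not part of scanpy or scachepy): each state's hash is derived from its parent's hash plus the step name and parameters, so an unchanged prefix of the pipeline keeps its cache keys and anything downstream of a change is invalidated.

import hashlib
import json

def state_hash(parent_hash, step_name, params):
    """Hash of a pipeline state = parent hash + step name + parameters."""
    payload = json.dumps(
        {"parent": parent_hash, "step": step_name, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

h0 = "hash-of-initial-data"  # placeholder for a hash of the raw AnnData
h1 = state_hash(h0, "pp.pca", {"n_comps": 50})
h2 = state_hash(h1, "pp.neighbors", {"n_neighbors": 15})
# changing a parameter in the PCA step changes h1 and therefore h2,
# so every state downstream of the change is invalidated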

michalk8 commented 4 years ago

Yes, that's right. However, I don't have such a thing implemented yet (the compression is done, though).