Open Marius1311 opened 4 years ago
I'm not sure I fully understand the point of caching. So you store the exact output of all the computations of a function so that it can be rerun exactly? How big do those objects become?
We've had problems in the past where running notebooks on different computers (different distros, or just using the server), or simply updating a library, produced different results in terms of embedding/clustering.
The other benefit shows up when analyzing the data in multiple stages (or multiple times): you'd have to either store the adata object after each stage and load it for the next one, or run everything from scratch, which can take some time. Not to mention a forgotten parameter, which affects reproducibility. The caching makes this convenient: just run the notebook.
We only store the attributes generated by each function, so the size depends on what you cache and on the dimensionality of the data. For ~8k cells, PCA takes up to 8 MB (if I remember correctly). Currently, there's no compression scheme in place, but I have it on my todo list. The other thing would be to give the user more control at runtime over what gets cached.
So basically this would be a transaction layer, right? Like subsequent lines in a Dockerfile:
Did I get this right?
Yes, that's right. However, I don't have such a thing implemented yet (the compression is done, though).
Hi all! I wanted to make you aware of a caching extension for scanpy and scvelo called scachepy, which @michalk8 and I have developed, and to kick off a discussion about caching in scanpy. From my point of view, there are currently two main ways to cache your results in scanpy (please correct me if I'm wrong):
The idea of scachepy is to offer the possibility to cache all fields of an AnnData object associated with a certain function call, e.g. `sc.pp.pca`. It allows you to globally define a caching directory and a backend (default is pickle) that the cached objects will be written to. In the case of PCA, this would amount to calling `c.pp.pca`, which wraps around `sc.pp.pca` but takes additional caching arguments like `force`. So in short, our aim with scachepy is to ...
@michalk8 is the main developer and will be able to tell you much more about it. I would appreciate any input, and would love to discuss caching in scanpy/scvelo.
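To make the wrapper idea concrete, here is a rough sketch of how a caching wrapper with a `force` argument and a pickle backend could work. This is not scachepy's actual code or API: `cached`, `toy_pca`, the `keys` argument and the `SimpleNamespace` stand-in for an AnnData object are all made up for illustration, and only the standard library is used.

```python
import pickle
import tempfile
from pathlib import Path
from types import SimpleNamespace


def cached(func, cache_dir, fname, keys):
    """Wrap `func(data)` so the `data.obsm` entries named in `keys` are cached.

    Hypothetical helper: only the fields generated by `func` are written
    to disk, not the whole object.
    """
    path = Path(cache_dir) / fname

    def wrapper(data, *args, force=False, **kwargs):
        if path.is_file() and not force:
            # Cache hit: restore the stored fields and skip the computation.
            with open(path, "rb") as fh:
                data.obsm.update(pickle.load(fh))
            return data
        # Cache miss (or force=True): compute, then store the generated fields.
        func(data, *args, **kwargs)
        path.parent.mkdir(parents=True, exist_ok=True)
        with open(path, "wb") as fh:
            pickle.dump({k: data.obsm[k] for k in keys}, fh)
        return data

    return wrapper


calls = {"n": 0}  # counts how often the "expensive" function really runs


def toy_pca(data):
    """Stand-in for an expensive step such as a PCA; writes data.obsm['X_pca']."""
    calls["n"] += 1
    data.obsm["X_pca"] = [[x * 0.5 for x in row] for row in data.X]


with tempfile.TemporaryDirectory() as tmp:
    pca = cached(toy_pca, tmp, "pca.pickle", keys=["X_pca"])
    adata = SimpleNamespace(X=[[1.0, 2.0]], obsm={})  # toy stand-in, not AnnData
    pca(adata)              # computes and writes pca.pickle
    pca(adata)              # loads from the cache, toy_pca is not called
    pca(adata, force=True)  # ignores the cache and recomputes
```

Storing only the listed fields keeps the cache files small compared to saving the whole object after every stage, at the cost of having to declare which fields each function writes.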