saezlab / scverse_hackathon


User story: reproducibility mode #6

Open · grst opened this issue 1 year ago

grst commented 1 year ago

Let's assume I did some gene identifier remapping to harmonize several datasets (see #1). I want a way to make sure that if I rerun the same analysis in 2 years, I get exactly the same result (the database might have been updated in the meanwhile).

This requires versioning of the database. An alternative solution would be to store a local copy of the database (see #5) that I can archive alongside the project.

slobentanzer commented 1 year ago

This is already a use case in BioCypher. The low-tech solution is to build a project-specific prior knowledge source that can be shared (flat files) and archived alongside the project. These can be quite small even for genome-scale information (dozens of MBs); identifier mapping flat files can simply be saved as CSVs and only take up a couple of MBs at most. But I guess there are other cases where the prior knowledge is more complex.
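To make the flat-file idea concrete, a minimal sketch of archiving and reloading an identifier mapping as CSV; the file name and columns are made up for illustration:

```python
import pandas as pd

# Hypothetical mapping table archived alongside the project;
# columns are made up for this sketch.
mapping = pd.DataFrame(
    {
        "alias": ["TP53", "BRCA1"],
        "ensembl_id": ["ENSG00000141510", "ENSG00000012048"],
    }
)
mapping.to_csv("gene_id_mapping.csv", index=False)

# Two years later: reload the archived mapping instead of
# querying the (possibly updated) online database.
mapping = pd.read_csv("gene_id_mapping.csv")
```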

There is also versioning functionality built into BioCypher, but that requires maintenance, something which many labs are not willing or able to do. So for a lab-maintained prior knowledge store, sure; for individual projects, I would go with the flat files.

grst commented 1 year ago

I agree that project-specific flat files are probably the most reliable option.

Can you elaborate how I would do that in practice (or point me to a resource)?

As a user, I probably don't want to go through the trouble of building a project-specific knowledge source, but would rather have the package create those flat files automatically when querying the (online) database.

E.g.:

```python
# might even be the default
annotation_pkg.enable_reproducibility_mode(cache=".cache")

# The first function call uses the online database and creates a backup in `.cache`.
annotate_gene_symbols(adata, source="ensembl")

# The second function call uses the cached version.
annotate_gene_symbols(adata, source="ensembl")
```

slobentanzer commented 1 year ago

This is technically tricky; we have the same issue with pypath, which does exactly what you describe in your example, but its cache management goes no further than that. So at the moment, the only way to do this with pypath's cache module is to copy the .cache folder and revert to that state once you want to reproduce the analysis. (One could also set pypath's cache path to the project folder programmatically, but that seems too hacky for a transparent, easy-to-use function.)
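A minimal sketch of that copy/revert workflow using only the standard library; the paths are illustrative and this is not pypath API:

```python
import shutil
from pathlib import Path

cache = Path.home() / ".cache" / "pypath"    # illustrative cache location
archive = Path("project") / "pypath_cache"   # archived alongside the project

# Archive the cache state right after the original analysis ...
shutil.copytree(cache, archive, dirs_exist_ok=True)

# ... and restore it before re-running the analysis years later.
shutil.rmtree(cache)
shutil.copytree(archive, cache)
```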

We have been searching for a Python cache manager, but most of those cache function calls, not disk content. The closest to what we want is @motey's Buffy: https://git.connect.dzd-ev.de/dzdpythonmodules/buffy/-/tree/main/buffy. However, it is in alpha and currently only runs with a local Redis instance, which is not something I would trust the average analyst to set up just to cache their prior knowledge.

BioCypher can do the same thing, and can even write to a project subdirectory without issue, but cache save/load functionality is not implemented yet. Most useful for the future would probably be for the separate user-friendly module (which has biocypher as a dependency) to be able to save and load a local cache in formats other than the current Neo4j CSVs. Which formats would be best we still need to determine; it probably also depends on the scope of the project.

The user-friendly module could import both biocypher and the cache manager (possibly Buffy) and then use BioCypher to build the prior knowledge, caching it to the project directory via the manager.

@deeenes

motey commented 1 year ago

I just finished my current project and will evaluate with @mpreusse next week whether we can invest some more time in Buffy: bring it to a 1.0.0 release and maybe even write a Python-native state storage solution.

grst commented 1 year ago

You'd expect caching Python functions to be kind of a solved problem :thinking: What about https://joblib.readthedocs.io/en/latest/memory.html?
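For reference, joblib's `Memory` memoizes function results on disk, keyed by the call arguments; a minimal sketch, where `fetch_mapping` is a made-up stand-in for a query against an online database:

```python
from joblib import Memory

memory = Memory(".cache", verbose=0)

@memory.cache
def fetch_mapping(source: str) -> dict:
    # Hypothetical stand-in for an online database query.
    return {"TP53": "ENSG00000141510"}

fetch_mapping("ensembl")  # first call runs the function and caches the result
fetch_mapping("ensembl")  # second call is served from .cache
```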

But maybe we also don't want to cache the output of the function, but rather the datasource the function uses as input?

ivirshup commented 1 year ago

> we also don't want to cache the output of the function, but rather the datasource the function uses as input?

Kinda my thinking, though I'm not sure caching is the right way to think about reproducibility here.

I'm pretty sure all of the data providers here are going to have versioned releases. I think we should just be able to tie an analysis to a release. It would also be good to store some metadata on annotated objects about which versions of which sources were used (which could allow automatically selecting the right resource downstream).
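A sketch of what that provenance metadata could look like on an AnnData object; the `.uns` key and fields below are not an agreed-upon convention, just an illustration:

```python
import anndata as ad
import numpy as np

adata = ad.AnnData(np.zeros((3, 2)))

# Record which source and release produced the annotation, so downstream
# steps could automatically select the matching resource.
adata.uns["annotation_sources"] = {
    "gene_symbols": {"source": "ensembl", "release": "110"},  # values illustrative
}
```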

grst commented 1 year ago

> Kinda my thinking, though I'm not sure caching is the right way to think about reproducibility here.

I would still trust a flat file on my disk / in my git repo more than having to rely on some web service still being online in $n$ years, even if it is versioned.

slobentanzer commented 1 year ago

Agree with both; this is what I meant: most Python caching solutions cache functions to speed up execution. That's not what we need. We want a "download manager" for the original input files that we use to build our project database, optimally with settings such as lifetime, version, etc.
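To make that requirement concrete, a back-of-the-envelope sketch of such a download manager, pinning each input file via a checksum manifest; all names here are hypothetical, not Buffy's or BioCypher's API:

```python
import hashlib
import json
import urllib.request
from pathlib import Path

MANIFEST = Path("downloads/manifest.json")

def fetch(url: str, fname: str) -> Path:
    """Download `url` once; on later calls, verify the file against the recorded checksum."""
    dest = MANIFEST.parent / fname
    manifest = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, dest)
    digest = hashlib.sha256(dest.read_bytes()).hexdigest()
    # setdefault records the digest on first download and compares afterwards.
    if manifest.setdefault(fname, digest) != digest:
        raise RuntimeError(f"{fname} changed upstream; reproducibility is broken")
    MANIFEST.write_text(json.dumps(manifest, indent=2))
    return dest
```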

If we continue with Buffy, we could use it to manage not only the primary downloads but also @grst's original objective of managing the database for each project (this could even be integrated with what was proposed in #5 to make the data available in an isolated environment). The flat files could live in the repo directly (BioCypher DB files are often small) or, for bigger datasets, somewhere on Zenodo or similar.

@motey it is great to hear that you are considering extending Buffy's features. This repo is meant as an organising repo for a hackathon we are planning for the end of April (26th to 28th), so we could also mobilise some programming resources to make that happen and integrate with the software packages of our labs and the scverse. If you have time, it would also be nice to have you at the hackathon (in Heidelberg)!

grst commented 1 year ago

Sounds good. Also linking pooch (https://www.fatiando.org/pooch/), which we previously considered for managing example datasets.
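For context, pooch pins a download to a known hash and keeps it in a local cache, which covers much of the "same file in two years" requirement; a minimal sketch with placeholder URL and hash:

```python
import pooch

# Placeholder URL and hash; pooch rejects the file if the upstream
# content no longer matches known_hash.
fpath = pooch.retrieve(
    url="https://example.org/gene_id_mapping.csv",
    known_hash="sha256:0000000000000000000000000000000000000000000000000000000000000000",
    path=".cache",
)
```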

slobentanzer commented 1 year ago

pooch looks good; I need to look into it.

motey commented 1 year ago

> If you have time, it would also be nice to have you at the hackathon (in Heidelberg)!

Thanks for the invitation, but I'll probably be on vacation around that time.

slobentanzer commented 1 year ago

what better to do on a holiday than to sit in a stuffy room with 30 nerds for 2 days? 🥲

jk, enjoy :)