terraref / computing-pipeline

Pipeline to Extract Plant Phenotypes from Reference Data
BSD 3-Clause "New" or "Revised" License

Protocol for reproducible research and archiving of analyses #344

Open craig-willis opened 7 years ago

craig-willis commented 7 years ago

Question from https://github.com/terraref/computing-pipeline/issues/304:

What would be an appropriate protocol for reproducible research / archiving analyses?

I think it should be sufficient for now if we retain home directories from deleted containers and encourage people to use GitHub repositories for the code that they write. Isn't reproducibility planned as a core NDS Workbench feature?

See also:

craig-willis commented 7 years ago

I've opened a new issue in Workbench to discuss this. It isn't a current priority, but certainly a good topic.

For now, as with the extractors in TERRA-REF, GitHub is essential. We already retain home directories and should make sure we have backups. I don't think that archiving home directories is the right approach. Taking archiving in its strictest sense (i.e., library preservation) -- old containers can be stored, but there's currently no strategy that I'm aware of for archiving them (for example, what Zenodo does for GitHub).

We should also consider published versus unpublished artifacts. If I come into Workbench and do some development or analysis, is there a reason to preserve or archive the output of my work if I have no intention of publishing it? If I develop a novel algorithm or analysis that we want to be part of the TERRA project, then there should be a way for me to publish my work or otherwise make it available.

robkooper commented 7 years ago

Another option is to copy just the new files from the container. You can use docker diff <container> to get a list of changed files, then loop over them with docker cp -a <container>:<path> outputfolder/<path> and tar up the result as provenance.
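A minimal sketch of that diff/cp/tar loop in Python, assuming the docker CLI is on the PATH of whatever process runs it. The function names, the output folder, and the archive name are all hypothetical. One assumption worth flagging: docker diff also lists parent directories as changed (C) when something inside them changes, so the sketch copies only the leaf paths to avoid pulling in whole unchanged directories.

```python
import subprocess
import tarfile
from pathlib import Path, PurePosixPath


def parse_docker_diff(output):
    """Return paths marked added (A) or changed (C); deletions (D) are skipped."""
    paths = []
    for line in output.splitlines():
        kind, _, path = line.strip().partition(" ")
        if kind in ("A", "C") and path:
            paths.append(path)
    return paths


def leaf_paths(paths):
    """Drop any path that is an ancestor directory of another listed path."""
    parents = set()
    for p in paths:
        parents.update(str(a) for a in PurePosixPath(p).parents)
    return [p for p in paths if p not in parents]


def export_changes(container, out_dir="outputfolder", archive="provenance.tar.gz"):
    """Copy added/changed files out of a container and tar them up."""
    diff = subprocess.run(
        ["docker", "diff", container],
        check=True, capture_output=True, text=True,
    ).stdout
    paths = leaf_paths(parse_docker_diff(diff))

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for path in paths:
        dest = Path(out_dir) / path.lstrip("/")
        dest.parent.mkdir(parents=True, exist_ok=True)
        # -a (archive mode) preserves uid/gid information on the copied files
        subprocess.run(
            ["docker", "cp", "-a", f"{container}:{path}", str(dest)],
            check=True,
        )

    with tarfile.open(archive, "w:gz") as tar:
        tar.add(out_dir, arcname=".")
    return paths
```

Calling export_changes("<container>") would leave the copied tree in outputfolder/ and a provenance.tar.gz next to it. Deleted files are not recorded here; if they matter for provenance, the raw docker diff output could be saved alongside the archive.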

craig-willis commented 7 years ago

Thanks, Rob. Of course, this will need to be implemented as a feature of Workbench since the user doesn't have direct access to Docker on the system. Whether diff/cp or commit/push, we'd need to add this feature.

robkooper commented 7 years ago

One thing, though: we need to make sure the user knows this is going to happen, so they don't put any private info (such as private keys or passwords) in the container.
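Beyond telling users up front, the export step could also flag paths that look like credentials before anything is archived. The sketch below is only an illustration: the pattern lists and function names are hypothetical, a name-based check will miss secrets stored in ordinary files, and it is no substitute for warning the user.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical, deliberately incomplete lists of names that often hold secrets.
SENSITIVE_DIRS = (".ssh", ".gnupg", ".aws")
SENSITIVE_PATTERNS = ("id_rsa*", "id_ed25519*", "*.pem", "*.key", ".env", ".netrc")


def looks_sensitive(path):
    """True if the path sits under a sensitive directory or matches a pattern."""
    p = PurePosixPath(path)
    if any(part in SENSITIVE_DIRS for part in p.parts):
        return True
    return any(fnmatchcase(p.name, pattern) for pattern in SENSITIVE_PATTERNS)


def split_sensitive(paths):
    """Split paths into (safe, flagged) so flagged ones can be skipped or reported."""
    flagged = [p for p in paths if looks_sensitive(p)]
    safe = [p for p in paths if not looks_sensitive(p)]
    return safe, flagged
```

The flagged list could be shown to the user before the archive is written, so they can confirm or remove those files.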