jacobtnyoung / reproducible-research


Email from Jesse with details #2

Open jacobtnyoung opened 12 months ago

jacobtnyoung commented 12 months ago

It looks great!

Definitely not necessary for tomorrow, but the two missing pieces in your reproducible workflow are (1) data versioning, or data "provenance," and (2) software versioning.

GitHub allows you to version your code, which means you have a history of changes. You can avoid the problem Shapiro encountered: multiple people were making changes to code in the project folder, but with no history of those changes they could not reproduce their own results from the version of the paper they had sent out for review.

https://web.stanford.edu/~gentzkow/research/CodeAndData.pdf

The one requirement for versioning to actually work is that the results carry a time stamp and that you can roll back both data and code to that exact point in time. You may need a rule about making all data-preparation commits before starting any analysis, and you need to know how to roll a repo back to a specific point in time (a sketch is below). But in theory it's all possible with GitHub.
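As a rough sketch, one way to do that rollback with plain git (the tag name and date here are hypothetical, just to show the idea):

```bash
# Tag the repo at the moment the results in the paper are generated,
# so the state of code and data is time-stamped
git tag results-2024-05-01

# Later, find the last commit before the submission date
git log --before="2024-05-01" -1 --format="%h %ci %s"

# Check out the repo (code plus committed data) exactly as it existed then
git checkout results-2024-05-01   # or use the commit hash from git log

# Return to the current version when finished
git checkout main
```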

I would not burden your audience with all of the nuance, but it is important to note that merely using GitHub does not ensure that the workflow is reproducible. It actually has to be used correctly!


The software versioning issue is less about reproducibility and more about maintainability. If you try to run the code 5 years from now and it requires several R packages, there is a HIGH likelihood that at least one of those packages has changed in the meantime - a function was renamed or an argument behaves differently. As a result your code breaks and you have to spend hours debugging to get it working again. This also makes open science repositories less useful, since people can't easily run the code.

The renv package allows you to create a package snapshot so that whoever runs your code uses the same versions of the software, and thus the versions whose function names and arguments match your code. Adding renv to the workflow helps ensure the code remains reproducible over time (a minimal sketch is below).
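In practice the workflow looks roughly like this (assuming an R project at the root of the repository):

```r
# One-time setup: initialize renv in the project
# install.packages("renv")   # if renv is not already installed
renv::init()

# After installing or updating packages, record the exact versions in renv.lock
renv::snapshot()

# Anyone re-running the analysis later restores those exact versions
renv::restore()
```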

And data provenance is more about making the data acquisition process fully replicable. This is less important in the social sciences - we typically have the raw version of the data on our laptop and it does not change. But once the data are too large to store on a single machine, or are produced by algorithms that generate aggregated tables to protect privacy, being able to retrieve the same data again becomes important. The FAIR data standards address those issues: https://royalsocietypublishing.org/doi/10.1098/rsta.2021.0300

But also, way beyond the scope of your workshop tomorrow.