This repository is a template for a research project that uses a Quarto project and a reproducible research workflow geared toward open science. It is a good starting point for anyone using R as their primary language for conducting research, but it could be retooled for other languages supported by Quarto, such as Python and Julia.
To see an example of this template in action, go here.
To start an actual research project with this template:
- Rename the `research-template.Rproj` file in the local repository to the name of your project.
- Open the `_quarto.yml` file and change the project title and author information. You can add multiple authors here if you like.
- Open `utils/check_packages.R` and add or remove any packages based on what your project requires. Source this file to make sure you have all the dependencies installed.
- Render the project and check the `_products` directory for the output.
- Put your raw data in `data/data_raw` and start coding! You can also learn more about how to use this workflow below.

You may also want to update the extensions that are included with the project when you start it. To do this, type into the Terminal:
quarto update AaronGullickson/aog-article-quarto
quarto update AaronGullickson/submittable-quarto
This workflow assumes that a research project can generally be divided into three distinct phases. These phases are:
Importantly, this workflow should be iterative: earlier phases are revisited as the project is extended or expanded. For example, the researchers may initially use only a few variables and/or observations from the raw data to quickly get a skeleton analysis together with the key variables and models. From this point, they may then go back to the first phase to add variables and observations, deal with missing values, and so on. Good programming practices in setting up the structure of the workflow will make this iterative process easier later in the project.
In a reproducible research workflow, we use scripting and reproducible reports to reduce the potential for versioning and transcription errors at each phase of the workflow. This workflow is visualized in the flowchart below.
flowchart LR
A((Raw Data)):::real --> B([Data Organization\nScripts]):::real
B([Data Organization\nScripts]):::real --> C((Analytical\nData)):::artifact
C((Analytical\nData)):::artifact --> D([Analysis Scripts]):::real
C((Analytical\nData)):::artifact --> E([Reproducible Reports]):::real
D([Analysis Scripts]):::real --> F[Intermediate Output]:::artifact
E([Reproducible Reports]):::real --> G[Final Products]:::artifact
D([Analysis Scripts]):::real --> E([Reproducible Reports]):::real
classDef real fill:green,color:#fff
classDef artifact fill:yellow,color:#000
Everything shown above in green is the real part of your workflow. This includes the raw data, the scripts, and the reproducible reports. The remaining yellow parts are artifacts and should be reproducible at any point from the real part of your workflow. You should feel comfortable deleting the yellow parts at any time because they can be reproduced from the green parts. In fact, you should make it a regular practice to delete all the yellow parts and rerun your analysis from the start to ensure that stale artifacts are not affecting your results.
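As a rough illustration of the "delete the yellow parts" habit, here is a minimal base R sketch. The helper name `clear_artifacts` is hypothetical and not part of the template. It assumes that artifacts live only in the `_products` and `data/data_constructed` directories, and that nothing you want to keep (other than hidden files such as `.gitignore`) is stored there.

```r
# Hypothetical helper (not part of the template): empty the artifact
# directories so that the next render starts from the raw data and scripts.
# Assumes artifacts live only in `_products` and `data/data_constructed`.
clear_artifacts <- function(project_root = ".",
                            artifact_dirs = c("_products",
                                              file.path("data", "data_constructed"))) {
  for (dir_path in file.path(project_root, artifact_dirs)) {
    # Skip directories that do not exist yet.
    if (!dir.exists(dir_path)) next
    # Remove the visible contents but keep the directory itself.
    # Hidden files (names starting with a dot) are left alone.
    contents <- list.files(dir_path, full.names = TRUE)
    unlink(contents, recursive = TRUE)
  }
  invisible(NULL)
}
```

Calling `clear_artifacts()` from the project root and then re-rendering the project is one way to confirm that every artifact can still be rebuilt from the raw data and scripts.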
This workflow follows the general principles outlined above. Generally the research process will proceed as follows:
- Place raw data in `data/data_raw`. When I am using multiple data sources, I often place these in sub-directories. I also usually document each data source in the README within the `data_raw` directory. Only raw data should be kept in this directory.
- Use the `analysis/organize_data.qmd` file to construct analytical data from the raw data. I prefer to do this in a Quarto document rather than a plain R script because I then get what is essentially a research log (in HTML format) in which I can keep track of all the tests that confirm my data cleaning worked as intended. Typically, you will save the final analytical data into an RData file. This file (or files) should always be placed in `data/data_constructed`.
- Use `analysis/analysis.qmd` to conduct the analysis. This will produce a lab notebook in HTML format. Sometimes, when models or other steps take a long time to run, I save some of the output as RData files and place it in the same `data/data_constructed` folder.
- Use the `paper` directory and/or the `presentation` directory to create papers and presentations, respectively. I will often pull code chunks from `analysis.qmd` into these files as a starting point for final tables and figures. By default, I use my own custom template to produce manuscript PDF files, but you can switch this to another template of your choice. The `bibliography` directory contains a BibTeX file for the project that can be exported from other software or built from within the Quarto documents themselves. You can also place your preferred CSL files here.

The output of all rendered Quarto documents will be placed in the `_products` directory. This directory, along with `data/data_constructed`, contains the artifacts from the workflow.
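The hand-off between the data organization and analysis steps can be sketched as follows. This is only an illustration: the file name `analytical_data.RData`, the object name `analytical_data`, and the toy data frame are all made up. So that the sketch runs anywhere, it uses a temporary directory as a stand-in for the project root. Inside the project you would build the same paths with `here("data", "data_constructed", ...)`.

```r
# Stand-in for the project root so this sketch runs outside the project.
# In the real project, replace file.path(project_root, ...) with here(...).
project_root <- file.path(tempdir(), "example_project")
constructed_dir <- file.path(project_root, "data", "data_constructed")
dir.create(constructed_dir, recursive = TRUE, showWarnings = FALSE)

# --- End of organize_data.qmd: save the cleaned analytical data ---
# Made-up toy data standing in for the result of your data cleaning.
analytical_data <- data.frame(id = 1:3, income = c(42000, 55000, 61000))
save(analytical_data,
     file = file.path(constructed_dir, "analytical_data.RData"))

# --- Start of analysis.qmd: load the analytical data ---
rm(analytical_data)  # simulate a fresh session
load(file.path(constructed_dir, "analytical_data.RData"))
summary(analytical_data$income)
```

Because the RData file is an artifact, it can be deleted at any time and rebuilt by re-rendering the data organization document.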
In everyday practice, individual files will be rendered separately, but the entire project can also be rendered. The user should periodically do this because it will automatically delete any prior artifacts and start from scratch, ensuring that all rendered output is up to date with the most current iteration of the scripts. The project can be rendered from the Build tab in the upper right panel of RStudio, or from the command line in the base project directory:
quarto render
All Quarto files will pull author information from the main `_quarto.yml` file. The author information there should be changed to reflect the authors of the project. Multiple authors can be listed.
For big projects, a single `organize_data.qmd` and a single `analysis.qmd` file may not be sufficient. You can split each of those files into multiple files of the same type. If you do, you will need to update the `render` section of the `_quarto.yml` file to specify these files and the order in which they should be run. For example, if you split `organize_data.qmd` into `organize_data_source1.qmd` and `organize_data_source2.qmd`, you would change that section to read:
render:
# if analysis is split into multiple docs, add them here
- analysis/organize_data_source1.qmd
- analysis/organize_data_source2.qmd
- analysis/analysis.qmd
- paper/
- presentation/
- "!bibliography/"
To ensure package dependencies are properly specified and that global functions are loaded, any new R script or Quarto document should always start with:
library(here)
source(here("utils","check_packages.R"))
source(here("utils","functions.R"))
Aside from the `here` library, no direct `library()` calls should be written into R scripts or Quarto documents. Instead, every script should source the `utils/check_packages.R` script, which checks for package dependencies and installs any needed packages. Users can add or remove packages from the list specified in that file. Anyone using the project can source this file to both load the dependencies and ensure they are up to date.
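The contents of `utils/check_packages.R` are not reproduced here, so the following is only a minimal base R sketch of the general "check, install, load" pattern such a script might follow. The package list and the helper name `missing_packages` are hypothetical. The list uses only packages that ship with R so that the sketch runs without installing anything.

```r
# Hypothetical package list; in a real project this would name the
# packages your analysis depends on.
required_packages <- c("stats", "utils")

# Return the subset of `pkgs` that is not currently installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install anything that is missing, then attach every required package.
to_install <- missing_packages(required_packages)
if (length(to_install) > 0) {
  install.packages(to_install)
}
invisible(lapply(required_packages, library, character.only = TRUE))
```

Keeping the package list in one place means a collaborator only needs to source one file to get a working environment, rather than hunting for `library()` calls scattered across documents.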
To create custom functions that will be accessible to all scripts in the project, users should define them in the `utils/functions.R` script. This script is then sourced into all other scripts.
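As an illustration, a function defined in `utils/functions.R` might look like the sketch below. The function `standardize` is a made-up example, not something the template provides.

```r
# Hypothetical example of a project-wide helper that could live in
# utils/functions.R: convert a numeric vector to z-scores.
standardize <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Any script that sources utils/functions.R can then call it directly.
standardize(c(1, 2, 3))
```

Defining helpers once and sourcing them everywhere avoids copies of the same function drifting apart across the analysis, paper, and presentation documents.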