morinlab / guidelines

A collection of curated guidelines to enable reproducibility in computational biology

Project structure #5

Open BrunoGrandePhD opened 8 years ago

BrunoGrandePhD commented 8 years ago

Notes

Here are some notes from our discussion on project structure. It might make more sense to split this up into separate guidelines.

Raw data versus derived data

As the name implies, raw data hasn't been altered in any way from the form in which it was received (usually what comes off an instrument). Examples include FASTQ files from a sequencer or the untidy metadata Excel spreadsheet received from a collaborator. Raw data should be read-only and should never be modified or deleted.
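One way to enforce the read-only rule is to strip write permission from every raw file once it has been copied into place. The following is a minimal sketch, assuming POSIX-style file permissions; the function name `make_read_only` is hypothetical and not part of any existing tool.

```python
import os
import stat
from pathlib import Path


def make_read_only(root):
    """Remove write permission (user, group, other) from every file under root.

    Returns the list of files whose permissions were updated.
    Assumes POSIX-style permission bits.
    """
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    changed = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            mode = stat.S_IMODE(path.stat().st_mode)
            # Keep the read/execute bits and clear only the write bits.
            os.chmod(path, mode & ~write_bits)
            changed.append(path)
    return changed
```

Calling `make_read_only("data")` from the project root would lock down everything under the raw data directory. This only guards against accidental edits; anyone who owns the files can still restore write permission.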

Derived data is anything produced by processing or reformatting the raw data. Examples include BAM files generated from FASTQ files or tidied metadata Excel spreadsheets. Derived data should be completely disposable, because it can be regenerated from scratch from the raw data.

In practice though, keeping both FASTQ and BAM files for large sequencing projects is cost-prohibitive. A reasonable compromise is to only keep the BAM files, because they represent a superset of the information included in the FASTQ files (read names, pairings, sequences and qualities). This only holds if the BAM files retain every read, including unaligned ones, and no reads have been filtered out or hard-clipped. In this scenario, the BAM files become the "raw data" for the project and should be made read-only. The commands used to generate them from FASTQ files should still be noted.
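Noting the commands can be automated by running them through a small wrapper that logs each one. The following is a minimal sketch, assuming the command is given as a list of arguments; `run_and_record` and the log file layout are hypothetical.

```python
import shlex
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def run_and_record(command, log_path):
    """Run command (a list of arguments) and append it to log_path.

    The command is only recorded if it exits successfully, because
    check=True raises CalledProcessError on a non-zero exit status.
    Returns the log entry that was written.
    """
    subprocess.run(command, check=True)
    timestamp = datetime.now(timezone.utc).isoformat()
    # shlex.join quotes the arguments so the line can be pasted into a shell.
    entry = f"{timestamp}\t{shlex.join(command)}\n"
    with Path(log_path).open("a") as log:
        log.write(entry)
    return entry
```

The log file could live next to the BAM files, so that the record of how they were produced travels with them.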

Source: http://kbroman.org/steps2rr/pages/organize.html

Storing commands alongside the results they generate or separately

The question here is where to store the commands that you run to generate your results: alongside the results they generate, or separately in a scripts directory. See #18 for why you should store your commands in the first place.

Storing commands alongside your results makes it easy for newcomers to figure out how the results were generated. However, if you use the same command to generate different results, then you might have to duplicate that command, which violates the DRY (Don't Repeat Yourself) principle. A workaround is to keep any command used more than once in a common script stored separately from your results, and to have the scripts that sit alongside your results call these common scripts.
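The workaround could look something like the following: a thin script kept alongside a set of results that does nothing but call a common script with the arguments specific to those results. This is a minimal sketch that assumes the common scripts are Python files; the helper names and the `scripts_dir` argument are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def shared_script_command(scripts_dir, script_name, *args):
    """Build the command line that runs a common script with the given arguments."""
    script = Path(scripts_dir) / script_name
    return [sys.executable, str(script), *(str(arg) for arg in args)]


def run_shared_script(scripts_dir, script_name, *args):
    """Run a common script; raises CalledProcessError if it fails."""
    subprocess.run(shared_script_command(scripts_dir, script_name, *args), check=True)
```

Each results directory would then hold only a short call such as `run_shared_script(scripts_dir, "tabulate.py", input_path, output_path)`, where `tabulate.py` and the two paths are placeholders. The logic lives in one place, and the results-side script still documents exactly which inputs produced which outputs.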

One argument against storing commands alongside your results is that your results directory should ideally be completely disposable and you should be able to regenerate it from scratch. For this to be possible, your scripts should reside outside of your results directory.
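To make "disposable" concrete, here is a minimal sketch of regenerating a results directory from scratch. It assumes each analysis step can be expressed as a Python callable that takes the results directory; `regenerate` and the step functions are hypothetical.

```python
import shutil
from pathlib import Path


def regenerate(results_dir, steps):
    """Delete results_dir, recreate it empty, then run each step in order.

    steps is a list of callables, each taking the results directory (a Path).
    Anything in results_dir that is not produced by a step is lost, which is
    why the scripts themselves must live somewhere else.
    """
    results_dir = Path(results_dir)
    if results_dir.exists():
        shutil.rmtree(results_dir)
    results_dir.mkdir(parents=True)
    for step in steps:
        step(results_dir)
    return results_dir
```

If running this ever destroys something you cannot recreate, that file was either raw data or a script, and it belonged outside the results directory.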

Another benefit of consolidating your scripts in a separate directory is that it makes it easier to put them under version control. See #16 for why you should use version control.

Source: http://www.jonzelner.net/statistics/make/reproducibility/2016/06/01/makefiles/

Example project structure

data/
    exome_bams_grch38/
    rnaseq_bams_grch38/
figures/
    create_figures.R
    images/
        figure1.png
        figure2.png
reference/
    genome.fa
    genome.fa.fai
results/
    strelka_analysis/
        1-strelka/
        2-tabulate/
        3-augment_maf/
software/
    os1/
    os2/
    scripts/
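
The directory skeleton above can be created in one go for a new project. The following is a minimal sketch; `create_skeleton` is a hypothetical helper, and it assumes that `scripts/` is nested under `software/` as the listing's indentation suggests. Files such as `genome.fa` and `create_figures.R` are not created.

```python
from pathlib import Path

# Directories taken from the example listing. Placing scripts/ under
# software/ follows the indentation shown there and is an assumption.
PROJECT_DIRS = [
    "data/exome_bams_grch38",
    "data/rnaseq_bams_grch38",
    "figures/images",
    "reference",
    "results/strelka_analysis/1-strelka",
    "results/strelka_analysis/2-tabulate",
    "results/strelka_analysis/3-augment_maf",
    "software/os1",
    "software/os2",
    "software/scripts",
]


def create_skeleton(root):
    """Create the example directory tree under root and return the created paths.

    Existing directories are left untouched, so the call is safe to repeat.
    """
    created = []
    for relative in PROJECT_DIRS:
        path = Path(root) / relative
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Numbering the step directories under `results/` (`1-strelka`, `2-tabulate`, and so on) keeps them sorted in the order they were run.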