brew, apt-get, yum, pip, R, etc.

@sjackman Any thoughts on how to handle large files that are unfit for version control? For instance, a few weeks ago, I had a 120-MB file that I couldn't commit to GitHub (the limit is 100 MB per file). Git LFS and git-annex would be logical solutions to this problem, but I've also read (sorry, no links offhand) that they pose some problems. Notably, Git LFS disables forking on a repository, and cloning a repository becomes more complex for new users who aren't familiar with Git LFS.
"Git LFS disables forking on a repository"
Ugh. That's terrible. I didn't know that about LFS. My only suggestion is to archive the data online somewhere appropriate, and to include the URL and SHA-256 checksum in the build script.
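As a rough illustration, here is a minimal sketch in R of that download-and-verify step. Everything in it (the URL, file path, checksum value, and the use of the digest package) is a placeholder assumption, not something prescribed in this thread:

    # Fetch the archived data and verify its integrity before any analysis.
    # The URL, destination path, and checksum below are hypothetical.
    library(digest)  # install.packages("digest") if it isn't available

    data_url        <- "https://example.org/archive/big_file.tsv"
    data_file       <- "data/big_file.tsv"
    expected_sha256 <- "0000000000000000000000000000000000000000000000000000000000000000"

    if (!file.exists(data_file)) {
      dir.create(dirname(data_file), showWarnings = FALSE, recursive = TRUE)
      download.file(data_url, destfile = data_file, mode = "wb")
    }

    # Recompute the SHA-256 of the file and compare it to the recorded value.
    actual_sha256 <- digest(data_file, algo = "sha256", file = TRUE)
    if (!identical(actual_sha256, expected_sha256)) {
      stop("Checksum mismatch for ", data_file, "; the download may be corrupt.")
    }

Recording the checksum alongside the URL means anyone re-running the analysis gets an immediate, loud failure if the archived file ever changes or the download is truncated.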
Description
Do you like the idea of reproducible science, but don't know where to start? Have you been told that you should use Git, but never got around to learning it? Join this workshop to learn about techniques (e.g., R projects, Git, and R Markdown) that will help make your science more reproducible. We will start a simple data analysis project from scratch and build it up while following best practices for reproducibility.
Time and Place
Where: Simon Fraser University, Burnaby Campus, Library Research Commons
When: Delayed until fall (originally Monday, August 15th, 10:30 am-12:30 pm)
Note: This is a two-hour workshop.
Registration
REGISTER HERE
Call for Tips/Links
They say the best way to learn something is to teach it. I'll be doing my own research, but I still welcome any feedback on the topic. Feel free to comment below with tips or links on best practices for reproducible data analysis. Thanks in advance!