joachim-gassen / sposm

An Open Science Course on Statistical Programming
MIT License

Statistical Programming and Open Science Methods Course

Welcome!

This is the repository of a statistical programming and open science course that we offered in 2019/2020 under the research program of the TRR 266 "Accounting for Transparency".

It communicates how to conduct data-based research so that others can contribute and collaborate. This involves making your research data and methods FAIR (findable, accessible, interoperable and reusable) and your results reproducible.

After completing this course, participants will

Course format

The course consists of two block sessions of two days each, with online assignments and group work in between. Students are free to prepare their assignments using a statistical programming language of their choice.

While the course is designed as a blended learning event, it might also be useful for self-study. If you just want to have a quick look, check the slides in slides_pdf. For a deep dive, do the following:

Prerequisites

Intermediate skills in statistics and knowledge of a statistical programming language (e.g., Python, R or Stata) are required. We will mostly work with R during the seminar, but students are free to use other languages for their assignments if they prefer. Students who are not familiar with R are strongly encouraged to work through the opening chapters of R for Data Science by Garrett Grolemund and Hadley Wickham prior to attending the course.

About the repository

This repository follows a "fork and pull request" workflow. Only I can commit to the repository directly. You can and should fork your own versions of this repository, make changes by committing to your repository and then issue a pull request if you think that your changes should be included in this repository.
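As a rough sketch of the contributor side of this workflow, the shell function below puts a change on its own branch in your local clone of the fork. The function name `prepare_contribution`, the branch name and the file name in the usage note are hypothetical and not part of the repository; only the git commands themselves are standard.

```shell
# Sketch only: prepare_contribution is a hypothetical helper, not part of sposm.
# Usage: prepare_contribution BRANCH "commit message" FILE...
prepare_contribution() {
  branch="$1"
  message="$2"
  shift 2
  git checkout -b "$branch" &&   # create and switch to a topic branch
    git add -- "$@" &&           # stage only the files you name
    git commit -m "$message"     # record the change in your local repository
}
```

For example, `prepare_contribution fix-typo "Fix typo" README.md` followed by `git push -u origin fix-typo` publishes the branch to your fork. The pull request itself is then opened on GitHub.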

The directory structure might grow over time. Currently we have ...

... and some directories that will store output from our coding adventures. Do not commit anything to these directories.

Setting up the environment: The local way

NOTE: This step requires you to make substantial changes to your computing environment by installing various additional software. If you do not like that idea, consider using docker instead. This is generally a good idea as we will be using docker anyhow to build portable development environments and replication kits. See below.

You need the following to run the code of the repository (Installation links for Windows in brackets)

After installing these programs, fork this repository on GitHub.

Once you have this up and running, start RStudio. Create a new project ("File -> New Project -> Version Control"). Provide the link to your forked repository and choose a local directory that will receive the cloned repository.
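If you prefer the command line over the RStudio dialog, the clone step could look roughly like the sketch below. The helper names `fork_url` and `clone_fork` are made up for illustration, and YOURACCOUNT stands for your GitHub user name; the URL pattern assumes that your fork kept the name sposm.

```shell
# Sketch only: fork_url and clone_fork are hypothetical helpers.
# Build the HTTPS URL of a fork, assuming it is still named "sposm".
fork_url() {
  echo "https://github.com/$1/sposm.git"
}

# Clone the fork into a local directory (default: ./sposm).
clone_fork() {
  git clone "$(fork_url "$1")" "${2:-sposm}"
}

# Example call (requires network access):
#   clone_fork YOURACCOUNT ~/projects/sposm
```

The cloned directory can then be opened in RStudio as an existing directory project.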

After cloning your fork, you will have to install several packages in R that the code relies on. Run the following in the R console (lower left corner).

install.packages(
  c('tidyverse', 'devtools', 'rmarkdown', 'kableExtra',
    'ExPanDaR', 'ggmap', 'tidyr', 'tufte', 'showtext', 'cowplot', 'DiagrammeR',
    'leaflet', 'widgetframe', 'zipcode', 'shiny', 'shinyjs', 'grid', 'gridExtra',
    'ggwordcloud', 'tm', 'qrcode'),
  repos = c(CRAN = 'https://mran.microsoft.com/snapshot/2019-09-25')
)

devtools::install_github('bergant/datamodelr')
devtools::install_github('wmurphyrd/fiftystater')
devtools::install_github('joachim-gassen/rdfanalysis')
webshot::install_phantomjs()

Continue with "Produce all Output" below.

Setting up the environment: The docker way

First, you need to install docker. If you have a recent version of macOS or Windows 10 Professional/Enterprise installed, see https://docs.docker.com/get-started/. Read the instructions for your operating system. They are important.

If you happen to have an older or less expensive version of Windows, then docker toolbox is your choice: https://docs.docker.com/toolbox/. Read the instructions for your operating system. They are important.

After installing docker, verify that it is running by opening a shell/terminal and issuing the command docker (in the black toolbox window if you run docker toolbox). You should see a help text. If you do, change to the project directory docker and follow the instructions in the Dockerfile.
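The check described above can also be scripted. The following is a minimal sketch, assuming a POSIX-style shell; `check_docker` is a hypothetical helper name, and `docker info` is used here to test whether the docker daemon answers, which goes slightly beyond looking at the help text.

```shell
# Sketch only: check_docker is a hypothetical helper.
# Report whether the docker client is installed and the daemon is reachable.
check_docker() {
  if ! command -v docker >/dev/null 2>&1; then
    echo "docker not found on PATH"
    return 1
  fi
  if ! docker info >/dev/null 2>&1; then
    echo "docker is installed but the daemon is not reachable"
    return 2
  fi
  echo "docker is up and running"
}

# Example call:
#   check_docker
```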

Once you have logged into the docker instance on the "web page" running RStudio, create a new project within the RStudio instance running in your container ("File -> New Project -> Existing Directory"). Select the sposm directory.

Set your forked repository as the main remote (origin) in git:

git remote remove origin
git remote add origin https://github.com/YOURACCOUNT/sposm.git
git remote -v

The last command verifies that the new remote points to your forked repository.
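As an alternative to removing and re-adding the remote, `git remote set-url` changes the URL of an existing remote in one step. The sketch below wraps this in a hypothetical helper `point_origin_to_fork`; YOURACCOUNT is again a placeholder for your GitHub user name.

```shell
# Sketch only: point_origin_to_fork is a hypothetical helper.
# Re-point the existing remote "origin" to your fork and print the result.
point_origin_to_fork() {
  git remote set-url origin "https://github.com/$1/sposm.git" &&
    git remote get-url origin   # print the new URL as a check
}

# Example call:
#   point_origin_to_fork YOURACCOUNT
```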

Produce all Output (data and slides)

Now you have your sposm project initialized. Test whether you can run all code to produce the data, the slides and the link list in the data directory. In the Terminal (lower left corner) run:

make all
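If you would like to see what the build would do before it runs, make offers a dry-run mode via the `-n` option. A small sketch, assuming that you are in the project root; `preview_build` is a hypothetical helper name.

```shell
# Sketch only: preview_build is a hypothetical helper.
# Print the commands that "make <target>" would run without executing them.
preview_build() {
  make -n "${1:-all}"
}

# Example call:
#   preview_build        # same as: make -n all
```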

Setting up git to allow syncing your fork with the main repository

To add the main repository as an additional repository with the name upstream, open the terminal (lower left corner) and add the remote.

git remote add upstream https://github.com/joachim-gassen/sposm.git
git remote -v
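Before syncing, it can help to confirm that both remotes are in place. A minimal sketch with a hypothetical helper `check_remotes`, which reports whether each of the two remote names used in this README is configured:

```shell
# Sketch only: check_remotes is a hypothetical helper.
# Report whether the remotes "origin" and "upstream" are configured.
check_remotes() {
  status=0
  for name in origin upstream; do
    if url=$(git remote get-url "$name" 2>/dev/null); then
      echo "$name -> $url"
    else
      echo "$name is missing"
      status=1
    fi
  done
  return "$status"
}
```

The function returns a non-zero exit status if either remote is missing, so it can be used as a guard in a script.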

To keep your fork in sync with the main repository, you need to follow the strategy explained here: https://help.github.com/en/articles/syncing-a-fork. Additional helpful info can be found here: https://gist.github.com/CristinaSolana/1885435. If this is not sufficient to update your forked repository on GitHub, have a look here: https://stackoverflow.com/questions/7244321/how-do-i-update-a-github-forked-repository

In a nutshell, the workflow below assumes that your forked remote repository is set up as origin, the main repository as upstream, and that you are working in your local repository directory:

# make the changes from the main repository available locally

git fetch upstream

# Switch to your local main branch

git checkout master

# Use only one of the two below. See slide deck 2 for the difference between
# merging and rebasing. Normally, rebasing is only needed when your repo contains
# changes that you want to submit as a pull request afterwards.

# Alternative A: Merge the changes from upstream

git merge upstream/master

# Alternative B: rebase your local branch on upstream

git rebase upstream/master

# If this has worked out, make sure to push your changes to your own remote.
# The --force-with-lease option is only needed when you are rebasing.

git push --force-with-lease origin master
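To judge whether a sync is needed at all, you can count the commits that upstream has and your local branch lacks. A sketch, assuming the remote and branch names used above and that `git fetch upstream` has already been run; `commits_behind_upstream` is a hypothetical helper name.

```shell
# Sketch only: commits_behind_upstream is a hypothetical helper.
# Count commits reachable from upstream/<branch> but not from local <branch>.
commits_behind_upstream() {
  branch="${1:-master}"
  git rev-list --count "$branch..upstream/$branch"
}

# Example call (after "git fetch upstream"):
#   commits_behind_upstream
```

A result of 0 means that your local branch already contains everything from upstream.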

Tips and Tricks

If knitting fails...

If you are having issues knitting the slides, this might be because datamodelr is not yet available on CRAN. To install datamodelr from GitHub, run:

install.packages("devtools") # NOTE: you may have devtools already installed
devtools::install_github("bergant/datamodelr")

If localhost:8787 fails...

When you use docker toolbox, the docker container is not hosted on localhost but instead on a dedicated virtual machine that has a unique IP address. See the Dockerfile for more detail.
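With docker toolbox, the address of that virtual machine can usually be queried with `docker-machine ip`. The sketch below assumes the toolbox's default machine name `default` and the port 8787 mentioned above; `rstudio_url` is a hypothetical helper, and the IP address in the comment is only an example.

```shell
# Sketch only: rstudio_url is a hypothetical helper.
# Build the address to open in the browser from a host name or IP.
rstudio_url() {
  echo "http://$1:${2:-8787}"
}

# Example call with docker toolbox (machine name "default" is an assumption):
#   rstudio_url "$(docker-machine ip default)"
#   # might print something like http://192.168.99.100:8787
```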

Disclaimer

[Image: a meme]