Open Science Utility Belt

bkatiemills commented 9 years ago

Open Science 101

This is a session series introducing practical skills needed to get started in open science. A Mozilla Science Study Group can use this series to introduce open science over an academic semester.

Help Us Develop This Curriculum

We're trying to answer these questions:

What skills are needed to practice open science?
What's missing? What's unnecessary? (We're aiming for fewer than 12 sessions over a semester.)
What's out there now? References! If you've seen related material, send it over.

Let us know your thoughts in the comments!

Sessions

Introduction: what & why
- What skills will be part of this series?
  - working openly through the entire process (not just warehousing things on the web afterwards) in order to leverage collaboration
  - emphasizing legibility of research outputs for the sake of reuse & reproducibility
- Why do these things matter?
  - literature review on citation benefits, efficacy benefits, retraction scandals & efficiency.
- Sources: Working Open guide, TBD
Open Data I: Standards & Legibility
- What is an ontology?
- How can data standards be used effectively to make data legible?
- Sources: TBD
Open Data II: Clean Data
- What are 'clean' and 'dirty' data, and why does the difference matter?
  - how to keep data organized and easy to reuse at a later date (including in-house reuse); consider metadata, storage and formats.
- Best practices for making a reusable dataset when no standard exists.
- Sources: TBD
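One possible illustration for this session, as a minimal sketch only: when no community standard exists, a dataset can be saved in a plain format (CSV) next to a small machine-readable data dictionary that describes each column. The file names, column names, and the `write_dataset` helper below are hypothetical choices made for this example, not an established convention.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_dataset(directory, rows, columns):
    """Write rows to data.csv and column descriptions to data_dictionary.json."""
    directory = Path(directory)
    data_path = directory / "data.csv"
    dict_path = directory / "data_dictionary.json"
    with data_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(columns))
        writer.writeheader()
        writer.writerows(rows)
    # The data dictionary travels with the data, so later users
    # (including the original authors) know what each column means.
    dict_path.write_text(json.dumps(columns, indent=2), encoding="utf-8")
    return data_path, dict_path


# Hypothetical columns: each entry records a description and a unit.
COLUMNS = {
    "site": {"description": "Sampling site identifier", "unit": None},
    "date": {"description": "Sampling date, ISO 8601 (YYYY-MM-DD)", "unit": None},
    "temperature_c": {"description": "Water temperature", "unit": "degrees Celsius"},
}
ROWS = [
    {"site": "A", "date": "2015-06-01", "temperature_c": 12.5},
    {"site": "B", "date": "2015-06-01", "temperature_c": 9.8},
]

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data_path, dict_path = write_dataset(tmp, ROWS, COLUMNS)
        print(data_path.read_text(encoding="utf-8"))
        print(dict_path.read_text(encoding="utf-8"))
```

Keeping the description in a separate JSON file rather than in comment rows inside the CSV means the data file stays trivially parseable by any tool.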
Collaboration I: Version Control
- Basic git, with an emphasis on getting to GitHub as a platform for sharing & collaboration.
- Source: TBD
Collaboration II: Roadmapping
- How to lay out a project for effective collaboration.
- Source: Working Open guide.
Collaboration III: Code Review
- How to set expectations for good contributions that lead to easy-to-review code
- How to make the code review process fast and efficient
- Source: Working Open guide, Code Review Teaching Kit
Code Wrangling I: Sustainable Coding
- Effective use of documentation.
- Producing end-to-end analysis automation scripts (R, Python, Shell, or make); understanding of how a well-made automation script serves as 'living documentation'.
- Sources: TBD
Code Wrangling II: Testing
- Writing test suites to ensure code quality & build trust to support reuse.
- Sources: this lesson in Python, TBD in R.
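To give a flavour of what this session could cover, here is a minimal sketch of a small test suite in Python. The `normalize` function and its test cases are hypothetical, invented for illustration, and are not taken from the lessons linked above. The tests are plain functions with `assert` statements, which is the form that pytest discovers.

```python
def normalize(values):
    """Rescale a list of numbers linearly onto the range [0, 1]."""
    if not values:
        raise ValueError("values must not be empty")
    low, high = min(values), max(values)
    if low == high:
        # A constant series has no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def test_endpoints():
    # The smallest value maps to 0, the largest to 1.
    assert normalize([2, 4, 6]) == [0.0, 0.5, 1.0]


def test_constant_input():
    # Edge case: avoid dividing by zero when all values are equal.
    assert normalize([3, 3, 3]) == [0.0, 0.0, 0.0]


def test_empty_input_is_rejected():
    # Invalid input should fail loudly rather than return nonsense.
    try:
        normalize([])
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty input")
```

Saved as, say, `test_normalize.py`, these would run with `pytest test_normalize.py`. Each test names the behaviour it protects, so a failing test tells a new contributor what they broke.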
Code Wrangling III: Code Packaging
- Making & distributing packages to support reuse & collaboration.
  - discussion of useful formalisms for organizing data & code in packages / repos
- Sources: this lesson in Python, and this lesson in R
Publishing & Communication I: Citation & Discoverability
- Software & data citation
  - DOIs
  - comments on how this addresses discoverability of code & data
- Authoring for the Web
  - Markdown / knitr
  - metadata
- Sources: Working Open guide, TBD
Publishing & Communication II: The Research Cycle
- Strategies for opening the entire research process:
  - Grant process
  - Online lab notebooks
  - blogging, Twitter & social media
  - protocol publishing
  - study pre-registration
Publishing & Communication III: Licensing
- Open access publishing
  - comments on impact on science in the Global South / decoupling access from privilege
- Why are licenses necessary?
- What can they do? What can't they do?
- Which ones are the most important and how do they work?
- How to choose a license, and the intersection of licensing and copyright
- The importance of agreeing on a license explicitly and early in a collaboration
- Sources: TBD
Change Making
- How to champion change in real life?
- What barriers are commonly encountered, and how to avoid them?
- Sources: https://speakerdeck.com/dsalo/changing-workflows

bkatiemills commented 9 years ago

@Blahah ha, it could fill one - but it would be nice to wrap up this course with something that sets people on a course of action with the new things they learned. Do you think there's a useful way to go about this in only one session?

ivanhanigan commented 9 years ago

For the section on 3. Open Data II: Clean Data

how to keep data organized and easy to reuse at a later date (including in-house reuse)

I recommend some 'convention over configuration' advice, and links to evidence-based, recommended filing systems. My faves are:

A folder structure for statistical programmers, recommended in a 2008 book:

http://www.indiana.edu/~jslsoc/web_workflow/wf_home.html
Recently updated in Long, S. (2015). Workflow for Reproducible Results. IV: Managing digital assets Workflow for Tools for your WF. http://txrdc.tamu.edu/documents/WFtxcrdc2014_4-digital.pdf

\ProjectAcronym
    \- History starting YYYY-MM-DD
    \- Hold then delete 
    \Admin
    \Documentation 
    \Posted
         \Paper 1
             \Correspondence 
             \Text
             \Analysis
    \PrePosted 
    \Resources 
    \Write 
    \Work

Simple R analysis

This concept was originally introduced by Josh Reich as the LCFD (load, clean, func, do) framework in a Stack Overflow answer, http://stackoverflow.com/a/1434424, and is encoded in the makeProject R package, http://cran.r-project.org/web/packages/makeProject/makeProject.pdf.

# choose your project dir
setwd("~/projects")   
library(makeProject)
makeProject("makeProjectDemo")

# gives
/makeProjectDemo/
    /code/*.R
    /data/
    /DESCRIPTION
    /main.R

# in main.R you put
source("code/load.R")
source("code/clean.R")
source("code/func.R")
source("code/do.R")
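The load / clean / func / do staging is not specific to R. For study groups teaching in Python, the same idea could look roughly like the sketch below. This is an illustration under assumptions, not part of makeProject or of Josh Reich's answer: the inlined data, the column names, and the helper `mean_by_site` are all hypothetical.

```python
import csv
import io
import statistics

# Hypothetical raw data, inlined so the sketch runs without external files.
RAW_CSV = """site,temperature_c
A,12.5
A,13.1
B,
B,9.8
"""


def load(text):
    """Load: read the raw records without changing them."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Clean: drop incomplete records and convert types."""
    return [
        {"site": r["site"], "temperature_c": float(r["temperature_c"])}
        for r in rows
        if r["temperature_c"].strip()
    ]


def mean_by_site(rows):
    """Func: a reusable helper that the analysis step calls."""
    groups = {}
    for r in rows:
        groups.setdefault(r["site"], []).append(r["temperature_c"])
    return {site: statistics.mean(vals) for site, vals in groups.items()}


def do(rows):
    """Do: run the analysis itself."""
    return mean_by_site(rows)


def main():
    # Reading top to bottom documents the whole analysis, like main.R above.
    results = do(clean(load(RAW_CSV)))
    for site, mean in sorted(results.items()):
        print(f"{site}: {mean:.2f}")
    return results


if __name__ == "__main__":
    main()
```

Because each stage only consumes the output of the previous one, a collaborator can rerun or replace a single stage without having to understand the rest.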

More complicated R framework for data analysis

http://projecttemplate.net

/project/
    /cache/
    /config/
    /data/
    /diagnostics/
    /doc/
    /graphs/
    /lib/
        /helpers.R
    /logs/
    /munge/
    /profiling/
        /01_profile.R
    /reports/
    /src/
        /01_EDA.R
        /02_clean.R
        /03_do.R
    /tests/
        /01_tests.R
    /README
    /TODO

For metadata I like EML

https://github.com/ropensci/EML
Ecological Metadata Language interface for R: synthesis and integration of heterogenous data

bkatiemills commented 9 years ago

Thanks, @ivanhanigan, this is great stuff! We talked about similar things at Study Group Journal Club at UBC the other week - we read this paper and this other paper which touch on related topics - definitely all things to include. Thanks again for the notes!

mozillascience / studyGroupLessons