mozillascience / studyGroupLessons

One-hour introductory lessons on ideas and tools in coding and data wrangling for research.
MIT License

Open Science Utility Belt #7

Open bkatiemills opened 9 years ago

bkatiemills commented 9 years ago

Open Science 101

This is a session series introducing practical skills needed to get started in open science. A Mozilla Science Study Group can use this series to introduce open science over an academic semester.

Help Us Develop This Curriculum

We're trying to answer these questions:

Let us know your thoughts in the comments!

Sessions

  1. Introduction: what & why
    • What skills will be part of this series?
      • working openly through the entire process (not just warehousing things on the web afterwards) in order to leverage collaboration
      • emphasizing legibility of research outputs for the sake of reuse & reproducibility
    • Why do these things matter?
      • lit review on citation benefits, efficacy benefits, retraction scandals & efficiency.
    • Sources: Working Open guide, TBD
  2. Open Data I: Standards & Legibility
    • What is an ontology?
    • How to effectively use data standards and make data legible?
    • Sources: TBD
  3. Open Data II: Clean Data
    • What are 'clean' and 'dirty' data, and why does the difference matter?
      • how to keep data organized and easy to reuse at a later date (including in-house reuse); consider metadata, storage and formats.
    • Best practices for making a reusable dataset when no standard exists.
    • Sources: TBD
  4. Collaboration I: Version Control
    • Basic git, with an emphasis on getting to GitHub as a platform for sharing & collaboration.
    • Source: TBD
  5. Collaboration II: Roadmapping
    • How to lay out a project for effective collaboration.
    • Source: Working Open guide.
  6. Collaboration III: Code Review
    • How to set expectations for good contributions that lead to easy-to-review code
    • How to make the code review process fast and efficient
    • Source: Working Open guide, Code Review Teaching Kit
  7. Code Wrangling I: Sustainable Coding
    • Effective use of documentation.
    • Producing end-to-end analysis automation scripts (R, Python, Shell, or make); understanding of how a well-made automation script serves as 'living documentation'.
    • Sources: TBD
  8. Code Wrangling II: Testing
    • Writing test suites to ensure code quality & build trust to support reuse.
    • Sources: this lesson in Python, TBD in R.
  9. Code Wrangling III: Code Packaging
    • Making & distributing packages to support reuse & collaboration.
      • discussion of useful formalisms for organizing data & code in packages / repos
    • Sources: this lesson in Python, and this lesson in R
  10. Publishing & Communication I: Citation & Discoverability
    • Software & data citation
      • DOIs
      • comments on how this addresses discoverability of code & data
    • Authoring for the Web
      • markdown / knitr
      • metadata
    • Sources: Working Open guide, TBD
  11. Publishing & Communication II: The Research Cycle
    • Strategies for opening the entire research process:
      • Grant process
      • Online lab notebooks
      • blogging, twitter & social media
      • protocol publishing
      • study pre-registration
  12. Publishing & Communication III: Licensing
    • open access publishing
      • comments on impact on science in the Global South / decoupling access from privilege
    • Why are licenses necessary?
    • What can they do? What can't they do?
    • Which ones are the most important and how do they work?
    • How to choose a license, and the intersection of licensing and copyright
    • The importance of agreeing on a license explicitly and early in a collaboration
    • sources: TBD
  13. Change Making

bkatiemills commented 9 years ago

@Blahah ha, it could fill one - but it would be nice to wrap up this course with something that sets people on a course of action with the new things they learned. Do you think there's a useful way to go about this in only one session?

ivanhanigan commented 9 years ago

For the section on 3. Open Data II: Clean Data

I recommend some 'convention over configuration' advice, and links to evidence-based recommendations for filing systems. My faves are:

A folder structure for statistical programmers, recommended in a 2008 book

\ProjectAcronym
    \- History starting YYYY-MM-DD
    \- Hold then delete 
    \Admin
    \Documentation 
    \Posted
         \Paper 1
             \Correspondence 
             \Text
             \Analysis
    \PrePosted 
    \Resources 
    \Write 
    \Work
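
A layout like this is easy to script, so that every new project starts from the same convention. Here is a minimal Python sketch of that idea. It is not taken from the book: the function name `make_skeleton`, the default of today's date for the History folder, and the single "Paper 1" subtree are assumptions for illustration.

```python
from datetime import date
from pathlib import Path

# Folder names copied from the layout above; the nesting under Posted
# follows the single "Paper 1" example.
FOLDERS = [
    "- Hold then delete",
    "Admin",
    "Documentation",
    "Posted/Paper 1/Correspondence",
    "Posted/Paper 1/Text",
    "Posted/Paper 1/Analysis",
    "PrePosted",
    "Resources",
    "Write",
    "Work",
]


def make_skeleton(root, started=None):
    """Create the project folders under `root` and return the root path.

    `started` is the project start date; defaulting to today is an assumption.
    """
    root = Path(root)
    started = started or date.today()
    folders = [f"- History starting {started.isoformat()}"] + FOLDERS
    for name in folders:
        # parents=True builds intermediate folders; exist_ok makes reruns safe
        (root / name).mkdir(parents=True, exist_ok=True)
    return root
```

Calling `make_skeleton("ProjectAcronym")` creates the tree in the current directory. Running it again leaves existing folders untouched.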

Simple R analysis

This concept was originally introduced by Josh Reich as the LCFD framework in a Stack Overflow answer (http://stackoverflow.com/a/1434424), and later encoded in the makeProject R package (http://cran.r-project.org/web/packages/makeProject/makeProject.pdf).

# choose your project dir
setwd("~/projects")   
library(makeProject)
makeProject("makeProjectDemo")

# gives
# /makeProjectDemo/
#     /code/*.R
#     /data/
#     /DESCRIPTION
#     /main.R

# in main.R you put
source("code/load.R")
source("code/clean.R")
source("code/func.R")
source("code/do.R")
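
For groups teaching in Python rather than R, the same load / clean / func / do split could be sketched in a single file. This is only an illustration of the idea and not part of makeProject: the function names, the tiny in-memory CSV, and the `site` / `temp_c` columns are all made up for the example.

```python
import csv
import io
from statistics import mean

# Hypothetical raw data; in a real project load() would read from data/
RAW_CSV = """site,temp_c
A,12.5
A,
B,14.0
B,15.0
"""


def load(text=RAW_CSV):
    """load: read the raw data and nothing else."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """clean: drop rows with a missing measurement and convert types."""
    return [
        {"site": r["site"], "temp_c": float(r["temp_c"])}
        for r in rows
        if r["temp_c"].strip()
    ]


def mean_by_site(rows):
    """func: a reusable helper used by the analysis step."""
    groups = {}
    for r in rows:
        groups.setdefault(r["site"], []).append(r["temp_c"])
    return {site: mean(values) for site, values in groups.items()}


def do(rows):
    """do: the analysis itself, built on the helpers above."""
    return mean_by_site(rows)


def main():
    # Same order as main.R: load, clean, (func), do
    return do(clean(load()))


if __name__ == "__main__":
    print(main())
```

Keeping each stage in its own function (or its own file, as in the R version) means a collaborator can rerun or replace one step without reading the whole analysis.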

More complicated R framework for data analysis

/project/
    /cache/
    /config/
    /data/
    /diagnostics/
    /doc/
    /graphs/
    /lib/
        /helpers.R
    /logs/
    /munge/
    /profiling/
        /01_profile.R
    /reports/
    /src/
        /01_EDA.R
        /02_clean.R
        /03_do.R
    /tests/
        /01_tests.R
    /README
    /TODO
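
One payoff of 'convention over configuration' is that a fixed layout can be checked automatically, for example before sharing a project. The sketch below assumes a hypothetical rule that only a handful of the entries above are required. The choice of required entries and the name `missing_entries` are illustrative and not part of any R framework.

```python
from pathlib import Path

# Subset of the layout above that this hypothetical check treats as required
EXPECTED_DIRS = ["data", "munge", "src", "tests", "reports"]
EXPECTED_FILES = ["README"]


def missing_entries(project_root):
    """Return the expected folders and files that are absent, in listed order."""
    root = Path(project_root)
    missing = [d + "/" for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing
```

An empty list from `missing_entries("project")` means the project matches the convention. Anything else is a short to-do list for whoever is preparing the project for reuse.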

For metadata I like EML (Ecological Metadata Language).

bkatiemills commented 9 years ago

Thanks, @ivanhanigan, this is great stuff! We talked about similar things at Study Group Journal Club at UBC the other week - we read this paper and this other paper which touch on related topics - definitely all things to include. Thanks again for the notes!