Open Science Utility Belt #7

bkatiemills opened 9 years ago

bkatiemills commented 9 years ago

Open Science 101

This is a session series introducing practical skills needed to get started in open science. A Mozilla Science Study Group can use this series to introduce open science over an academic semester.

Help Us Develop This Curriculum

We're trying to answer these questions:

Let us know your thoughts in the comments!


  1. Introduction: what & why
    • What skills will be part of this series?
      • working openly through the entire process (not just warehousing things on the web afterwards) in order to leverage collaboration
      • emphasizing legibility of research outputs for the sake of reuse & reproducibility
    • Why do these things matter?
      • lit review on citation benefits, efficacy benefits, retraction scandals & efficiency.
    • Sources: Working Open guide, TBD
  2. Open Data I: Standards & Legibility
    • What is an ontology?
    • How to effectively use data standards and make data legible?
    • Sources: TBD
  3. Open Data II: Clean Data
    • What is 'clean' vs 'dirty' data, and why do they matter?
      • how to keep data organized and easy to reuse at a later date (including in-house reuse); consider metadata, storage and formats.
    • Best practices for making a reusable dataset when no standard exists.
    • Sources: TBD
  4. Collaboration I: Version Control
    • Basic git, with an emphasis on getting to GitHub as a platform for sharing & collaboration.
    • Source: TBD
  5. Collaboration II: Roadmapping
    • How to lay out a project for effective collaboration.
    • Source: Working Open guide.
  6. Collaboration III: Code Review
    • How to set expectations for good contributions that lead to easy-to-review code
    • How to make the code review process fast and efficient
    • Source: Working Open guide, Code Review Teaching Kit
  7. Code Wrangling I: Sustainable Coding
    • Effective use of documentation.
    • Producing end-to-end analysis automation scripts (R, Python, Shell, or make); understanding of how a well-made automation script serves as 'living documentation'.
    • Sources: TBD
  8. Code Wrangling II: Testing
    • Writing test suites to ensure code quality & build trust to support reuse.
    • Sources: this lesson in Python, TBD in R.
  9. Code Wrangling III: Code Packaging
    • Making & distributing packages to support reuse & collaboration.
      • discussion of useful formalisms for organizing data & code in packages / repos
    • Sources: this lesson in Python, and this lesson in R
  10. Publishing & Communication I: Citation & Discoverability
    • Software & data citation
      • DOIs
      • comments on how this addresses discoverability of code & data
    • Authoring for the Web
      • markdown / knitr
      • metadata
    • Sources: Working Open guide, TBD
  11. Publishing & Communication II: The Research Cycle
    • Strategies for opening the entire research process:
      • Grant process
      • Online lab notebooks
      • blogging, twitter & social media
      • protocol publishing
      • study pre-registration
  12. Publishing & Communication III: Licensing
    • open access publishing
      • comments on impact on science in the Global South / decoupling access from privilege
    • Why are licenses necessary?
    • What can they do? What can't they do?
    • Which ones are the most important and how do they work?
    • How to choose a license, and the intersection of licensing and copyright
    • The importance of agreeing on a license explicitly and early on a collaboration
    • sources: TBD
  13. Change Making
kaythaney commented 9 years ago

Thanks for this, Bill! Would be worth adding links to the Working Open guide here, and seeing if you could line up with some of the language and key categories there to strengthen / augment that work. That also may help with some of the verbiage issues (like "Programming" as a header - not sure that's the best term here, crisp up the language and minimize jargon).

Great start, and more comments to come!

bkatiemills commented 9 years ago

yep, these lessons are going to pull heavily from the WOG, once we agree the curriculum. Changed 'programming' -> 'code wrangling'.

abbycabs commented 9 years ago

Suggested language for the beginning:

Open Science 101

This session series introduces practical skills needed to get started in open science. A Mozilla Science Study Group can use this series to introduce open science over an academic semester.

Help Us Develop This Curriculum

We're trying to answer these questions:

Let us know your thoughts in the comments!



Going through actual sessions now :) Really excited for this work!

bkatiemills commented 9 years ago

From the twitterverse:

@ttimbers suggests data storage & archiving - how to find data associated with a study, how to organize your own data for future reuse; also metadata, storage and formats. @minisciencegirl suggests organizing data, with useful naming schemes, structure etc.

taddallas commented 9 years ago

+1 for @minisciencegirl 's suggestion about naming schemes

Is there any reason to have the Open Data sections after Collaboration sections? It might flow better (in my mind) from data wrangling into setting the wrangled data free (i.e. Open Data), then to dive into collaborations/workflows/code review/version control. This change could also make the transition into publishing easier, as collaborations may lead to publications ... :pray:

abbycabs commented 9 years ago

I'm thinking along the same lines as @taddallas re:flow. Teaching packages before git seems off to me.

bkatiemills commented 9 years ago

The ordering of 'Code Wrangling', 'Collaboration', 'Open Data' and 'Publishing & Communication' is actually just the order I thought of them in :)

So, how about the order:

taddallas commented 9 years ago

Looks good to me!

One tiny thing: I noticed that much of the material uses Python. Perhaps it would be worthwhile to also show some R examples, as some pretty solid tools for Open Science are built around R (e.g. reproducible analyses and manuscript writing with R Markdown, testing with testthat, etc.). This point is null if the course is designed to be Python-specific, or if you think there'd be too much overlap with the R utility belt you already have.

bkatiemills commented 9 years ago

Nope, R-flavoured implementations of these techniques are definitely something we want! Which will get used depends on the audience, but we definitely want both options for the Code Wrangling section. The current examples are Python for no other reason than I speak Python. That said, I think there was a packages in R lesson from UBC recently I can dig up and link here - if you have a good lesson for testing in R, send it on by!

blahah commented 9 years ago

The only thing that is conspicuously missing to my mind is licenses - they are fundamental to open science and are relevant to all the sections above. I would think these are the most important aspects of licenses to cover:

bkatiemills commented 9 years ago

@Blahah - totally agree, added your points to an additional section under 'publishing & communication' - thanks! One thing that would be super helpful in that section, is ideas for hands-on activities, and engaging ways to introduce things like licenses as well as code and data citation; definitely A-list important stuff, but runs the risk of turning into a really dry lecture about DOIs and copyright.

blahah commented 9 years ago

I think a nice way to introduce licenses and citation is by doing a set of small hands-on data mining tasks. Introducing some frustrating scenarios that are solved by proper licensing and good data citation should be memorable. We just need a paper with great data but no license, and a paper that does something good with someone else's data but doesn't cite it properly.

noamross commented 9 years ago

This may be expanding the scope a bit, but some topics that would have been helpful for me early on, before I really did much coding or had a solid project together, would have been:

bkatiemills commented 9 years ago

@noamross could that first point fit with the social media unit?

I'd love to hear your ideas on your second point - to be honest, content aggregation is a pretty weak part of my own game, I've never found a method I really liked.

noamross commented 9 years ago

Yes, lab notebook could go in social media, but there's a fair amount of the topic that isn't explicitly social: metadata/tagging of notes, formats and organization for searching, plain-text for posterity, etc.

On collecting content, I'm similar. I have a semi-working system of Mendeley + a collection of tagged plain-text notes, but I'm not sure how well it works in terms of collaboration. @cboettig and I once wrote a review together where we built an annotated bibliography using markdown + bibtex, but it felt more like a one-use hack than a system. Ideas from others would be welcome.

Daniel-Mietchen commented 9 years ago

Great suggestions so far.

I agree on the "importance of agreeing on a license explicitly and early", and thus think this should come at the beginning of the course and not at the end. As @Blahah mentioned, this should work well after some moments of reuse-rights-related frustration, which unfortunately remain all too easy to create.

One aspect that I am missing is an overview of where things are or are not open along the research cycle - we are making progress with making research outputs more widely available, but the research process is still mostly closed (save a few open notebooks), and funding is basically a dark corner (very few proposals are open, and basically no funding decisions).

dsalo commented 9 years ago

Working with collaborators who don't necessarily Get It about the whole "open" thing. This is one of the top questions I get whenever I talk open with people.

DOIs, and how they are not magic but are important. Data citation. Data journals and other data-publication venues. Data-use tracking and metrics, and how to use them to make a tenure case or a grant proposal stronger.

Where to get help shoring up your weak spots -- nobody can do everything!

Basic digital hygiene: backups, basic security, basic digital preservation (why "I'll put it on my website!" is a lousy idea long-term).

Navigating openness vs. privacy in human-subjects and other sensitive research.

How to use Excel, if you must, without making everyone else hate you. What to use when Excel stops being useful (stats packages, relational databases).

tgardner4 commented 9 years ago

Would love to see design of experiments, multiple testing corrections, and quality engineering (reducing variability) of experiments in the curriculum. (Happy to contribute on these subjects.)

wolass commented 9 years ago

To publishing: Digital object identifiers - their importance in citing and version control. (NOT RESTRICTED TO CrossRef's DOI)

@noamross I'm using the knitcitations from @cboettig on a daily basis. So if this is the result of your cooperation it certainly wasn't a one-time hack :)

I would underline the importance of learning markdown and using knitr when collaborating on scientific projects. The most important skills for me were:

  1. Statistics (Coursera courses)
  2. R programming
  3. Markdown
  4. Learning the pipeline: Markdown to Word and PDF using the knitr package in RStudio with knitcitations and BibTeX
  5. Using Mendeley as bibliography database with quick search option (deadly useful)
  6. Putting my results on the project page and sharing them with collaborators
  7. LaTeX <- but this is sth extra
Celyagd commented 9 years ago

FlorencePIron commented 9 years ago

I think that it would be a pity to only present the technical dimensions of open science in this course. Why not explain the values and ideals behind open science and even the tensions between the diverse conceptions of open science? The social and epistemological dimensions of open science? Many researchers are working on that too! For instance, the course could explain that a generalised open science will allow students and scientists from the Global South to participate better in the Global North scientific "conversations". Or, conversely, that it would allow researchers from the North to discover the science made in the Global South, therefore enlarging their social, epistemological and cultural horizons. It should also explain that open science could mean opening science to non-scientists (and not only industry), therefore bringing science and society closer and making science more relevant to local challenges. I hope that you do not intend to create a course which will only present the neo-liberal discourse of innovation typical of the knowledge economy paradigm, but that you will show the subversive strength of open science when it is associated with a clear awareness of the social, economic and political issues of our time.

bkatiemills commented 9 years ago

Great stuff, all! Some responses:

@noamross & @Daniel-Mietchen : I've created a new section in the publishing & communication unit meant to focus on opening up the full research life-cycle, to Daniel's point; Noam, I think lab notebooks fit very nicely in there.

@dsalo : I love your idea about getting others on board with open practices; can you expand a bit, or point to some references? Great idea, but tbh I always just kind of did it and hoped to not get fired later (not a real solution :). As for DOIs (+ @wolass ), totally agree; they fit implicitly in code and data citation in my mind, but I've called them out specifically there now.

@tgardner4 : super valuable content, but can you expand a bit on how we can do this in a cross-disciplinary way? Many Study Groups have ecologists and physicists at the same table; experimental design procedures will diverge quite quickly!

@wolass we've done markdown + knitr lessons before, they were really popular! Rather than diving down a specific toolchain (might get too discipline-specific if we do that), what if we think about authoring for the web? The idea being to create content that is not simply on the web, but can be linked to, described by metadata, machine read, and consumed / distributed in 'webby' ways - I think that will touch a lot of what you mentioned, and fits into the unit on discoverability.

@Celyagd this is great stuff! Our plans hit on a lot of the same things, but what I'd be especially interested in is getting a better picture of the activities / projects that seem to be implied in your outline; Study Group is a very hands on kind of thing, so coming up with illustrative projects is really important to this discussion.

@FlorencePIron great points all; we frame this work around directly applicable and practical skills, because that's what puts butts in seats in our experience. However, that framing does not at all preclude having the conversations you want to have; for example, the inability of university libraries in the Global South to subscribe to a full range of for-profit journals, and the abrupt loss of journal access by Greek academics during their recent budget crisis are things I would expect to see comments on in the Open Access Publishing section. Help inserting that broader cultural context as the curriculum comes into focus would be very welcome.

tgardner4 commented 9 years ago

@BillMills Experimental design is surprisingly general when you understand the core principles. A grandfather of the field (Fisher) developed his method in agricultural experiments (hence the term "split-plot" designs for some specific structures). Yet these same principles are routinely applied in engineering, biology and physics. Below is a potential outline for content:

Quality in Experimental Design - Draft Curriculum for Open Tutorial (C) Riffyn 2015


Additional objectives

Content (summary)

PART 1 Why should I care? There’s gold at your feet and you don’t even know it. Instead you’re chasing phantoms.

  • Assessing error (process modeling & variance components): What are all the potential sources of error? How do they propagate?
  • Structuring data: How do I organize and manipulate my data for analysis?
  • Statistical foundations: What is the error on my measurements?
  • Testing: Are two measurements different?
  • Multiple testing corrections: Which measurements are different from each other, or from baseline?
  • Regression: Which process variables really matter? Which ones don’t?
  • DoEs (root cause analysis): How can I learn the most with the least effort?
  • Outliers / non-additive noise: How do I handle the weird stuff?

  • Process capability, control: How well does my process/assay perform? When is it falling apart?
  • Process modeling / goal setting: How do I set the target performance?
  • Correcting sources of error: I know what the problem is; how do I deal with it?
  • Assay qualification: When is my assay “good”? (Putting all of the above together.)


blahah commented 9 years ago

@tgardner4 The above looks like a nice start on experimental design. It would be valuable in a general science curriculum, but is it specific to open science?

thomasmboa commented 9 years ago

I agree with Florence Piron: open science is not only open access, open data and open source. There is a social dimension of open science which brings society/people together with the sciences; this dimension also considers local knowledge, and encourages citizen science, science shops and commons. So if you cannot integrate this dimension in your curriculum, it is better to remove open science from your title.

tgardner4 commented 9 years ago

@Blahah As I see it, these topics are absolutely fundamental. If you can't produce a trustworthy data point, you can't share it. If you can't share it, you don't have open science.

Good coding practices are awesome, but if that code is processing rubbish data, it can only generate rubbish results. And sadly, none of the topics I outlined above are taught adequately in a general science curriculum. Just pick a random sampling of scientists and ask them: what is power, what is variance analysis, what is false discovery rate, and when/how should you apply them? Almost no one knows. And that means no one can truly trust each other's results.

Don't hesitate to challenge my views if you disagree - these are born of two decades in the lab. But I'm very interested in alternative views!
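
To make the "false discovery rate" question above concrete, here is a minimal sketch of one common multiple testing correction, the Benjamini-Hochberg procedure, in plain Python. The function name and the example p-values are hypothetical illustrations, not material from the proposed curriculum.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans: True where a hypothesis is rejected at FDR level alpha."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from eight comparisons against a baseline.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
naive = [value < 0.05 for value in p]  # no correction: 5 "discoveries"
corrected = benjamini_hochberg(p)      # with correction: 2 discoveries
```

With these made-up numbers, the uncorrected threshold flags five measurements as different, while the corrected procedure keeps only the two smallest p-values.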

noamross commented 9 years ago

@tgardner4 I agree that these topics are fundamental but also think they are somewhat out of scope for a ~12 lesson group-study on open science. There are, however, some important connections between experimental design and open science that could be addressed, such as:

tgardner4 commented 9 years ago

@noamross I agree with your points. What I outlined is a study group unto its own. I think you propose a nice solution though - an intro to the topic (and perhaps a pointer to a separate study group dedicated to a full treatment). By including it in the open science curriculum - even as an intro - you would teach participants that these are core issues that can't be overlooked in proper scientific pursuit.

I would also suggest that the original open science outline described above (the very first post in this thread) is heavily tilted toward a view that open science = coding + publishing. Absent from this curriculum is anything about experimentation or the scientific process. My suggestions are a reaction to this gap. When I hear "science" my mind goes to experimentation. When I hear "open science" I think: "collaboration on the design, execution and sharing of experiments & results." Code and publishing are a necessary, but not sufficient, portion of the scientific process.

abbycabs commented 9 years ago

Taking comments from @tgardner4 @noamross and more, there might be more clarity if we change the title to:

Open Science & Data: open research practices when working with scientific data

This could be a follow up series after a broader 'Introduction to Open Science'.

blahah commented 9 years ago

@tgardner4 I agree with all your points - these are essential skills and they are not taught sufficiently well in general science courses. I still think that they are not part of open science specifically, and are too many steps removed from the core toolset of open science to feature heavily in the curriculum. Having a short discussion of the relevant aspects where they are directly related (for example in reusing open data), then linking out to a resource which would be developed separately from this curriculum seems like a good way to go.

dsalo commented 9 years ago

@BillMills There's vastly less written about this than I would like. :/ Sometimes introductory project-management techniques are a way in. I have a slidedeck I'd be happy to share with you if you think it would help; otherwise, maybe the way to approach it is an unconference-style discussion, maybe with a plausible case study as an example.

Another way that can work is using a "horror story" as a discussion seed. In this context, I might use a story about a data-ownership snafu (see examples at ) but obviously almost any interpersonal issue specific to open science can be made to work, if there's an available horror story.

blahah commented 9 years ago

@acabunoc that title sounds like it fits the content better

tgardner4 commented 9 years ago

On Aug 7, 2015, at 4:09 PM, Richard Smith-Unna wrote:


@acabunoc that title sounds like it fits the content better


ctb commented 9 years ago

A few cents from wandering through this discussion --

"How to set expectations for good contributions that lead to easy-to-review code" - phrasing sets off alarm bells. Also, testing comes 4 sessions later, which is the wrong way 'round - how do you review code that you can't trust? ;). Tests are a prerequisite for code review. Code coverage analysis is missing, also. I would suggest a checklist among other things.

Code packaging? Nix it, IMO. Or move it much later. (Definitely well after testing.) Reasons: it has a lot of sysadminy type stuff that most people won't know or care about.

Soooooooooo you're saying DOIs enhance discoverability of code? Sounds like a theoretical point that doesn't actually work to me. Fine to mention it but DOIs + code are not terribly useful yet.

publishing and communications: copyright vs license should be in there, no?

Automation & scripting is missing from the entire discussion, and yet it's key to sharing any kind of workflow. <=> reproducibility, which is underemphasized.

In my experience, selling scientists on this stuff is 80% of the battle, once they show up. (Technical skills is the next 80% ;). More and stronger motivation. Very few people seem to worry about incorrect results (oddly enough) so efficiency and reputation is a good focus.

Twitter probably belongs in social media, too. Lurking, favoriting, retweeting, subtweeting.

Looks great overall - I think there are probably many ways through all this material, but this is a nice collection of things to consider for any such course!

bkatiemills commented 9 years ago

@tgardner4 yes! Including a pointer to your content and then following it up as its own series of lessons is an A+ solution, 100% on-board with that.

@acabunoc sorry - which title do you want to replace with that, @tgardner4's or mine? Happy to comply either way, let me know.

@dsalo: please link me to your slide-deck! I'm really keen for this content, but I need some help bringing it into focus.


ctb commented 9 years ago

On Sat, Aug 08, 2015 at 02:27:02PM -0700, Bill Mills wrote:

  • would keep packaging if rephrased - want to encourage people to break their work out into small, reusable parts rather than big plates of spaghetti, but open to other ways to address this.

modularity, maybe? I don't know how this works in R but in Python it is super easy to do syntactically, w/module globals lowering the cost (i.e. unlike Java you don't need to make everything a class to have some privacy ;)

  • would appreciate some help rephrasing the importance of contribution guidelines to address those alarm bells; want to touch on lessons learned here and here, but always open to better phrasing.

I was thinking about this after my comment. How about tying it together with a slightly more advanced lesson on git/github/pull requests?

  • Automation: absolutely agree, need to think about where & how to introduce this, but it should definitely go in. Perhaps folding in @sjackman 's make lesson?

+0.5, or start with some shell/R/Python scripts that do soup-to-nuts analysis (load/transform data/make graph/output summary) and then tack on assert statements. can be done inside knitr too, I think?
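
A minimal sketch of the soup-to-nuts idea above (load, transform, graph, summarize, with assert statements tacked on), using only the Python standard library. The data, column names, plausibility bounds and function names are all hypothetical, and a text bar chart stands in for a real graph.

```python
import csv
import io
import statistics

# Hypothetical raw data; a real script would read a file kept under version control.
RAW_CSV = """site,temperature_c
A,12.5
A,13.1
B,15.0
B,14.2
"""


def load(text):
    """Parse CSV text into a list of row dictionaries."""
    rows = list(csv.DictReader(io.StringIO(text)))
    assert rows, "input contains no data rows"
    assert set(rows[0]) == {"site", "temperature_c"}, "unexpected columns"
    return rows


def transform(rows):
    """Group temperatures by site, converting strings to floats."""
    by_site = {}
    for row in rows:
        value = float(row["temperature_c"])
        # Assumed plausibility bounds; adjust for the data at hand.
        assert -90.0 <= value <= 60.0, f"implausible temperature: {value}"
        by_site.setdefault(row["site"], []).append(value)
    return by_site


def summarize(by_site):
    """Compute the mean temperature per site."""
    summary = {site: statistics.mean(values) for site, values in by_site.items()}
    assert len(summary) == len(by_site), "a site was lost during summarizing"
    return summary


def text_chart(summary):
    """Render one '#' per rounded degree as a stand-in for a real plot."""
    return [f"{site} {'#' * round(mean)} {mean:.1f}"
            for site, mean in sorted(summary.items())]


def run(text=RAW_CSV):
    """Run the whole analysis from raw text to summary."""
    return summarize(transform(load(text)))


if __name__ == "__main__":
    for line in text_chart(run()):
        print(line)
```

Because every stage checks its own assumptions, a bad input fails loudly at the step that noticed it rather than silently producing a wrong summary.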

sjackman commented 9 years ago

An end-to-end R/Python script is more important than shell/make. If you plan to teach shell scripts, I would teach Makefile scripts soon after. I consider Makefile scripts to be structured, self-documenting shell scripts, and better suited to data analysis than pure shell scripts.

ctb commented 9 years ago

On Mon, Aug 10, 2015 at 06:17:34PM -0700, Shaun Jackman wrote:

An end-to-end R/Python script is more important than shell/make


sjackman commented 9 years ago

Here's a small example of using make for a data analysis pipeline that I used for teaching a one hour introduction to make that I created for the Scientific Programming Study Group at SFU:

sjackman commented 9 years ago

I am adamant however that introductory make is more important than advanced sh.

ctb commented 9 years ago

On Thu, Aug 13, 2015 at 08:26:22AM -0700, Shaun Jackman wrote:

I am adamant however that introductory make is more important than advanced sh.

Interesting. The former is the right way to do things for workflows, the latter is important for personal efficiency...

sjackman commented 9 years ago

I tend to record all my analyses big and small in a Makefile script, so workflows and personal efficiency are nearly one and the same for me. Where you draw the line between basic shell and advanced shell is clearly pretty fuzzy though. To clarify, I would teach make before I taught shell features useful for writing large shell scripts, such as shell functions and parsing options.

quantheory commented 9 years ago

Speaking from the perspective of an "early career" type who did a lot of time as a software engineer post-undergrad, shell scripting should be deliberately minimized. I don't want to knock it too much, since there are lots of neat "tips n' tricks" for Bash, and probably other shells, especially for the command line. If nothing else, I've certainly won some benefits from fancy .bashrc/.bash_profile scripts.

But maintenance and debugging tend to be quite painful for shell scripts. A well-tested script in a web-focused language like Perl is better, but a script in Python/Ruby really wins on syntax and debug-ability. (I don't happen to have any R experience, partly due to the fields I've worked in.)

Build systems are probably required reading at some point. The tricky bit is that build systems are hard to work in most edge cases. Autotools and CMake make some things easier over bare Makefiles, but they tend to fall down for cross-compiling and for HPC, particularly if you have platform-specific optimization flags. (What percentage of scientists think deeply about whether they are "respecting" user-specified CFLAGS in their Makefiles/CMakeLists/whatever?) For a course this short, and for such a broad audience as "students/scientists", it's hard to see anything to recommend, except for some basic knowledge of make.

dsalo commented 9 years ago

@BillMills Here you go, quite short but I hope pithy:

sjackman commented 9 years ago

Nice slides, Dorothea. I do my own academic writing of manuscripts in Markdown stored on GitHub (e.g. UniqTag), and I've faced resistance from senior colleagues who will consider using only Word with track changes and e-mail/Dropbox. I've taken to exporting Markdown to DOCX with Pandoc, soliciting edits with track changes, and then incorporating those changes back into the original Markdown. Anyone else face a similar situation?
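
For anyone wanting to script the Markdown-to-DOCX export step described above, a minimal sketch in Python follows. It assumes pandoc is installed and on the PATH; the file names and function names are hypothetical, and merging the tracked changes back into the Markdown remains a manual step.

```python
import shutil
import subprocess


def pandoc_command(source, target):
    """Build the pandoc call; pandoc infers formats from the file extensions."""
    return ["pandoc", source, "-o", target]


def convert(source, target):
    """Convert a document with pandoc, failing clearly if pandoc is missing."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    subprocess.run(pandoc_command(source, target), check=True)


# Example (hypothetical file names):
# convert("manuscript.md", "manuscript.docx")
```

Keeping the command in a small function makes the export repeatable, so the DOCX sent to collaborators can always be regenerated from the Markdown source.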

cboettig commented 9 years ago

@sjackman Yes, I've often been in a similar situation and I've used that same strategy. (Which is also handy if the journal only accepts Word).

Alternatively, you can also just paste the source (e.g. LaTeX / md / Rmd) into a word document and send that. I've found this avoids some pitfalls of the pandoc conversion to Word (though this is improving), particularly where equations are concerned. This strategy was introduced to me by a senior colleague who has long worked in LaTeX while collaborating successfully with many Word-only folks. We've both found collaborators are perfectly happy to ignore the markup and just read + track-changes the text. Of course Word is a terrible text-editor that may play havoc with some character encodings, so you cannot always copy-paste the changes whole cloth.

Neither approach is ideal of course. Several collaborators always return documents to me marked in pen anyway, so the question of output format becomes irrelevant. Paper is the great interoperable standard. In the end, manually writing in the changes, as required by any of these approaches, doesn't take that much time and does force you to pay close attention.

blahah commented 9 years ago

@dsalo that deck is outstanding! Great stuff. Could you put a license on it? I would only disagree with one point: in my experience graduate students are ideal agents of change in scientific practice.

@sjackman Yeah, this plagues me in almost every project. It usually goes something like: me and some other collaborators work together on a github repo, authorea page, overleaf, or similar. Then at some point a senior collaborator insists we all start using word with track changes, destroys the automated reference management, and will only work by emailing copies back and forth. It's then a huge effort to restore the paper to a nice format at the end. Another kicker is when they insist on manually editing figure images, rather than letting me edit the code and regenerate, because "it's more efficient for them". :rage: The only solution I can think of is to stop working with those people, which is what I'm trying to do.

bkatiemills commented 9 years ago

Great comments again, all! A few scattered responses:

@ctb @sjackman & other build/workflow management enthusiasts: I think you hit on something important with focusing on end-to-end automation of an analysis; superficially this is a convenience strategy, but more deeply this is a communication strategy for helping others have a hope of reproducing an analysis. This definitely belongs in this curriculum (perhaps without diving toooo far down the make rabbithole). Reviewing the curriculum so far, much of the Sustainable Coding section is redundant with the packaging / modularity unit that follows; I'll re-write this unit momentarily to reflect your conversation.

@dsalo What a fantastic slide deck! And yes - sometimes with a new project, people just need to come into work on Monday and find out everything got version controlled over the weekend :) Your comments make me want to add a 'change making' unit as the last session in the course, but I'm struggling to keep things to size - I'm just going to add it on anyway for now, and we'll see how things evolve once the curriculum actually gets written; I suspect some topics will move / transform as that work gets done.

blahah commented 9 years ago

@BillMills I vote that 'change making' should be a separate course