How to teach students to organize a data-driven project

ramanshah commented 9 years ago

A recurring challenge in my work with students and postdocs is how to teach them to organize their work. This is something I'd enjoy discussing at the meeting.

Data-driven research seems to me to be a corner case that demands several conflicting organizational styles at once.

First, there is the temporal organization of experiments or attempts at analysis. One keeps a chronological diary and organizes artifacts like computational results or figures by referencing them to the diary in time-stamped subdirectories. My Ph.D. started in experimental physical science, and even as I transitioned to theory and data analysis, the size and complexity of the data sets and codes I faced were low enough that I could keep a paper notebook as the master organization tool, manually version codes, and keep many duplicates of the data sets I worked on. This way, everything was clean, simple, and organized. I've found that sticking purely to this organizational style is impossible in my current group due to the size of the data sets and the complexity of the software.
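
A minimal sketch of this temporal style, for illustration only. The function name `start_entry`, the `diary.md` file, and the directory naming scheme are all hypothetical choices, not something described in this thread: each attempt gets its own time-stamped subdirectory, and a one-line diary entry points at it.

```python
from datetime import datetime
from pathlib import Path


def start_entry(root, label, note, now=None):
    """Create a time-stamped subdirectory and log it in a chronological diary.

    Hypothetical layout: <root>/<YYYY-MM-DD_HHMMSS>_<label>/ plus <root>/diary.md.
    """
    now = now or datetime.now()
    stamp = now.strftime("%Y-%m-%d_%H%M%S")
    entry_dir = Path(root) / f"{stamp}_{label}"
    # Refuse to reuse a directory so that earlier results are never overwritten.
    entry_dir.mkdir(parents=True, exist_ok=False)
    diary = Path(root) / "diary.md"
    # Append only: the diary stays in chronological order.
    with diary.open("a", encoding="utf-8") as f:
        f.write(f"- {now.isoformat(timespec='seconds')} [{entry_dir.name}] {note}\n")
    return entry_dir
```

Figures and results for an attempt would then be written under the returned directory, so the diary line is enough to find them later.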

Then there is the semantic organization of perfectable products like software. In my short life as a corporate software engineer, it seemed that a version control system like Git was a complete solution to this kind of organizational problem, albeit one with a very long learning curve before one becomes truly nimble and productive. I developed and taught an intermediate-level short course on Git at U Chicago, targeted at students who had had a quick start (many had attended Software Carpentry) and then spent a year or two back in their research, usually gaining limited proficiency with Git and eventually hitting a wall. This course was well received. While the material is quite standard (cf., for instance, the first third of Pro Git by Scott Chacon), I found a lack of open materials for an in-person course on this subject. My materials are linked below, and I'd be happy to discuss them if there is interest.

https://github.com/ramanshah/intermediate_git

Finally, there is the data-centric organization of big data sets that can't be copied or versioned naively. To my knowledge, this style has the weakest tooling. Often, students use a communication tool like email or Slack to discuss informally where to find large data sets sitting on a cluster. These data sets usually beget cached results for each step of a multi-step preprocessing pipeline. The cached results are organized in ad hoc directory trees, and their locations are communicated in the same informal way.
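
One possible way to make those locations less informal is a small shared manifest that records where each data set lives and which data set it was derived from. The sketch below is only an illustration under stated assumptions: the JSON layout, the function names `register` and `lineage`, and the field names are hypothetical, and it records locations without copying or checksumming the data.

```python
import json
from pathlib import Path


def _load(manifest_path):
    """Read the manifest, or return an empty one if it does not exist yet."""
    path = Path(manifest_path)
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def register(manifest_path, name, location, derived_from=None, note=""):
    """Record where a data set lives and which data set it came from."""
    entries = _load(manifest_path)
    if name in entries:
        raise ValueError(f"{name!r} is already registered")
    if derived_from is not None and derived_from not in entries:
        raise KeyError(f"unknown parent data set {derived_from!r}")
    entries[name] = {
        "location": str(location),     # e.g. a path on the cluster
        "derived_from": derived_from,  # None for raw data
        "note": note,
    }
    text = json.dumps(entries, indent=2, sort_keys=True) + "\n"
    Path(manifest_path).write_text(text, encoding="utf-8")
    return entries[name]


def lineage(manifest_path, name):
    """Return the chain of data sets from `name` back to the raw data."""
    entries = _load(manifest_path)
    chain = []
    while name is not None:
        chain.append(name)
        name = entries[name]["derived_from"]
    return chain
```

Because the manifest is a small text file, it could itself be kept under version control alongside the code, even when the data cannot be.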

My experience digging into individual student workflows is that data-driven research involves a particularly messy collision of these three organizational styles. Each student handles this differently, and often they handle it poorly enough that reproducing their work would be a massive forensics project once they've left and are out of contact.

I think there's a lot of room to improve how we train people to organize data-driven research. I've seen a lot of progress with some of our students just from explaining these three ways to organize their work, so that they get a sense of how to be disciplined with any one of them and stay conscious of the complexities that arise when they start to mix and match them. But there's still a long way to go.

ctb commented 9 years ago

Relevant: A Quick Guide to Organizing Computational Biology Projects, by William Noble (2009).

ramanshah commented 9 years ago

Thanks!

tracykteal / moore-ddd-training-club

How to teach students to organize a data-driven project #3