Overview of data science pipelines II

Welcome to the second installment of "introduction to data pipelines" at USGS, @jzwart!! :sparkles:

We're assuming you were able to navigate through the intro-to-pipelines course and that you learned a few things about organizing your code for readability, re-use, and collaboration. You were also introduced to two key ideas through the remake.yml file: a way to program connections between functions and files, and the concept of a dependency manager that skips parts of the workflow that don't need to be re-run.

Recap of pipelines I

First, a recap of key concepts that came from intro-to-pipelines :point_down:

Data science work should be organized thoughtfully. As Jenny Bryan notes, "File organization and naming are powerful weapons against chaos".
Capture all of the critical phases of project work with descriptive directories and function names, including how you "got" the data (in practice, we often use fetch for this phase).
Turn your scripts into a collection of functions, and modify your thinking to connect deliberate outputs from these functions ("targets") to generate your final product.
"Skip the work you don't need" by taking advantage of a dependency manager. There were some videos that covered a bit of make and drake, and you were asked to experiment with scipiper.
Investing in efficient reproducibility helps projects scale up with confidence.

This last concept was not addressed directly, but we hope that the small exercise of seeing rebuilds in action got you thinking about projects that might have much lengthier steps (e.g., several downloads or geo-processing tasks that take hours instead of seconds).
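
To make "skip the work you don't need" a bit more concrete, here is a minimal Python sketch of the idea behind a dependency manager. It is only an illustration and is not how scipiper, make, or drake are implemented. The names `TinyPipeline`, `fingerprint`, `fetch_data`, `summarize`, and `run` are hypothetical and exist only for this example. The sketch assumes each target can be identified by the name of the function that builds it plus the inputs that function receives, and it rebuilds a target only when that fingerprint changes.

```python
import hashlib


def fingerprint(command_name, inputs):
    """Hash a step's identity: the command name plus the inputs it depends on."""
    digest = hashlib.sha256()
    digest.update(command_name.encode())
    for value in inputs:
        digest.update(repr(value).encode())
    return digest.hexdigest()


class TinyPipeline:
    """Hypothetical toy dependency manager, for illustration only."""

    def __init__(self):
        self.fingerprints = {}  # target name -> fingerprint at last build
        self.values = {}        # target name -> value from last build
        self.build_log = []     # targets that were actually (re)built, in order

    def build(self, target, command, inputs):
        current = fingerprint(command.__name__, inputs)
        if self.fingerprints.get(target) == current:
            # Nothing this target depends on has changed: skip the work.
            return self.values[target]
        self.values[target] = command(*inputs)
        self.fingerprints[target] = current
        self.build_log.append(target)
        return self.values[target]


def fetch_data(site_id):
    # Stand-in for a slow download; the values are made up for the example.
    return [1.0, 2.0, 3.0] if site_id == "site_A" else [10.0, 20.0]


def summarize(values):
    # Stand-in for a processing step that depends on the fetched data.
    return sum(values) / len(values)


def run(pipeline, site_id):
    # Each target's inputs are the outputs of the targets it depends on.
    raw = pipeline.build("raw_data", fetch_data, [site_id])
    return pipeline.build("mean_value", summarize, [raw])
```

In this sketch, calling `run(pipeline, "site_A")` twice on the same `TinyPipeline` builds both targets only once. Switching to `"site_B"` changes the inputs, so both targets are rebuilt. With steps that take hours instead of seconds, that skipped second run is where the savings come from.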

What's ahead in pipelines II

In this training, the focus will be on tricks and tips for making better, smarter pipelines. You'll learn new things here that will help you refine your knowledge from the first class and put it into practice. Let's get started!

:keyboard: Activity: Add collaborators and close this issue to get started.

As with pipelines I, please invite a few collaborators to your repository so they can easily comment and review in the future. In the :gear: Settings widget at the top of your repo, select "Manage access" (or use this shortcut link). Go ahead and invite aappling-usgs and jread-usgs. [Screenshot: "add some friends"]

:bulb: Tip: Throughout this course, I, the Learning Lab Bot, will reply and direct you to the next step each time you complete an activity. But sometimes I'm too fast when I :hourglass_flowing_sand: give you a reply, and occasionally you'll need to refresh the current GitHub page to see it. Please be patient, and let my humans know (jread-usgs or aappling-usgs) if I seem to have become completely stuck.

I'll sit patiently until you've closed the issue.

When you are done poking around, check out the next issue.