sleung-usgs / ds-pipelines-targets-1

https://lab.github.com/USGS-R/intro-to-targets-pipelines

Why use a dependency manager? #5

Closed by github-learning-lab[bot] 2 years ago

github-learning-lab[bot] commented 2 years ago

We're asking everyone to invest in the concepts of reproducibility and efficiency of reproducibility, both of which are enabled via dependency management systems such as remake, scipiper, drake, and targets.

Background

We hope that the case for reproducibility is clear - we work for a science agency, and science that can't be reproduced does little to advance knowledge or trust.

But, the investment in efficiency of reproducibility is harder to boil down into a zingy one-liner. Many of us have embraced this need because we have been bitten by issues in our real-world collaborations, and found that data science practices and a reproducibility culture offer great solutions. Karl Broman is an advocate for reproducibility in science and is faculty at UW Madison. He has given many talks on the subject and we're going to ask you to watch part of one of them so you can be exposed to some of Karl's science challenges and solutions. Karl will be talking about GNU make, which is the inspiration for almost every modern dependency tool that we can think of. Click on the image to kick off the video.

[Video: reproducible workflows with make]

:computer: Activity: Watch the above video on make and reproducible workflows up to the 11-minute mark (you are welcome to watch more)

Use a GitHub comment on this issue to let us know what you thought was interesting about these pipeline concepts using no more than 300 words.


I'll respond once I spot your comment (refresh if you don't hear from me right away).

sleung-usgs commented 2 years ago

I agree with all of these steps, but I do wonder how exploration and initial quick analyses, done before you really know what you're trying to do, fit into this. You don't want to immediately automate or turn everything into functions when you're just starting out, so I think there's also a balance.

github-learning-lab[bot] commented 2 years ago

Great comments @sleung-usgs! :sparkles:

You could consider GNU make a great-grandparent of the packages we referred to earlier in this lesson (remake, scipiper, drake, and targets). Will Landau, the lead developer of targets, has added a lot of useful features to dependency management systems in R, and has a great way of summarizing why we put energy into using these tools: "Skip the work you don't need."

Next, we'd like you to check out a short part of Will's video on targets.

[Video: reproducible workflows with R targets]

:tv: Activity: Watch the video on targets from at least 7:20 to 11:05 (you are welcome to watch the full talk if you'd like)

Use a GitHub comment on this issue to let us know what contrasts you identified between solutions in make and what R-specific tools like targets offer. Please use fewer than 300 words. Then assign this issue to your onboarding cohort team member so they can read what you wrote and respond with any questions or comments.


When you are satisfied with the discussion, you can close this issue and I'll direct you to the next one.

jesse-ross commented 2 years ago

> I agree with all of these steps, but I do wonder how exploration and initial quick analyses, done before you really know what you're trying to do, fit into this. You don't want to immediately automate or turn everything into functions when you're just starting out, so I think there's also a balance.

I couldn't agree more. I think the reason we new hires have these trainings is not to make everybody make everything reproducible all of the time - that would be an unnecessary drag! But by going through this, we develop some shared instincts for what kinds of things are easy to make reproducible, as well as a few details of the toolkit. That underlying knowledge can guide our casual explorations and make it easier to build in more solidity if and when those explorations start to prove fruitful.

sleung-usgs commented 2 years ago

In R-specific tools like targets, files are abstracted as R objects and data is automatically managed. targets also supports a modular and function-oriented programming style. The framework and philosophy are the same as GNU make, however. (I honestly don't know what any of that means yet, but I'm sure I'll learn in a moment.)

jesse-ross commented 2 years ago

😁 Yes. If you haven't used make much, you're perfectly normal. I'm in the same boat. make is really a tool for C programmers: it is meant to keep track of dependencies between different phases of complicated software compilations, and some primeval data scientists hacked it into a tool for managing data workflows. The core problem - certain parts of the work depending on prior parts having been completed - is common to both compiling software projects and running complex data pipelines.

The beautiful thing about make for data scientists is that it can track which parts of a pipeline depend on which others, and rebuild only the necessary parts. It's a time saver. This is especially true with data work, where processing can take a long time, and once it's done you'd really like it to stay done.
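To make that concrete, here's a minimal Makefile sketch for a two-step pipeline (the file and script names are invented for illustration). make compares file timestamps: if only `plot.R` changes, the cleaning step is skipped.

```make
# clean.csv is rebuilt only when raw.csv or clean.R changes
# (note: make requires recipe lines to be indented with a tab)
clean.csv: raw.csv clean.R
	Rscript clean.R raw.csv clean.csv

# plot.png depends on the cleaned data; up-to-date steps are not rerun
plot.png: clean.csv plot.R
	Rscript plot.R clean.csv plot.png
```

Running `make plot.png` walks this dependency graph and executes only the recipes whose inputs are newer than their outputs.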

However, the beauty of targets is that it's integrated very tightly with R. Whereas make can only tell that something has changed by checking whether a file changed at all, targets can parse your R code and ignore differences that don't change its behavior (e.g., comments, whitespace). Additionally, it can readily store intermediate data products in R-native formats, without your having to write a bunch of connector code to read and write files.
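As a sketch of what that looks like in practice (file names, column names, and the read/model steps here are invented), a minimal `_targets.R` might be:

```r
# _targets.R: a minimal, hypothetical targets pipeline
library(targets)

tar_option_set(packages = "readr")

list(
  # track the input file itself; downstream targets rebuild only if it changes
  tar_target(raw_file, "data/raw.csv", format = "file"),
  # intermediate results are stored as R objects automatically -
  # no manual read/write connector code between steps
  tar_target(raw_data, readr::read_csv(raw_file)),
  tar_target(model, lm(y ~ x, data = raw_data))
)
```

Calling `tar_make()` builds the pipeline; because targets hashes the code behind each target, reformatting a function or editing a comment doesn't trigger a rebuild the way touching a file does under make.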

sleung-usgs commented 2 years ago

Thank you, Jesse!! That honestly makes things a lot clearer. The point about how complex software projects and data science pipelines run into similar dependency problems helped me better understand the connection between GNU make and targets.

github-learning-lab[bot] commented 2 years ago


When you are done poking around, check out the next issue.