hcorson-dosch-usgs / ds-pipelines-targets-1

https://lab.github.com/USGS-R/intro-to-targets-pipelines

Why use a dependency manager? #5

Closed: github-learning-lab[bot] closed this issue 3 years ago

github-learning-lab[bot] commented 3 years ago

We're asking everyone to invest in the concepts of reproducibility and efficiency of reproducibility, both of which are enabled via dependency management systems such as remake, scipiper, drake, and targets.

Background

We hope that the case for reproducibility is clear - we work for a science agency, and science that can't be reproduced does little to advance knowledge or trust.

But, the investment in efficiency of reproducibility is harder to boil down into a zingy one-liner. Many of us have embraced this need because we have been bitten by issues in our real-world collaborations, and found that data science practices and a reproducibility culture offer great solutions. Karl Broman is an advocate for reproducibility in science and is faculty at UW Madison. He has given many talks on the subject and we're going to ask you to watch part of one of them so you can be exposed to some of Karl's science challenges and solutions. Karl will be talking about GNU make, which is the inspiration for almost every modern dependency tool that we can think of. Click on the image to kick off the video.

[Video: Reproducible workflows with make]

:computer: Activity: Watch the above video on make and reproducible workflows up until the 11-minute mark (you are welcome to watch more)

Use a GitHub comment on this issue to let us know what you thought was interesting about these pipeline concepts using no more than 300 words.


I'll respond once I spot your comment (refresh if you don't hear from me right away).

hcorson-dosch-usgs commented 3 years ago

So last time I watched this, this was my response:

I was familiar with and had worked to achieve the 1st step (Organize Your Data and Code). The 2nd step (Everything with a Script) is something I think I've made a lot of progress on in the last year of learning better coding practices, but his point about scripting even the smallest actions -- ones that could be done manually -- is a really good one, and I'd like to implement that more. I've definitely not done the 3rd step (Automate the Process), except in ArcGIS, where you can chain analyses using Model Builder (though that's not quite the same as automating scripts).

I think my additional takeaway this time is that both scipiper and targets seem like a much better system than the one he presents. I'm excited to learn more about targets. From my initial exposure to it for reservoir modeling, I've found it more intuitive.

Also gets me thinking about how we can bring more reproducibility to our dataviz work, but that's a convo for another time...

github-learning-lab[bot] commented 3 years ago

Great comments @hcorson-dosch! :sparkles:

You could consider GNU make to be a great-grandparent of the packages we referred to earlier in this lesson (remake, scipiper, drake, and targets). Will Landau, the lead developer of targets, has added a lot of useful features to dependency management systems in R, and has a great way of summarizing why we put energy into using these tools: "Skip the work you don't need."
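
To see what "skip the work you don't need" looks like in practice, here is a minimal sketch of a targets pipeline. The file path and both functions are invented for illustration; they aren't part of this lesson's repository.

```r
# _targets.R -- a minimal, hypothetical pipeline sketch
library(targets)

# hypothetical helpers, defined inline so the example is self-contained
clean_data <- function(path) read.csv(path)
fit_model <- function(dat) lm(y ~ x, data = dat)

list(
  # re-runs only if the file on disk changes
  tar_target(raw_file, "data/site_data.csv", format = "file"),
  # re-runs only if raw_file or clean_data() changes
  tar_target(clean, clean_data(raw_file)),
  # re-runs only if clean or fit_model() changes
  tar_target(model, fit_model(clean))
)
```

Running `tar_make()` builds everything once; running it again immediately skips every target, because nothing upstream has changed. And `tar_read(model)` loads the stored result back as an R object, with no manual file I/O.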

Next, we'd like you to check out a short part of Will's video on targets.

[Video: Reproducible workflows with R targets]

:tv: Activity: Watch the video on targets from at least 7:20 to 11:05 (you are welcome to watch the full main talk from 7:20 to 7:40)

Use a GitHub comment on this issue to let us know what contrasts you identified between the solutions in make and what is offered in R-specific tools like targets. Please use fewer than 300 words. Then assign this issue to your onboarding cohort team member so they can read what you wrote and respond with any questions or comments.


When you are satisfied with the discussion, you can close this issue and I'll direct you to the next one.

hcorson-dosch-usgs commented 3 years ago

So last time I said:

The make approach (which, based on the first video, seems to be rooted in the command line) doesn't seem ideal for a project where the entirety (or majority) of the code base is in R. It is less than ideal to have to leave R to automate your workflow, and I believe it is harder to document actions taken on the command line. Also, the R-specific pipeline tools allow output from scripts to be stored and used as R objects, which would facilitate coding in R and increase the efficiency of the pipeline (I think) by eliminating the need to write all output to files and read it back in. From the talk, it sounds like the language-agnostic tools also maybe don't support functions? Perhaps because not all languages do? R-based pipeline tools do support functions, which means you can break the necessary tasks into clean units that are easily reused in different places within a project, or by different projects. Finally, it sounds like the R-specific pipeline tools, unlike make, store datasets as dataframes that can be easily manipulated using R packages.

Jordan clarified in his comments that make does track functions, but only in the sense that it tracks the file in which those functions are defined. It doesn't go to the level that drake, scipiper, and targets do of actually assessing whether the operations of a function have changed.

I think an added nuance that targets brings to the table is that it abstracts files as object targets. It also offers many different ways to map over targets and accomplish repetitive tasks, like those we'd use task tables for in scipiper (a rough sketch of that idea is below). I got some exposure to this in the reservoir pipeline, but I don't have a full understanding of all the ways to pipeline with targets, so I'm excited to learn more in these courses!
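
As a rough illustration of that mapping idea, here is a sketch of dynamic branching in targets, which plays roughly the role a task table does in scipiper. All names here are hypothetical, not from the reservoir pipeline.

```r
library(targets)

# hypothetical stand-in for a real modeling function
fit_site_model <- function(site_id) paste("model for", site_id)

list(
  tar_target(site_ids, c("site_01", "site_02", "site_03")),
  tar_target(
    site_model,
    fit_site_model(site_ids),
    pattern = map(site_ids)  # one branch of site_model per site id
  )
)
```

If one site id changes, only that branch of `site_model` rebuilds; the other branches are skipped.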

lindsayplatt commented 3 years ago

Right! targets is able to go beyond seeing file changes and actually understand the type of change (for R files). Adding a # comment like this won't actually trigger a rebuild 🎊
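
A quick sketch of that behavior: as I understand it, targets hashes a deparsed copy of each tracked function, and deparsing drops comments, so comment-only edits leave the hash unchanged. The function name here is hypothetical.

```r
# munge_data() stands in for any function a pipeline tracks
munge_data <- function(x) {
  # adding or rewording this comment leaves the deparsed function
  # unchanged, so targets that call munge_data() stay up to date
  x * 2
}

# changing the body itself, e.g. to x * 3, would change the hash
# and invalidate every downstream target that uses munge_data()
```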

github-learning-lab[bot] commented 3 years ago


When you are done poking around, check out the next issue.