In general, if building part of a pipeline is "expensive" (i.e., takes more than a trivial amount of time for a computer to execute), it should be a separate target. In the example above :point_up:, expensive sections included fetching data and plotting.
Additional reasons to create a target include:
- Making intermediate results easy to inspect (for example by loading a target with tar_load(my_target), or reading in a file that was created). If there is an intermediate step in a workflow that will likely need to be examined, it may deserve a target.
- Sharing configuration. The wqp.config target is an example of a shared configuration. Within that target, there is (among other things) a string that specifies how lake sites are queried in the web service. If that query parameter changes in the future, making the change to the file behind the wqp.config target would propagate into the necessary updates to the data pulls run with get_wqp_data.

But of course there is a cost to creating many targets: you'll end up typing a lot more, a lot of additional files will be created that need to be stored, and each additional target makes it harder to navigate the makefile.
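To make these trade-offs concrete, here is a minimal sketch of a `_targets.R` file that gives each kind of step its own target: a cheap shared configuration, an expensive fetch, and an expensive plot. It is not the recipe from the paper. The helpers `fetch_sites()` and `plot_sites()`, the target names `wqp_config` and `constituent_png`, the site type string, and the output path are all hypothetical stand-ins, and the fetch returns a small made-up data frame instead of calling a web service.

```r
# _targets.R: an illustrative sketch under the assumptions stated above
library(targets)

# Hypothetical stand-in for a slow web-service pull.
# It returns a tiny placeholder data frame rather than real data.
fetch_sites <- function(config, constituent) {
  data.frame(
    constituent = constituent,
    site_type = config$site_type,
    n_sites = 3
  )
}

# Hypothetical plotting step: writes a PNG and returns its path,
# which is what a format = "file" target needs to return.
plot_sites <- function(sites, out_file) {
  dir.create(dirname(out_file), showWarnings = FALSE, recursive = TRUE)
  png(out_file)
  barplot(sites$n_sites, names.arg = sites$constituent)
  dev.off()
  out_file
}

pipeline <- list(
  # Shared configuration: cheap to build. Changing it marks every
  # downstream target that uses it as outdated.
  tar_target(wqp_config, list(site_type = "Lake, Reservoir, Impoundment")),

  # Expensive fetch: kept separate so it reruns only when wqp_config
  # or fetch_sites() changes.
  tar_target(nitrogen_all, fetch_sites(wqp_config, "nitrogen")),

  # Expensive plot: kept separate so style tweaks in plot_sites()
  # do not trigger another fetch.
  tar_target(
    constituent_png,
    plot_sites(nitrogen_all, "figures/constituents.png"),
    format = "file"
  )
)

# A _targets.R script must end by returning the list of targets.
pipeline
```

With this file in a project directory, `tar_make()` would build the three targets in dependency order, and `tar_load(nitrogen_all)` would bring the intermediate data frame into the session for inspection. Splitting the plot from the fetch costs one extra target to name and store, which is the typing and storage overhead described above.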
How to make decisions on how many targets to use and how targets are defined
We've covered a lot of content about the rules of writing good pipelines, but pipelines are also very flexible! Pipelines can have as many or as few targets as you would like, and targets can be as big or as small as you would like. The key theme for all pipelines is that they are reproducible codebases to document your data analysis process for both humans and machines. In this next section, we will learn about how to make decisions related to the number and types of targets you add to a pipeline.
Background
Isn't it satisfying to work through a fairly lengthy data workflow and then return to the project and it just works? For the past few years, we have been capturing the steps that go into creating results, figures, or tables appearing in data visualizations or research papers. There are recipes for reproducibility used in complex, collaborative modeling projects, such as in this reservoir temperature modeling pipeline and in this pipeline to manage downloads of forecasted meteorological driver data. Note that you need to be able to access internal USGS websites to see these examples and these were developed early on in the Data Science adoption of targets, so may not showcase all of our adopted best practices.

Here is a much simpler example that was used to generate Figure 1 from Water quality data for national‐scale aquatic research: The Water Quality Portal (published in 2017):
This makefile recipe generates a multipanel map, which colors HUC8 watersheds according to how many sites within the watershed have data for various water quality constituents:
The "figures/multi_panel_constituents.png" figure takes a while to plot, so it is a somewhat "expensive" target to iterate on when it comes to style, size, colors, and layout (it takes 3 minutes to plot for me). But the plotting expense is dwarfed by the amount of time it takes to build each water quality data "object target", since get_wqp_data uses a web service that queries a large database and returns a result; the process of fetching the data can sometimes take over thirty minutes (nitrogen_all is a target that contains the locations of all of the sites that have nitrogen water quality data samples).

In contrast, the map_config* object above builds in a fraction of a second. It contains some simple information that is used to fetch and process the proper boundaries with the get_mutate_HUC8s function, and includes some plotting details for the final map (such as plotting color divisions).

This example, although dated, represents a real project that caused us to think carefully about how many targets we use in a recipe and how complex their underlying functions are. Decisions related to targets are often motivated by the intent of the pipeline. In the case above, our intent at the time was to capture the data and processing behind the plot in the paper in order to satisfy our desire for reproducibility.
:keyboard: Activity: Assign yourself to this issue to get started.
I'll sit patiently until you've assigned yourself to this one.