Strategies for defining targets in data pipelines

How to make decisions on how many targets to use and how targets are defined

Background

Isn't it satisfying to work through a fairly lengthy data workflow and then return to the project and it just works? For the past few years, we have been capturing the steps that go into creating results, figures, or tables appearing in data visualizations or research papers. There are recipes for reproducibility used to create complex, interactive data visualizations, such as

Here is a much simpler example that was used to generate Figure 1 from Water quality data for national‐scale aquatic research: The Water Quality Portal (published in 2017):


packages:
  - rgeos
  - dplyr
  - rgdal
  - httr
  - yaml
  - RColorBrewer
  - dataRetrieval
  - lubridate
  - maptools
  - rgeos
  - maps
  - sp

## All R files that are used must be listed here:
sources:
  - R/wqp_mapping_functions.R
  - R/readWQPdataPaged.R

targets:
  all:
    depends: 
      - figures/multi_panel_constituents.png

  map.config:
    command: yaml.load_file("configs/mapping.yml")

  wqp.config:
    command: yaml.load_file("configs/wqp_params.yml")

  huc.map:
    command: get_mutate_HUC8s(map.config)

  phosphorus_lakes:
    command: get_wqp_data(target_name, wqp.config, map.config)

  phosphorus_all:
    command: get_wqp_data(target_name, wqp.config, map.config)

  nitrogen_lakes:
    command: get_wqp_data(target_name, wqp.config, map.config)

  nitrogen_all:
    command: get_wqp_data(target_name, wqp.config, map.config)

  arsenic_lakes:
    command: get_wqp_data(target_name, wqp.config, map.config)

  arsenic_all:
    command: get_wqp_data(target_name, wqp.config, map.config)

  chlorophyll_lakes:
    command: get_wqp_data(target_name, wqp.config, map.config)

  chlorophyll_all:
    command: get_wqp_data(target_name, wqp.config, map.config)

  temperature_lakes:
    command: get_wqp_data(target_name, wqp.config, map.config)

  temperature_all:
    command: get_wqp_data(target_name, wqp.config, map.config)

  doc_lakes:
    command: get_wqp_data(target_name, wqp.config, map.config)

  doc_all:
    command: get_wqp_data(target_name, wqp.config, map.config)

  secchi_all:
    command: get_wqp_data(target_name, wqp.config, map.config)

  secchi_lakes:
    command: get_wqp_data(target_name, wqp.config, map.config)

  glyphosate_all:
    command: get_wqp_data(target_name, wqp.config, map.config)

  figures/multi_panel_constituents.png:
    command: plot_huc_panel(huc.map, map.config, target_name, arsenic_lakes, 
      arsenic_all, nitrogen_lakes, nitrogen_all, phosphorus_lakes, phosphorus_all, 
      secchi_lakes, secchi_all, temperature_lakes, temperature_all)
    plot: true

This remakefile recipe generates a multipanel map that colors HUC8 watersheds according to how many sites within each watershed have data for various water quality constituents:

[Figure: multi_panel_constituents]

The "figures/multi_panel_constituents.png" figure takes a while to plot (about 3 minutes for me), so it is a somewhat "expensive" target to iterate on when it comes to style, size, colors, and layout. But the plotting expense is dwarfed by the time it takes to build each water quality data "object target": get_wqp_data uses a web service that queries a large database and returns a result, and fetching the data can sometimes take over thirty minutes (nitrogen_all is a target that contains the locations of all of the sites that have nitrogen water quality data samples).

In contrast, the map.config* object above builds in a fraction of a second. It contains some simple information that is used to fetch and process the proper boundaries with the get_mutate_HUC8s function, and it includes some plotting details for the final map (such as the plotting color divisions specified by countBins):

[Figure: map.config build]

This example, although dated, represents a real project that caused us to think carefully about how many targets we use in a recipe and how complex their underlying functions are. Decisions related to targets are often motivated by the intent of the pipeline. In the case above, our intent at the time was to capture the data and processing behind the plot in the paper in order to satisfy our desire for reproducibility.

*Disclaimer: the code above was written before we'd completely transitioned away from naming variables like.this

:keyboard: Activity: Assign yourself to this issue to get started.

I'll sit patiently until you've assigned yourself to this one.

In general, if building part of a pipeline is "expensive" (i.e., takes more than a trivial amount of time for a computer to execute), it should be a separate target. In the example above :point_up:, expensive sections included fetching data and plotting.
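
As a rough sketch of that guideline, the fragment below contrasts one catch-all target with the same work split into a fetch target and a plot target. It follows the remakefile layout of the example above and assumes wqp.config and map.config are defined as they are there. `fetch_and_plot_nitrogen()`, `plot_nitrogen_map()`, and both `figures/nitrogen_map*.png` paths are hypothetical names invented for illustration; they are not part of the original project.

```yaml
targets:
  # Option A (one catch-all target): fetching and plotting both live inside the
  # hypothetical fetch_and_plot_nitrogen() function, so restyling the plot
  # means repeating the slow data pull.
  figures/nitrogen_map_combined.png:
    command: fetch_and_plot_nitrogen(target_name, wqp.config, map.config)

  # Option B (one target per expensive step): the data pull is built once and
  # stored as an object target ...
  nitrogen_all:
    command: get_wqp_data(target_name, wqp.config, map.config)

  # ... and the plot target depends only on that stored result, so iterating on
  # style, size, or colors reruns just the hypothetical plot_nitrogen_map().
  figures/nitrogen_map.png:
    command: plot_nitrogen_map(target_name, nitrogen_all, map.config)
```

With Option B, a change to the plotting function leaves nitrogen_all untouched, which is the behavior the multi-panel example above relies on.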

Additional reasons to create a target include:

- If some element in the pipeline may fail (such as downloading data from the internet), isolating this brittle part of the project as a target with a corresponding function makes failures faster to recover from. This is because your target focuses on accomplishing only the brittle step, instead of, for example, also attempting to process and plot downloaded data all within the same function.
- Sometimes a target is created in order to make it easier to defer a decision until later. If we have an expensive geoprocessing task but the method for summarizing the results is still in flux, it might make sense to break this function and target into two functions and two targets: the major parts of the geoprocessing step in one function-target pair, and the smaller summary process in the second.
- Targets are easy to inspect and dig into (e.g., my_target <- scmake('my_target'), or reading in a file that was created). If there is an intermediate step in a workflow that will likely need to be examined, it may deserve a target.
- Lastly, if a configuration or value is shared across many other targets, the configuration itself might deserve a standalone target, even if generating that target is computationally cheap. In our water quality data pull example, the wqp.config target is an example of a shared configuration. Within that target, there is (among other things) a string that specifies how lake sites are queried in the web service. If that query parameter changes in the future, making the change to the file behind the wqp.config target would propagate into the necessary updates to the data pulls run with get_wqp_data.
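
The reasons above can be sketched together in one remakefile fragment. This is an illustration under stated assumptions, not part of the water quality project: every target name, file path, and function below (`download_site_data()`, `intersect_sites_with_hucs()`, `summarize_by_huc()`, `configs/query_params.yml`) is hypothetical. Only `yaml.load_file()` and the overall layout are taken from the example above.

```yaml
targets:
  # Shared configuration: cheap to build, but used by more than one target, so
  # it gets a standalone target. Editing the (hypothetical) config file should
  # mark the targets that depend on query_config as out of date.
  query_config:
    command: yaml.load_file("configs/query_params.yml")

  # Brittle step: only the download happens here, so a failed web request can
  # be retried without rerunning any processing or summarizing.
  raw_site_data:
    command: download_site_data(query_config)

  # Expensive geoprocessing step. Because it is its own object target, the
  # intermediate result can be inspected directly once it is built.
  sites_by_huc:
    command: intersect_sites_with_hucs(raw_site_data)

  # Cheap summary step whose method is still in flux: changing the
  # hypothetical summarize_by_huc() rebuilds only this target.
  huc_summary:
    command: summarize_by_huc(sites_by_huc, query_config)
```

In this sketch, the deferred decision lives entirely in huc_summary, so revising the summary method never repeats the download or the geoprocessing.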

But of course there is a cost to creating many targets: you'll end up typing a lot more, many additional files will be created that need to be stored, and each added target makes it harder to navigate the remakefile.
