padilla410 / ds-pipelines-2

https://lab.github.com/USGS-R/scipiper-tips-and-tricks

Strategies for defining targets in data pipelines #3

Closed github-learning-lab[bot] closed 2 years ago

github-learning-lab[bot] commented 2 years ago

How to make decisions on how many targets to use and how targets are defined

background

Isn't it satisfying to work through a fairly lengthy data workflow, return to the project later, and find that it just works? For the past few years, we have been capturing the steps that go into creating the results, figures, or tables that appear in data visualizations or research papers. These recipes for reproducibility have been used to create complex, interactive data visualizations, such as this water use data viz.


Here is a much simpler example that was used to generate Figure 1 from Water quality data for national‐scale aquatic research: The Water Quality Portal (published in 2017):


packages:
  - rgeos
  - dplyr
  - rgdal
  - httr
  - yaml
  - RColorBrewer
  - dataRetrieval
  - lubridate
  - maptools
  - rgeos
  - maps
  - sp

## All R files that are used must be listed here:
sources:
  - R/wqp_mapping_functions.R
  - R/readWQPdataPaged.R

targets:
  all:
    depends: 
      - figures/multi_panel_constituents.png

  map.config:
    command: yaml.load_file("configs/mapping.yml")

  wqp.config:
    command: yaml.load_file("configs/wqp_params.yml")

  huc.map:
    command: get_mutate_HUC8s(map.config)

  phosphorus_lakes:
    command: get_wqp_data(target_name, wqp.config, map.config)

  phosphorus_all:
    command: get_wqp_data(target_name, wqp.config, map.config)

  nitrogen_lakes:
    command: get_wqp_data(target_name, wqp.config, map.config)

  nitrogen_all:
    command: get_wqp_data(target_name, wqp.config, map.config)

  arsenic_lakes:
    command: get_wqp_data(target_name, wqp.config, map.config)

  arsenic_all:
    command: get_wqp_data(target_name, wqp.config, map.config)

  chlorophyll_lakes:
    command: get_wqp_data(target_name, wqp.config, map.config)

  chlorophyll_all:
    command: get_wqp_data(target_name, wqp.config, map.config)

  temperature_lakes:
    command: get_wqp_data(target_name, wqp.config, map.config)

  temperature_all:
    command: get_wqp_data(target_name, wqp.config, map.config)

  doc_lakes:
    command: get_wqp_data(target_name, wqp.config, map.config)

  doc_all:
    command: get_wqp_data(target_name, wqp.config, map.config)

  secchi_all:
    command: get_wqp_data(target_name, wqp.config, map.config)

  secchi_lakes:
    command: get_wqp_data(target_name, wqp.config, map.config)

  glyphosate_all:
    command: get_wqp_data(target_name, wqp.config, map.config)

  figures/multi_panel_constituents.png:
    command: plot_huc_panel(huc.map, map.config, target_name, arsenic_lakes, 
      arsenic_all, nitrogen_lakes, nitrogen_all, phosphorus_lakes, phosphorus_all, 
      secchi_lakes, secchi_all, temperature_lakes, temperature_all)
    plot: true

This remakefile recipe generates a multipanel map that colors HUC8 watersheds according to how many sites within each watershed have data for various water quality constituents (image: multi_panel_constituents).


The "figures/multi_panel_constituents.png" figure takes a while to plot (about 3 minutes for me), so it is a somewhat "expensive" target to iterate on when adjusting style, size, colors, and layout. But the plotting expense is dwarfed by the time it takes to build each water quality data "object target": get_wqp_data uses a web service that queries a large database and returns a result, and fetching the data can sometimes take over thirty minutes (nitrogen_all, for example, is a target that contains the locations of all of the sites that have nitrogen water quality data samples).

In contrast, the map.config* object above builds in a fraction of a second. It contains some simple information that is used to fetch and process the proper boundaries with the get_mutate_HUC8s function, and it includes some plotting details for the final map (such as plotting color divisions as specified by countBins):

(image: map.config build)

This example, although dated, represents a real project that caused us to think carefully about how many targets we use in a recipe and how complex their underlying functions are. Decisions related to targets are often motivated by the intent of the pipeline. In the case above, our intent at the time was to capture the data and processing behind the plot in the paper in order to satisfy our desire for reproducibility.


*Disclaimer: the code above was written before we'd completely transitioned away from naming variables like.this.

:keyboard: Activity: Assign yourself to this issue to get started.


I'll sit patiently until you've assigned yourself to this one.

github-learning-lab[bot] commented 2 years ago

In general, if building part of a pipeline is "expensive" (i.e., takes more than a trivial amount of time for a computer to execute), it should be a separate target. In the example above :point_up:, expensive sections included fetching data and plotting.
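As a rough sketch of this guideline, the fragment below contrasts a single combined target with a split version. It reuses get_wqp_data, huc.map, wqp.config, and map.config from the recipe above, but fetch_and_plot, plot_single_panel, and figures/nitrogen_map.png are hypothetical names invented for illustration and are not part of the original project.

```yaml
targets:
  # Combined version (hypothetical, commented out): fetching and plotting live
  # in one function, so any styling tweak also re-runs the slow data fetch.
  # figures/nitrogen_map.png:
  #   command: fetch_and_plot(target_name, wqp.config, map.config)

  # Split version: the slow web-service fetch is its own object target and is
  # rebuilt only when its command, its function, or its dependencies change.
  nitrogen_all:
    command: get_wqp_data(target_name, wqp.config, map.config)

  # Editing the (hypothetical) plotting function rebuilds only this target;
  # the already-built nitrogen_all object is reused.
  figures/nitrogen_map.png:
    command: plot_single_panel(huc.map, map.config, target_name, nitrogen_all)
    plot: true
```

With the split version, iterating on the look of the figure should cost only the plotting time, since the fetched data stays up to date between builds.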

Additional reasons to create a target include:


But of course there is a cost to creating many targets: you'll end up typing a lot more, many additional files will be created that need to be stored, and each additional target makes it harder to navigate the remakefile.


Close this issue when you are ready to move on to the next activity.

github-learning-lab[bot] commented 2 years ago


When you are done poking around, check out the next issue.