Closed github-learning-lab[bot] closed 3 years ago
Before you edit any code, create a local branch called "combiners" and push that branch up to the remote location "origin" (which is the github host of your repository).
git checkout main
git pull origin main
git checkout -b combiners
git push -u origin combiners
The first two lines aren't strictly necessary when you don't have any new branches, but it's a good habit to head back to main
and sync with "origin" whenever you're transitioning between branches and/or PRs.
ok
combine_obs_tallies()
[x] Add a new function called combine_obs_tallies somewhere in 2_process/src/tally_site_obs.R. The function declaration should be function(...); when the function is actually called, you can anticipate that the arguments will be a bunch of tallies tibbles (tidyverse data frames). Your function should return the concatenation of these tibbles into one very tall tibble.
[x] Test your combine_obs_tallies() function. Run
source('2_process/src/tally_site_obs.R') # load `combine_obs_tallies()`
tar_load(tally_WI)
tar_load(tally_MN)
tar_load(tally_IL)
combine_obs_tallies(tally_WI, tally_MN, tally_IL)
The result should be a tibble with four columns and as many rows as the sum of the number of rows in tally_WI, tally_MN, and tally_IL. If you don't have it right yet, keep fiddling and/or ask for help.
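If you get stuck, here is a minimal sketch of one way the function could look. It assumes dplyr is available and uses bind_rows() as the concatenation step; the toy tibbles (toy_WI, toy_MN) are made up for illustration and only mimic the four-column structure of the real tallies.

```r
library(dplyr)

# One possible implementation: bind_rows() stacks any number of data frames
combine_obs_tallies <- function(...) {
  bind_rows(...)
}

# Made-up tallies (hypothetical data) with the same four columns as the real ones
toy_WI <- tibble(Site = "00000001", State = "WI", Year = c(2000, 2001), NumObs = c(366L, 365L))
toy_MN <- tibble(Site = "00000002", State = "MN", Year = 2000, NumObs = 366L)

combined <- combine_obs_tallies(toy_WI, toy_MN)

# The row count should equal the sum of the input row counts
nrow(combined) == nrow(toy_WI) + nrow(toy_MN)
```

Because the declaration is function(...), the same function works for any number of states without edits.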
combine_obs_tallies()
[ ] Move your static branching setup outside of your targets list and save it above the list as an object called mapped_by_state_targets. It should look something like
mapped_by_state_targets <- tar_map(...)
list(
  tar_target(oldest_active_sites, ...),
  tar_target(site_map_png, ...)
)
[ ] Now add mapped_by_state_targets as a target between oldest_active_sites and site_map_png in your list of targets.
[ ] Add unlist=FALSE to your tar_map() call, so that we can reference only the branch targets from the tally step in tar_combine().
[ ] Add a new target between mapped_by_state_targets and the site_map_png target called obs_tallies. Instead of tar_target(), this will use tar_combine().
[ ] Populate your tar_combine() call with input for just the tally branches by subsetting the tar_map() output object, and the appropriate call to combine_obs_tallies() for the command (remember you will need !!!.x).
Run tar_make() and then tar_load(obs_tallies). Inspect the value of obs_tallies. Is it what you expected?
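If the !!!.x syntax still feels opaque, the sketch below illustrates the idea outside of any pipeline. It uses rlang directly, which is an assumption made for illustration: the symbols stand in for the branch targets that tar_combine() supplies as .x, and this is not the code tar_combine() itself runs.

```r
library(rlang)

# Symbols standing in for the branch targets that tar_combine() supplies as .x
branches <- syms(c("tally_WI", "tally_MN", "tally_IL"))

# !!! splices the list into the call, one argument per element
combined_call <- expr(combine_obs_tallies(!!!branches))
print(combined_call)
#> combine_obs_tallies(tally_WI, tally_MN, tally_IL)
```

The spliced call has one argument per branch, which is why the list of inputs follows the states vector without any hand editing.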
When you're feeling confident, add a comment to this issue with your answer to the question above.
whew - that was a tough one!
_Inspect the value of obs_tallies. Is it what you expected?_
Here's what my obs_tallies looks like. Your number of rows might vary slightly if you build this at a time when the available data have changed substantially, but the column structure and approximate number of rows ought to be about the same. If it looks like this, then it meets my expectations and hopefully also yours.
> obs_tallies
# A tibble: 738 x 4
# Groups:   Site, State [6]
   Site     State  Year NumObs
   <chr>    <chr> <dbl>  <int>
 1 04073500 WI     1898    365
 2 04073500 WI     1899    365
 3 04073500 WI     1900    365
 4 04073500 WI     1901    365
 5 04073500 WI     1902    365
 6 04073500 WI     1903    365
 7 04073500 WI     1904    366
 8 04073500 WI     1905    365
 9 04073500 WI     1906    365
10 04073500 WI     1907    365
# … with 728 more rows
Comment on this issue when you're ready to proceed.
looks ok
It's time to reap the rewards from your first combiner.
[ ] Create a new target in _targets.R that takes advantage of your new combined tallies. Use the plot_data_coverage() function already defined for you (find it by searching or browsing the repository; remember Ctrl-Shift-F), and pass in obs_tallies as the oldest_site_tallies argument. Set up your target to create a file named "3_visualize/out/data_coverage.png" and name the target appropriately. Remember to add a source() call to load the file with the new function near the top of _targets.R. Add this to your list() of targets after obs_tallies but before site_map_png, so that it is connected to the main pipeline.
[ ] Test your new target by running tar_make(), then checking out 3_visualize/out/data_coverage.png.
[ ] Test your new pipeline by removing a state from states and running tar_make() once more. Did 3_visualize/out/data_coverage.png get revised? If not, see if you can figure out how to make it so. Ask for help if you need it.
When you've got it, share the image in 3_visualize/out/data_coverage.png as a comment.
Great, you have a combiner hooked up from start to finish, and you probably learned some things along the way! It's time to add a second combiner that serves a different purpose - here, rather than produce a target that contains the data of interest, we'll produce a combiner target that summarizes the outputs of interest (in this case the state-specific .png files we've already created).
While this isn't necessary for the pipeline to operate, summarizing file output in large pipelines can be advantageous in some circumstances, mainly when we want to version control information about which parts of the pipeline were updated, for ourselves or for collaborators. We can't check in R object targets to GitHub and we usually avoid checking in data files (e.g. PNGs, CSVs, etc) to GitHub because of the file sizes. So, instead, we can combine some metadata about the file targets generated in the pipeline into a small text file, save it in a log/ folder, and commit that to GitHub. Then, any future runs of the pipeline that change any of the metadata we include in the summary file would be tracked as a change to that file.
The first step is to write a custom function to take a number of target names and generate a summary file using output from tar_meta(). We will refer to this file as an indicator file, where each line records the hash of one target. We will save it as a CSV so that individual lines of the CSV can be tracked as changed or not. See below for a function that does exactly this!
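summarize_targets <- function(ind_file, ...) {
  # Look up the stored metadata for the requested targets
  ind_tbl <- tar_meta(c(...)) %>%
    select(tar_name = name, filepath = path, hash = data) %>%
    mutate(filepath = unlist(filepath))
  # Write one row per target, then return the path so the target can use format = "file"
  readr::write_csv(ind_tbl, ind_file)
  return(ind_file)
}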
[ ] Inspect the code within summarize_targets()
[ ] Run the code to create summarize_targets() as a function in your local environment.
[ ] Test it out with a command such as
summarize_targets('test.csv', site_map_png, oldest_active_sites)
Check out the contents of test.csv. Then when you're feeling clear on what happened, delete test.csv and clear your R Global Environment.
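To see why one hash per line is handy for version control, here is a small stand-alone sketch. It is an illustration under assumptions, not part of the pipeline: it uses tools::md5sum() as a stand-in for the hashes that targets stores, and hash_files() plus the temporary files are hypothetical.

```r
# Write two small files into a temporary directory
out_dir <- tempfile("indicator_demo_")
dir.create(out_dir)
files <- file.path(out_dir, c("a.txt", "b.txt"))
writeLines("first version", files[1])
writeLines("unchanged", files[2])

# Hypothetical helper that records one hash per file, like rows of an indicator file
hash_files <- function(paths) {
  data.frame(
    filepath = basename(paths),
    hash = unname(tools::md5sum(paths)),
    stringsAsFactors = FALSE
  )
}

before <- hash_files(files)
writeLines("second version", files[1])  # edit only the first file
after <- hash_files(files)

# Only the row for the edited file differs
before$hash != after$hash
#> [1]  TRUE FALSE
```

If each row were a line in a committed CSV, a git diff would show exactly one changed line, the one for the file that was rebuilt.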
summarize_targets()
[ ] Copy/paste the summarize_targets() function to its own R script called 2_process/src/summarize_targets.R.
[ ] Add this new file to the pipeline by including a call to source() near the top of _targets.R.
[ ] Add another target after obs_tallies to build this second combiner. The new line should be:
tar_combine(
  summary_state_timeseries_csv,
  mapped_by_state_targets,
  command = summarize_targets('3_visualize/log/summary_state_timeseries.csv', !!!.x),
  format = "file"
)
Note the use of the log/ directory. The template repo had already set up any src/ and out/ folders for you, but 3_visualize/log/ does not exist yet. Before you can build this target, you will need to create this directory. Otherwise, the pipeline will throw an error.
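One way to create the folder from the R console is sketched below. It assumes your working directory is the top level of the repository; creating the folder through your file browser works just as well.

```r
# Create 3_visualize/log/ if it is not there yet (path is relative to the working directory)
log_dir <- "3_visualize/log"
if (!dir.exists(log_dir)) {
  dir.create(log_dir, recursive = TRUE)
}
```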
[ ] Run tar_make(). Inspect '3_visualize/log/summary_state_timeseries.csv'. Is that what you expect?
summary_state_timeseries_csv
Hmm, you probably just discovered that 3_visualize/log/summary_state_timeseries.csv used summarize_targets() for the download, tally, AND plot steps of the static branching. We could do that, but what we really wanted to know was the metadata status for the plot file outputs only.
[ ] Adjust the input to tar_combine() for summary_state_timeseries_csv so that ONLY the timeseries plot step of mapped_by_state_targets is being passed into the combiner function.
[ ] Now run tar_make()
again, and check out 3_visualize/log/summary_state_timeseries.csv once more. Do you only have the PNG files showing up now?
When you're feeling confident, add a comment to this issue with the contents of 3_visualize/out/data_coverage.png, 3_visualize/log/summary_state_timeseries.csv, and the figure generated by tar_visnetwork().
You're down to the last task for this issue! I hope you'll find this one rewarding. After all your hard work, you're now in a position to create a leaflet map that will give you interactive access to the locations, identities, and timeseries plots of the Upper Midwest's oldest gages, all in one .html map. Ready?
[ ] Add another target to _targets.R that uses the function map_timeseries() (defined for you in 3_visualize). site_info should be oldest_active_sites, plot_info should be summary_state_timeseries_csv, and the output should be written to 3_visualize/out/timeseries_map.html. Name this target appropriately and put it as the final target in your list.
[ ] Add the three packages that map_timeseries() requires to the declaration in tar_option_set() at the top of _targets.R: leaflet, leafpop, and htmlwidgets.
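As a rough sketch, the updated declaration might look like the following. The "tidyverse" entry is an assumption standing in for whatever packages your _targets.R already lists; keep your existing entries and append the three new ones.

```r
library(targets)

tar_option_set(packages = c(
  "tidyverse",    # assumed to be listed already; keep whatever your file has
  "leaflet",      # the three packages needed by map_timeseries()
  "leafpop",
  "htmlwidgets"
))
```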
[ ] Run tar_make(). Any surprises?
[ ] Check out the results of your new map by opening 3_visualize/out/timeseries_map.html in the browser. You should be able to hover and click on each marker.
[ ] Add or subtract a state from the states vector and rerun tar_make(). Did you see the rebuilds and non-rebuilds that you expected? Did the html file change as expected?
It's finally time to submit your work.
[ ] Commit your code changes for this issue and make sure your .gitignore excludes the new analysis products (the .png and .html files), but include your new file in the log/ directory. Push your changes to the GitHub repo.
[ ] Create a PR to merge the "combiners" branch into "main". Share a screenshot of 3_visualize/out/timeseries_map.html and any thoughts you want to share in the PR description.
realized my previous comment had an error in it because the summary_state_timeseries paths were NA. Have fixed in the code so everything works now.
So far we've implemented split and apply operations; now it's time to explore combine operations in targets pipelines.
In this issue you'll add two combiners to serve different purposes - the first will combine all of the annual observation tallies into one giant table, and the second will summarize the set of state-specific timeseries plots generated by the task table.
Background
Approach
Given your current level of knowledge, if you were asked to add a target combining the tally outputs you would likely add a call to tar_target and use the branches as input to a command that aggregated the data. While this would certainly work, the number of inputs to a combiner should change if the number of tasks changes. If we hand-coded a combiner target with tar_target that accepted a set of inputs (e.g., tar_target(combined_tallies, combine_tallies(tally_WI, tally_MI, [etc]))), we'd need to manually edit the inputs to that function anytime we changed the states vector. That would be a pain and would make our pipeline susceptible to human error if we forgot or made a mistake in that editing.
Implementation
The targets way to use combiners for static branching is to work with the tar_combine() function (recall that combiners are automatically applied to the output in dynamic branching). tar_combine() is set up in a similar way to tar_target(), where you supply the target name and a function call as the command. The difference is that the input to the command will be multiple targets passed in to the ... argument. The output from a tar_combine() can be an R object or a file, but file targets need to have format = "file" passed in to tar_combine() and the function used as command must return the filepath.
Some additional implementation considerations:
In order to use tar_combine() with the output from tar_map(), you will need to save the output of tar_map() as an object. Thus, the branching declaration should look something like mapped_output <- tar_map() so that mapped_output can be used in your tar_combine() call.
You can write your own combiner function or you can use built-in combiner functions for common types of combining (such as bind_rows(), c(), etc). If you write your own combiner function, it needs to be in a script sourced in the makefile (_targets.R) using source(). The default combiner is ?vctrs::vec_c, which is a fancy version of c() that ensures the resulting vector keeps the common data type (e.g. factors remain factors).
When you pass the output of tar_map() to tar_combine(), all branch output from tar_map() will be used by default. If you had multiple steps in your tar_map() (i.e. multiple calls to tar_target()), and you only want to combine results from one of those, you can add unlist = FALSE to your tar_map() call so that the tar_map() output remains in a nested list. This makes it possible to reference just the output from each tar_target() and use it in tar_combine(). For example, if you had three steps in your tar_map() call and you wanted to combine only those branches from the third step that had a target name of sum_results, you could use mapped_output[[3]] or mapped_output$sum_results as the input to tar_combine().
Within your tar_combine() function, pass the ... to your command function by specifying !!!.x in its place. It feels strange, but has to do with how the function handles non-standard evaluation. You can see an example of using this syntax when you look at the default for command in the help file for ?tarchetypes::tar_combine().
When specifying the command argument to tar_combine(), you need to include the argument name, e.g. command = my_function(). Since tar_combine() has ... as its second argument, anything else you pass in without the argument name will be considered part of .... It can result in some weird errors.
Don't worry if not all of this clicked yet. We are about to see it all in action!
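To make the considerations above concrete, here is a minimal sketch that declares a three-step tar_map() and combines only its third step. Everything named here other than tar_map(), tar_target(), and tar_combine() is hypothetical: the step names raw_data and clean_data, the functions fetch_data(), clean(), and summarize_results(), and the combined target name all_results are placeholders. The commands are never evaluated in this snippet because it only declares targets.

```r
library(targets)
library(tarchetypes)

# Hypothetical three-step mapping over two values; commands are placeholders
mapped_output <- tar_map(
  values = data.frame(state = c("WI", "MN")),
  unlist = FALSE,  # keep the output nested by step so one step can be picked out
  tar_target(raw_data, fetch_data(state)),
  tar_target(clean_data, clean(raw_data)),
  tar_target(sum_results, summarize_results(clean_data))
)

# Combine only the branches of the third step; note the named command argument
combined <- tar_combine(
  all_results,
  mapped_output$sum_results,
  command = dplyr::bind_rows(!!!.x)
)

# One branch per value in the mapping
length(mapped_output$sum_results)
```

In a real _targets.R you would then place mapped_output and combined inside the list() of targets that ends the file.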