github-learning-lab[bot] opened this issue 3 years ago
Before you edit any code, create a local branch called "scale-up" and push that branch up to the remote location "origin" (which is the GitHub host of your repository).
git checkout main
git pull origin main
git checkout -b scale-up
git push -u origin scale-up
The first two lines aren't strictly necessary when you don't have any new branches, but it's a good habit to head back to main
and sync with "origin" whenever you're transitioning between branches and/or PRs.
okey dokey
[ ] Expand the pipeline to include all of the U.S. states and some territories. Specifically, modify the `states` target in `_targets.R`:
states <- c('AL','AZ','AR','CA','CO','CT','DE','DC','FL','GA','ID','IL','IN','IA',
'KS','KY','LA','ME','MD','MA','MI','MN','MS','MO','MT','NE','NV','NH',
'NJ','NM','NY','NC','ND','OH','OK','OR','PA','RI','SC','SD','TN','TX',
'UT','VT','VA','WA','WV','WI','WY','AK','HI','GU','PR')
[ ] Run `tar_make()` once. Note what gets [re]built.
[ ] Run `tar_make()` again. Note what gets [re]built.
[ ] For fun, here are some optional math questions. Assume that you just added 46 new states, and each state data pull has a 50% chance of failing. How many calls to `tar_make()` should you expect to make before the pipeline is fully built? Comment on what you're seeing.
the internet is failing many times...
Rather than babysitting repeated `tar_make()` calls until all the states build, it's time to adapt our approach to running `tar_make()` when there are steps plagued by network failures. A lot of times, you just need to retry the download/upload again and it will work. This is not always the case, though, and sometimes you need to address the failures. The targets package does not currently offer this fault tolerance, so the approaches discussed here are designed by our group to provide fault tolerance for tasks such as this data pull (including those where the "failures" are all real rather than being largely synthetic as in this project :wink:).
There are a few choices to consider when thinking about fault tolerance in pipelines, and they can be separated into two categories - how you want the full pipeline to behave vs. how you want to handle individual targets.

Choices for handling errors in the full pipeline:

1) Stop the whole pipeline as soon as any target fails, fix the problem, and pick up where you left off.
2) Keep building all of the targets that don't depend on the failed one, and come back to the failed target later.

If you want the first approach, congrats! That's how the pipeline behaves by default and there is no need for you to change anything. If you want the pipeline to keep going but return to build that target later, you should add `error = 'continue'` to your `tar_option_set()` call.
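For reference, a minimal sketch of what that option can look like near the top of `_targets.R` (the `packages` list shown here is just an assumed placeholder, not something this step requires):

```r
# _targets.R (sketch): keep building unaffected targets when one target errors,
# instead of stopping the whole pipeline at the first failure.
library(targets)

tar_option_set(
  packages = c("tidyverse", "dataRetrieval"),  # assumed package list for this pipeline
  error = "continue"                           # record the error and move on
)
```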
Now let's talk about handling errors for individual targets. There are also a few ideas to consider here: you can treat a failed download/upload as a completed build anyway, or you can retry the target `n` times (in case of internet flakiness) before ultimately considering it a failed target. If you want a failure to still be considered a completed build, then consider implementing `tryCatch()` in your download/upload function to gracefully handle errors, return something (e.g. `data.frame()`) from the function, and allow the code to continue. If you want to retry a target before moving on in the pipeline, then we can use the function `retry::retry()`. This is a function from the retry package, which you may or may not have installed. Go ahead and check that you have this package before continuing.
Wrapping a target command with `retry()` will keep building that target until there are no errors OR until it runs out of `max_tries`. You can also set the `when` argument of `retry()` to declare what error message should initiate a rebuild (the input to `when` can be a regular expression).
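To make the first of those two options concrete, here is a hypothetical sketch of the `tryCatch()` pattern; the wrapper function name and arguments are placeholders rather than actual course code:

```r
# Hypothetical wrapper: on any error, log a message and return an empty
# data.frame() so the target still "completes" and downstream code can continue.
get_site_data_safely <- function(site_info, state, parameter) {
  tryCatch(
    get_site_data(site_info, state, parameter),
    error = function(e) {
      message("Data pull failed for ", state, ": ", conditionMessage(e))
      data.frame()  # empty placeholder result
    }
  )
}
```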
[x] Add code to load the package retry at the top of your _targets.R file.
[x] Wrap the `get_site_data()` function in your static branching code with `retry()`. The `retry()` function should look for the error message matching `"Ugh, the internet data transfer failed!"` and should rerun `get_site_data()` a maximum of 30 times (see the sketch after this list).
[x] Now run `tar_make()`. It will redownload data for all of the states since we updated the command for `nwis_data`. It will take a while since it is downloading all of them at least once and may need to retry some up to 30 times. Grab a tea or coffee while you wait (~10 min) - at least there's no babysitting needed!
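For reference, here is a sketch of what the wrapped download target might look like inside your static branching code; the surrounding `tar_map()` structure and the `get_site_data()` argument order are assumptions based on earlier steps of this course, so match them to your own `_targets.R`:

```r
library(retry)

# tar_map() comes from the tarchetypes package
mapped_by_state_targets <- tar_map(
  values = tibble(state_abb = states),   # assumed branching table from the earlier static setup
  tar_target(nwis_inventory, filter(oldest_active_sites, state_cd == state_abb)),
  tar_target(nwis_data,
             retry(get_site_data(nwis_inventory, state_abb, parameter),
                   when = "Ugh, the internet data transfer failed!",
                   max_tries = 30))
  # ... tally and timeseries_png targets unchanged ...
)
```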
Comment once you've committed and pushed your changes to GitHub.
ok.
You've just run a fully functioning pipeline with 212 unique branches (53 states x 4 steps)! Imagine if you had coded that with a `for` loop and just one of those states threw an error? :grimacing:
Now that you have your pipeline running with static branching, let's try to convert it into the other type of branching, dynamic.
In our scenario, it doesn't matter too much whether we pick static or dynamic branching. Both can work for us. I chose to show you static first because inspecting and building pipelines with dynamic branching can be more mysterious. In dynamic branching, the naming given to each branch target is assigned randomly and dynamic branch targets do not appear in the diagram produced by `tar_visnetwork()`. But despite those inconveniences, dynamic branching is needed in many situations in order to build truly robust pipelines, so here we go ...
[x] First, let's drop back down to just a few states while we get this new branching set up. Change to `states <- c('WI', 'MN', 'MI')` and run `tar_destroy()` to reset our pipeline (note: use `tar_destroy()` very sparingly and deliberately, as it wipes all previous pipeline builds).
[x] We no longer need to use `tar_map()` to create a separate object containing our static branching output. In dynamic branching, we just add to individual `tar_target()` calls. We will move each of the four targets from `tar_map()` into the appropriate targets list individually. First, move your "splitter" target `nwis_inventory` into your list of targets just after `oldest_active_sites`.
[ ] Let's adjust this target to follow dynamic branching concepts. In dynamic branching, the output of each target is always combined back into a single object. So, `filter()`ing this dataset by state is not actually going to do anything. Instead of splitting the data apart by a branching variable (remember we used `tibble(state_abb = states)`?) as we do in static branching, we will use the `state_cd` column from `oldest_active_sites` as a grouping variable in preparation for subsequent targets that will be applied over those groups. You then need to add a special targets grouping (`tar_group()`) for it to be treated appropriately. Lastly, the default "iteration" in dynamic branching is by list, but we just set it up to use groups, so we need to change that. In the end, your "splitter" target should look like this:
```r
tar_target(nwis_inventory,
           oldest_active_sites %>%
             group_by(state_cd) %>%
             tar_group(),
           iteration = "group")
```
[x] Now to download the data! Copy your `tar_target()` code for the `nwis_data` target and paste it as a target after your `nwis_inventory` target. Make two small changes: replace `state_abb` with `nwis_inventory$state_cd` and add `pattern = map(nwis_inventory)` as an argument to `tar_target()`. This second part is what turns this into a dynamic branching target. It will apply the `retry(get_site_data())` call to each of the groups we defined in `nwis_inventory`. Continuing this idea, we can still get the state abbreviation to pass to `get_site_data()` by using the `state_cd` column from `nwis_inventory`. Since we grouped by `state_cd`, this will only have the one value.
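Roughly, the dynamically branched download target could look like this; the `get_site_data()` argument order is an assumption carried over from the static version, so adjust it to your own code:

```r
tar_target(nwis_data,
           retry(get_site_data(nwis_inventory, nwis_inventory$state_cd, parameter),
                 when = "Ugh, the internet data transfer failed!",
                 max_tries = 30),
           pattern = map(nwis_inventory))
```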
[x] Add the tallying step. Copy the `tar_target()` code for the `tally` target into your targets list. Add `pattern = map(nwis_data)` as the final argument to `tar_target()` to set that up to dynamically branch based on the same branching from the `nwis_data` target.
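Assuming the tally function is `tally_site_obs()` (matching the `2_process/src/tally_site_obs.R` file mentioned below), the target becomes something like:

```r
tar_target(tally,
           tally_site_obs(nwis_data),
           pattern = map(nwis_data))
```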
[x] Since dynamic branching automatically combines the output into a single object, the `tally` target already represents the combined tallies per state. We no longer need `obs_tallies`! Delete that target :) Make sure you update the downstream targets that depended on `obs_tallies` and have them use `tally` instead (just `data_coverage_png` in our case).
[x] We are on our final step to convert to dynamic branching - our timeseries plots. This is a bit trickier because we were using our static branching `values` table to define the PNG filenames, but now we won't have that available to us. Instead, we will build the filenames as part of our argument directly in the function. First, copy the target code for `timeseries_png` into the list of targets. Replace `state_plot_files` in the `plot_site_data()` command with the `sprintf()` command used to define `values` in the `tar_map()` command, which creates the string with the filename. Replace `state_abb` with `unique(nwis_data$State)`. Add `pattern = map(nwis_data)` as the final argument to `tar_target()`.
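Put together, the plotting target might look roughly like this; the output path and the `plot_site_data()` argument order are assumptions, so adapt them to your existing code:

```r
tar_target(timeseries_png,
           plot_site_data(sprintf("3_visualize/out/timeseries_%s.png",
                                  unique(nwis_data$State)),
                          nwis_data, parameter),
           format = "file",
           pattern = map(nwis_data))
```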
[x] Once again, dynamic branching will automatically combine the output into a single object. For file targets, that means a vector of the filenames. So, we need to change our `summary_state_timeseries_csv` target to take advantage of that. First, it can be a regular `tar_target()`, so replace `tar_combine()` with `tar_target()`. Next, delete `mapped_by_state_targets$timeseries_png` so that the very next argument to `tar_target()` is the command. Edit the second argument to the command to be `timeseries_png` instead of `!!!.x`. Note that I didn't ask you to add `pattern = map()` to this function. We don't need to add `pattern` here because we want to use the combined target as input, not each individual filename.
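After those edits, the target could look roughly like this (the output filename is an assumption, and `names()` gets added around `timeseries_png` in the next step):

```r
tar_target(summary_state_timeseries_csv,
           summarize_targets("3_summarize/out/summary_state_timeseries.csv",
                             timeseries_png),  # becomes names(timeseries_png) in the next step
           format = "file")
```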
[x] We need to adjust the function we used in that last target because it was set up to handle the output from a static branching step, not a dynamic one. There are two differences: 1) the static branching output for the filenames was not a single object, but a collection of objects (hence, `...` as our second argument for `summarize_targets()`), and 2) the individual filenames are known by the static branching approach, but only the hashed target names for the files are known in the dynamic branching approach. To fix the first difference, go to the `3_summarize/src/summarize_targets.R` file and update the function to accept a vector rather than multiple vectors for the second argument. To fix the second difference, go back to your `_targets.R` file and add `names()` around the input `timeseries_png`. This passes in the target names for the dynamically created files, not just the filenames. Otherwise, `tar_meta()` won't know what you are talking about. The last note is that targets v0.5.0.9000 complains about passing in a vector as "ambiguous". You can fix this by wrapping your file vector argument used in `tar_meta()` with `all_of()` in your `summarize_targets()` function to get rid of the message.
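A minimal sketch of the updated function, assuming it simply writes selected `tar_meta()` columns to a CSV (the exact columns kept here are an assumption, not copied from the course repo; dplyr and readr are assumed to be available):

```r
# 3_summarize/src/summarize_targets.R (sketch)
summarize_targets <- function(summary_csv, target_names) {
  # target_names is now a single character vector of dynamic branch target names
  summary_info <- tar_meta(all_of(target_names)) %>%   # all_of() silences the "ambiguous" message
    select(tar_name = name, filepath = path, hash = data) %>%
    mutate(filepath = unlist(filepath))                # path is a list column; flatten it
  readr::write_csv(summary_info, summary_csv)
  return(summary_csv)
}
```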
[x] Clean up! Delete any of the remaining static branching content. Delete the code that creates `mapped_by_state_targets` and make sure that `mapped_by_state_targets` does not appear in your targets list.
[ ] Run `tar_visnetwork()` and inspect your pipeline diagram. Do all the steps and dependencies make sense? Do you notice anything that is disconnected from the pipeline? You may have caught this during the previous clean up step, but my pipeline looks like this:
Do you see the function `combine_obs_tallies` in the bottom left that is disconnected from the pipeline? There are a few ways to move forward knowing that something is disconnected: 1) fixing it because it should be connected, 2) leaving it knowing that you will need it in the future, or 3) deleting it because it is no longer needed. We will do the third - go ahead and delete that function. It exists in `2_process/src/tally_site_obs.R`. Re-run `tar_visnetwork()`. It should no longer appear.
[x] Now run `tar_make()`. Remember that we set this up to use only three states at first. What do you notice in your console as the pipeline builds with respect to target naming?
[x] Once your pipeline builds using dynamic branching across three states, change your `states` object back to the full list:
states <- c('AL','AZ','AR','CA','CO','CT','DE','DC','FL','GA','ID','IL','IN','IA',
'KS','KY','LA','ME','MD','MA','MI','MN','MS','MO','MT','NE','NV','NH',
'NJ','NM','NY','NC','ND','OH','OK','OR','PA','RI','SC','SD','TN','TX',
'UT','VT','VA','WA','WV','WI','WY','AK','HI','GU','PR')
[x] Run `tar_visnetwork()`. Is it the same or different since updating the states?
[x] Time to test the full thing! Run `tar_make()`. Since we used `tar_destroy()` between the last full state build and now, it will take a while (~10 min).
Once you've committed and pushed your changes to GitHub, comment about some of the differences you notice when running this pipeline using the dynamic branching approach vs our original static branching approach. Include a screenshot of the result in your viewer from your last `tar_visnetwork()` showing your dynamic branching pipeline.
target names from the branching have random alphanumeric names, e.g. `timeseries_png_711a18aa`.
This is fun, right? A strong test of code is whether it can be transferred to solve a slightly different problem, so now let's try applying this pipeline to water temperature data instead of discharge data.
I'm about to ask you to do a few tricky/interesting things here with respect to git. Let's acknowledge them even though we won't explore them fully in this course:

- You'll be making a copy of your local repository, so you'll end up with two local clones that each have a remote named "origin" pointing to your GitHub repository. This feels weird but turns out to be fine. If you wanted to pass code changes between your local clones, you'd push from one clone up to GitHub, then pull from GitHub into the second clone.

The above notes are really just intended to raise your awareness about complicated things you can do with git and GitHub. When you encounter needs or situations like these in real projects, just remember to think before acting, and feel free to ask questions of your teammates to make sure you get the results you intend.
[x] Make a copy of your whole local repo folder. You can just use standard file copying methods (File Explorer, `cp`, whatever you want). Name this new top-level folder "ds-pipelines-targets-3-temperature". Create a new RStudio project from that existing directory, using "ds-pipelines-targets-3-temperature" as the directory, and open it in a new RStudio session.
[x] In this new project, create a second local branch, this time called "temperature", and push this new branch up to the remote location "origin". Check to make sure you're already on the "scale-up" branch (e.g., with `git status` or by looking at the Git tab in RStudio), and then:
git checkout -b temperature
git push -u origin temperature
[x] Change the parameter code (`parameter` in `_targets.R`) from `00060` (flow) to `00010` (water temperature).
[x] Remove 'VT' and 'GU' from the `states` target in `_targets.R`. It turns out that NWIS returns errors for these two states/territories for temperature, so we'll just skip them.
[x] Run `library(targets)` (because you're in a new R session).
[x] You copied the whole pipeline directory, and with it, the previous pipeline's `_targets/` directory and build status info. Let's wipe that out before we build this new pipeline with temperature data. Double check that you are in your ds-pipelines-targets-3-temperature RStudio project and then run `tar_destroy()`. USE THIS VERY CAUTIOUSLY AS IT WILL CAUSE YOU TO HAVE TO REBUILD EVERYTHING.
[x] Build the full pipeline using `tar_make()`. Note the different console messages this time. It would be rare, but there might be states that hit our `max_tries` cap of 30 and fail. This can create weird errors later in the pipeline. So, if you see some weird errors on some of the visualization steps, try running `tar_outdated()` to see if there are incomplete state data targets. If there are, no worries; just run `tar_make()` again and it should complete.
When everything has run successfully, use a comment to share the images from `timeseries_KY.png` and `data_coverage.png`. Take a second and peruse the other `timeseries_*.png` files. Did you find anything surprising? Include any observations you want to share about the build.
That temperature data build should have worked pretty darn smoothly, with fault tolerance for those data pulls with simulated unreliability, a rebuild of everything that needed a rebuild, and console messages to inform you of progress through the pipeline. Yay!
I'll just share a few additional thoughts and then we'll wrap up.
Orphaned task-step files: You might have been disappointed to note that there's still a timeseries_VT.png hanging out in the repository even though VT is now excluded from the inventory and the summary maps. Worse still, that file shows discharge data! There's no way to use targets to discover and remove such orphaned artifacts of previous builds, because this is a file not connected to the new pipeline and targets doesn't even know that it exists. So it's a weakness of these pipelines that you may want to bear in mind, though this particular weakness has never caused big problems for us. Still, to avoid or fix it, you could delete such leftover files by hand, or rebuild from a fresh copy of the repository so that every output file present was built with `parameter = '00010'`.

Forcing rebuilds of targets: One trick I thought I'd be sharing more of in this course is the use of `tar_invalidate()`, which forces a rebuild of specified targets. This can seem necessary when you know that there has been a change but the pipeline is not detecting it. We've used this forced rebuild approach a lot in the past, but I can no longer think of a situation where it's truly necessary. The best use of `tar_invalidate()` is as a temporary bandaid or diagnostic tool rather than as a permanent part of your pipeline. Instead of forcing something to rebuild, you should determine the root cause of it being skipped and make sure the pipeline is appropriately set up.
The one example where it may really feel necessary is when you want to force a complete redo of downloads from a web service. You could use `tar_invalidate()` for this, but in these pipelines courses we've also introduced you to the idea of a `dummy` argument to data-downloading functions that you can manually edit to trigger a rebuild. This is especially handy if you use the current datetime as the contents of that dummy variable, because then you have a git-committed record of when you last pulled the data. In our example project you might want a dummy variable that's shared between the inventory data pull and the calls to `get_site_data()`.
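As a rough illustration of the dummy-argument idea (this assumes you add a `dummy` argument that `get_site_data()` simply ignores; the names and placement here are illustrative, not the course's exact code):

```r
# Manually edit this datetime string whenever you want to force fresh downloads;
# every target whose command uses dummy_date will be invalidated and rebuilt.
tar_target(dummy_date, "2021-06-01 10:00:00"),

tar_target(nwis_data,
           retry(get_site_data(nwis_inventory, nwis_inventory$state_cd, parameter,
                               dummy = dummy_date),
                 when = "Ugh, the internet data transfer failed!",
                 max_tries = 30),
           pattern = map(nwis_inventory))
```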
Fetching results from the targets database: We've already used these functions in this course, but I want to share them again here to help you remember. They're just so handy! To access the current value of a target from your pipeline, just call
tar_load('oldest_active_sites')
or for fetching a file target,
summary_state_timeseries <- readr::read_csv(tar_read('summary_state_timeseries_csv'))
The nice thing about these functions is that they don't take time to rebuild or even check the currentness of a target; they just load or pass the object to you.
Phew, what a lot you've learned in this course! Let's get your work onto GitHub.
[x] Commit your code changes for the temperature analysis, remembering to commit to the new branch ("temperature"). Push your changes to GitHub. You won't make a PR for this branch - it can just live on as an alternative to the "main" branch that documents the changes needed to analyze temperature instead of discharge.
[ ] Create a PR to merge the "scale-up" branch into "main". In the PR description, post your favorite figure produced during the course and any other observations you want to share.
Your pipeline is looking great, @slevin75! It's time to put it through its paces and experience the benefits of a well-plumbed pipeline. The larger your pipeline becomes, the more useful are the tools you've learned in this course.
In this issue you will:
:keyboard: Activity: Check for targets updates
Before you get started, make sure you have the most up-to-date version of targets:
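One quick way to check from your R console:

```r
packageVersion("targets")  # should report >= 0.5.0.9002
```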
You should have package version >= 0.5.0.9002. If you don't, reinstall with: