github-learning-lab[bot] opened this issue 3 years ago
Before you edit any code, create a local branch called "scale-up" and push that branch up to the remote location "origin" (which is the GitHub host of your repository).
git checkout main
git pull origin main
git checkout -b scale-up
git push -u origin scale-up
The first two lines aren't strictly necessary when you don't have any new branches, but it's a good habit to head back to main
and sync with "origin" whenever you're transitioning between branches and/or PRs.
okey dokey
[ ] Expand the pipeline to include all of the U.S. states and some territories. Specifically, modify the `states` target in `_targets.R`:
states <- c('AL','AZ','AR','CA','CO','CT','DE','DC','FL','GA','ID','IL','IN','IA',
'KS','KY','LA','ME','MD','MA','MI','MN','MS','MO','MT','NE','NV','NH',
'NJ','NM','NY','NC','ND','OH','OK','OR','PA','RI','SC','SD','TN','TX',
'UT','VT','VA','WA','WV','WI','WY','AK','HI','GU','PR')
[ ] Run `tar_make()` once. Note what gets [re]built.
[ ] Run `tar_make()` again. Note what gets [re]built.
[ ] For fun, here are some optional math questions. Assume that you just added 46 new states, and each state data pull has a 50% chance of failing. How many calls to `tar_make()` should you expect to make before the pipeline is fully built? Comment on what you're seeing.
the internet is failing many times...
Rather than babysitting repeated `tar_make()` calls until all the states build, it's time to adapt our approach to running `tar_make()` when there are steps plagued by network failures. A lot of times, you just need to retry the download/upload again and it will work. This is not always the case, though, and sometimes you need to address the failures. The targets package does not currently offer this fault tolerance, so the approaches discussed here are designed by our group to provide fault tolerance for tasks such as this data pull (including those where the "failures" are all real rather than being largely synthetic as in this project :wink:).
There are a few choices to consider when thinking about fault tolerance in pipelines, and they can be separated into two categories - how you want the full pipeline to behave vs. how you want to handle individual targets.

Choices for handling errors in the full pipeline:

1) Stop the whole pipeline as soon as any target fails, fix the problem, and pick up where you left off.
2) Keep building all of the targets that don't depend on the failed one, and come back to the failed target later.

If you want the first approach, congrats! That's how the pipeline behaves by default and there is no need for you to change anything. If you want the pipeline to keep going but return to build that target later, you should add `error = 'continue'` to your `tar_option_set()` call.
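For reference, a minimal sketch of what that option can look like near the top of `_targets.R` (the `packages` list shown here is just an assumed placeholder, not something this step requires):

```r
# _targets.R (sketch): keep building unaffected targets when one target errors,
# instead of stopping the whole pipeline at the first failure.
library(targets)

tar_option_set(
  packages = c("tidyverse", "dataRetrieval"),  # assumed package list for this pipeline
  error = "continue"                           # record the error and move on
)
```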
Now let's talk about handling errors for individual targets. There are also a few ideas to consider here: you can treat a failed download/upload as a completed build anyway, or you can retry the target `n` times (in case of internet flakiness) before ultimately considering it a failed target. If you want a failure to still be considered a completed build, then consider implementing `tryCatch()` in your download/upload function to gracefully handle errors, return something (e.g. `data.frame()`) from the function, and allow the code to continue. If you want to retry a target before moving on in the pipeline, then we can use the function `retry::retry()`. This is a function from the retry package, which you may or may not have installed. Go ahead and check that you have this package before continuing.
Wrapping a target command with `retry()` will keep building that target until there are no errors OR until it runs out of `max_tries`. You can also set the `when` argument of `retry()` to declare what error message should initiate a rebuild (the input to `when` can be a regular expression).
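To make the first of those two options concrete, here is a hypothetical sketch of the `tryCatch()` pattern; the wrapper function name and arguments are placeholders rather than actual course code:

```r
# Hypothetical wrapper: on any error, log a message and return an empty
# data.frame() so the target still "completes" and downstream code can continue.
get_site_data_safely <- function(site_info, state, parameter) {
  tryCatch(
    get_site_data(site_info, state, parameter),
    error = function(e) {
      message("Data pull failed for ", state, ": ", conditionMessage(e))
      data.frame()  # empty placeholder result
    }
  )
}
```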
[x] Add code to load the package retry at the top of your _targets.R file.
[x] Wrap the `get_site_data()` function in your static branching code with `retry()`. The `retry()` function should look for the error message matching `"Ugh, the internet data transfer failed!"` and should rerun `get_site_data()` a maximum of 30 times (see the sketch after this list).
[x] Now run `tar_make()`. It will redownload data for all of the states since we updated the command for `nwis_data`. It will take a while since it is downloading all of them at least once and may need to retry some up to 30 times. Grab a tea or coffee while you wait (~10 min) - at least there's no babysitting needed!
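For reference, here is a sketch of what the wrapped download target might look like inside your static branching code; the surrounding `tar_map()` structure and the `get_site_data()` argument order are assumptions based on earlier steps of this course, so match them to your own `_targets.R`:

```r
library(retry)

# tar_map() comes from the tarchetypes package
mapped_by_state_targets <- tar_map(
  values = tibble(state_abb = states),   # assumed branching table from the earlier static setup
  tar_target(nwis_inventory, filter(oldest_active_sites, state_cd == state_abb)),
  tar_target(nwis_data,
             retry(get_site_data(nwis_inventory, state_abb, parameter),
                   when = "Ugh, the internet data transfer failed!",
                   max_tries = 30))
  # ... tally and timeseries_png targets unchanged ...
)
```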
Comment once you've committed and pushed your changes to GitHub.
ok.
You've just run a fully functioning pipeline with 212 unique branches (53 states x 4 steps)! Imagine if you had coded that with a `for` loop and just one of those states threw an error? :grimacing:
Now that you have your pipeline running with static branching, let's try to convert it into the other type of branching, dynamic.
In our scenario, it doesn't matter too much whether we pick static or dynamic branching. Both can work for us. I chose to show you static first because inspecting and building pipelines with dynamic branching can be more mysterious. In dynamic branching, the naming given to each branch target is assigned randomly and dynamic branch targets do not appear in the diagram produced by `tar_visnetwork()`. But despite those inconveniences, dynamic branching is needed in many situations in order to build truly robust pipelines, so here we go ...
[x] First, let's drop back down to just a few states while we get this new branching set up. Change to `states <- c('WI', 'MN', 'MI')` and run `tar_destroy()` to reset our pipeline (note: use `tar_destroy()` very sparingly and deliberately, as it wipes all previous pipeline builds).
[x] We no longer need to use `tar_map()` to create a separate object containing our static branching output. In dynamic branching, we just add to individual `tar_target()` calls. We will move each of the four targets from `tar_map()` into the appropriate targets list individually. First, move your "splitter" target `nwis_inventory` into your list of targets just after `oldest_active_sites`.
[ ] Let's adjust this target to follow dynamic branching concepts. In dynamic branching, the output of each target is always combined back into a single object. So, `filter()`ing this dataset by state is not actually going to do anything. Instead of splitting the data apart by a branching variable (remember we used `tibble(state_abb = states)`?) as we do in static branching, we will use the `state_cd` column from `oldest_active_sites` as a grouping variable in preparation for subsequent targets that will be applied over those groups. You then need to add a special targets grouping (`tar_group()`) for it to be treated appropriately. Lastly, the default "iteration" in dynamic branching is by list, but we just set it up to use groups, so we need to change that. In the end, your "splitter" target should look like this:
```r
tar_target(nwis_inventory,
           oldest_active_sites %>%
             group_by(state_cd) %>%
             tar_group(),
           iteration = "group")
```
[x] Now to download the data! Copy your `tar_target()` code for the `nwis_data` target and paste it as a target after your `nwis_inventory` target. Make two small changes: replace `state_abb` with `nwis_inventory$state_cd` and add `pattern = map(nwis_inventory)` as an argument to `tar_target()`. This second part is what turns this into a dynamic branching target. It will apply the `retry(get_site_data())` call to each of the groups we defined in `nwis_inventory`. Continuing this idea, we can still get the state abbreviation to pass to `get_site_data()` by using the `state_cd` column from `nwis_inventory`. Since we grouped by `state_cd`, this will only have the one value.
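Roughly, the dynamically branched download target could look like this; the `get_site_data()` argument order is an assumption carried over from the static version, so adjust it to your own code:

```r
tar_target(nwis_data,
           retry(get_site_data(nwis_inventory, nwis_inventory$state_cd, parameter),
                 when = "Ugh, the internet data transfer failed!",
                 max_tries = 30),
           pattern = map(nwis_inventory))
```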
[x] Add the tallying step. Copy the `tar_target()` code for the `tally` target into your targets list. Add `pattern = map(nwis_data)` as the final argument to `tar_target()` to set that up to dynamically branch based on the same branching from the `nwis_data` target.
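Assuming the tally function is `tally_site_obs()` (matching the `2_process/src/tally_site_obs.R` file mentioned below), the target becomes something like:

```r
tar_target(tally,
           tally_site_obs(nwis_data),
           pattern = map(nwis_data))
```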
[x] Since dynamic branching automatically combines the output into a single object, the `tally` target already represents the combined tallies per state. We no longer need `obs_tallies`! Delete that target :) Make sure you update the downstream targets that depended on `obs_tallies` and have them use `tally` instead (just `data_coverage_png` in our case).
[x] We are on our final step to convert to dynamic branching - our timeseries plots. This is a bit trickier because we were using our static branching `values` table to define the PNG filenames, but now we won't have that available to us. Instead, we will build the filenames as part of our argument directly in the function. First, copy the target code for `timeseries_png` into the list of targets. Replace `state_plot_files` in the `plot_site_data()` command with the `sprintf()` command used to define `values` in the `tar_map()` command, which creates the string with the filename. Replace `state_abb` with `unique(nwis_data$State)`. Add `pattern = map(nwis_data)` as the final argument to `tar_target()`.
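Put together, the plotting target might look roughly like this; the output path and the `plot_site_data()` argument order are assumptions, so adapt them to your existing code:

```r
tar_target(timeseries_png,
           plot_site_data(sprintf("3_visualize/out/timeseries_%s.png",
                                  unique(nwis_data$State)),
                          nwis_data, parameter),
           format = "file",
           pattern = map(nwis_data))
```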
[x] Once again, dynamic branching will automatically combine the output into a single object. For file targets, that means a vector of the filenames. So, we need to change our `summary_state_timeseries_csv` target to take advantage of that. First, it can be a regular `tar_target()`, so replace `tar_combine()` with `tar_target()`. Next, delete `mapped_by_state_targets$timeseries_png` so that the very next argument to `tar_target()` is the command. Edit the second argument to the command to be `timeseries_png` instead of `!!!.x`. Note that I didn't ask you to add `pattern = map()` to this function. We don't need to add `pattern` here because we want to use the combined target as input, not each individual filename.
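After those edits, the target could look roughly like this (the output filename is an assumption, and `names()` gets added around `timeseries_png` in the next step):

```r
tar_target(summary_state_timeseries_csv,
           summarize_targets("3_summarize/out/summary_state_timeseries.csv",
                             timeseries_png),  # becomes names(timeseries_png) in the next step
           format = "file")
```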
[x] We need to adjust the function we used in that last target because it was set up to handle the output from a static branching step, not a dynamic one. There are two differences: 1) the static branching output for the filenames was not a single object, but a collection of objects (hence, `...` as our second argument for `summarize_targets()`), and 2) the individual filenames are known by the static branching approach, but only the hashed target names for the files are known in the dynamic branching approach. To fix the first difference, go to the `3_summarize/src/summarize_targets.R` file and update the function to accept a vector rather than multiple vectors for the second argument. To fix the second difference, go back to your `_targets.R` file and add `names()` around the input `timeseries_png`. This passes in the target names for the dynamically created files, not just the filenames. Otherwise, `tar_meta()` won't know what you are talking about. The last note is that targets v0.5.0.9000 complains about passing in a vector as "ambiguous". You can fix this by wrapping your file vector argument used in `tar_meta()` with `all_of()` in your `summarize_targets()` function to get rid of the message.
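A minimal sketch of the updated function, assuming it simply writes selected `tar_meta()` columns to a CSV (the exact columns kept here are an assumption, not copied from the course repo; dplyr and readr are assumed to be available):

```r
# 3_summarize/src/summarize_targets.R (sketch)
summarize_targets <- function(summary_csv, target_names) {
  # target_names is now a single character vector of dynamic branch target names
  summary_info <- tar_meta(all_of(target_names)) %>%   # all_of() silences the "ambiguous" message
    select(tar_name = name, filepath = path, hash = data) %>%
    mutate(filepath = unlist(filepath))                # path is a list column; flatten it
  readr::write_csv(summary_info, summary_csv)
  return(summary_csv)
}
```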
[x] Clean up! Delete any of the remaining static branching content. Delete the code that creates `mapped_by_state_targets` and make sure that `mapped_by_state_targets` does not appear in your targets list.
[ ] Run `tar_visnetwork()` and inspect your pipeline diagram. Do all the steps and dependencies make sense? Do you notice anything that is disconnected from the pipeline? You may have caught this during the previous clean up step, but my pipeline looks like this:
Do you see the function `combine_obs_tallies` in the bottom left that is disconnected from the pipeline? There are a few ways to move forward knowing that something is disconnected: 1) fixing it because it should be connected, 2) leaving it knowing that you will need it in the future, or 3) deleting it because it is no longer needed. We will do the third - go ahead and delete that function. It exists in `2_process/src/tally_site_obs.R`. Re-run `tar_visnetwork()`. It should no longer appear.
[x] Now run `tar_make()`. Remember that we set this up to use only three states at first. What do you notice in your console as the pipeline builds with respect to target naming?
[x] Once your pipeline builds using dynamic branching across three states, change your `states` object back to the full list:
states <- c('AL','AZ','AR','CA','CO','CT','DE','DC','FL','GA','ID','IL','IN','IA',
'KS','KY','LA','ME','MD','MA','MI','MN','MS','MO','MT','NE','NV','NH',
'NJ','NM','NY','NC','ND','OH','OK','OR','PA','RI','SC','SD','TN','TX',
'UT','VT','VA','WA','WV','WI','WY','AK','HI','GU','PR')
[x] Run `tar_visnetwork()`. Is it the same or different since updating the states?
[x] Time to test the full thing! Run `tar_make()`. Since we used `tar_destroy()` between the last full state build and now, it will take a while (~10 min).
Once you've committed and pushed your changes to GitHub, comment about some of the differences you notice when running this pipeline using the dynamic branching approach vs our original static branching approach. Include a screenshot of the result in your viewer from your last `tar_visnetwork()` showing your dynamic branching pipeline.
target names from the branching have random alphanumeric names, e.g. `timeseries_png_711a18aa`.
This is fun, right? A strong test of code is whether it can be transferred to solve a slightly different problem, so now let's try applying this pipeline to water temperature data instead of discharge data.
I'm about to ask you to do a few tricky/interesting things here with respect to git. Let's acknowledge them even though we won't explore them fully in this course:

- You'll be making a copy of your local repository, so you'll end up with two local clones that each have a remote named "origin" pointing to your GitHub repository. This feels weird but turns out to be fine. If you wanted to pass code changes between your local clones, you'd push from one clone up to GitHub, then pull from GitHub into the second clone.

The above notes are really just intended to raise your awareness about complicated things you can do with git and GitHub. When you encounter needs or situations like these in real projects, just remember to think before acting, and feel free to ask questions of your teammates to make sure you get the results you intend.
[x] Make a copy of your whole local repo folder. You can just use standard file copying methods (File Explorer, `cp`, whatever you want). Name this new top-level folder "ds-pipelines-targets-3-temperature". Create a new RStudio project from that existing directory, using "ds-pipelines-targets-3-temperature" as the directory, and open it in a new RStudio session.
[x] In this new project, create a second local branch, this time called "temperature", and push this new branch up to the remote location "origin". Check to make sure you're already on the "scale-up" branch (e.g., with `git status` or by looking at the Git tab in RStudio), and then:
git checkout -b temperature
git push -u origin temperature
[x] Change the parameter code (`parameter` in `_targets.R`) from `00060` (flow) to `00010` (water temperature).
[x] Remove 'VT' and 'GU' from the `states` target in `_targets.R`. It turns out that NWIS returns errors for these two states/territories for temperature, so we'll just skip them.
[x] Run `library(targets)` (because you're in a new R session).
[x] You copied the whole pipeline directory, and with it, the previous pipeline's `_targets/` directory and build status info. Let's wipe that out before we build this new pipeline with temperature data. Double check that you are in your ds-pipelines-targets-3-temperature RStudio project and then run `tar_destroy()`. USE THIS VERY CAUTIOUSLY AS IT WILL CAUSE YOU TO HAVE TO REBUILD EVERYTHING.
[x] Build the full pipeline using `tar_make()`. Note the different console messages this time. It would be rare, but there might be states that hit our `max_tries` cap of 30 and fail. This can create weird errors later in the pipeline. So, if you see some weird errors on some of the visualization steps, try running `tar_outdated()` to see if there are incomplete state data targets. If there are, no worries; just run `tar_make()` again and it should complete.
When everything has run successfully, use a comment to share the images from `timeseries_KY.png` and `data_coverage.png`. Take a second and peruse the other `timeseries_*.png` files. Did you find anything surprising? Include any observations you want to share about the build.
That temperature data build should have worked pretty darn smoothly, with fault tolerance for those data pulls with simulated unreliability, a rebuild of everything that needed a rebuild, and console messages to inform you of progress through the pipeline. Yay!
I'll just share a few additional thoughts and then we'll wrap up.
Orphaned task-step files: You might have been disappointed to note that there's still a timeseries_VT.png hanging out in the repository even though VT is now excluded from the inventory and the summary maps. Worse still, that file shows discharge data! There's no way to use targets to discover and remove such orphaned artifacts of previous builds, because this is a file not connected to the new pipeline and targets doesn't even know that it exists. So it's a weakness of these pipelines that you may want to bear in mind, though this particular weakness has never caused big problems for us. Still, to avoid or fix it, you could delete such leftover files by hand, or rebuild from a fresh copy of the repository so that every output file present was built with `parameter = '00010'`.

Forcing rebuilds of targets: One trick I thought I'd be sharing more of in this course is the use of `tar_invalidate()`, which forces a rebuild of specified targets. This can seem necessary when you know that there has been a change but the pipeline is not detecting it. We've used this forced rebuild approach a lot in the past, but I can no longer think of a situation where it's truly necessary. The best use of `tar_invalidate()` is as a temporary bandaid or diagnostic tool rather than as a permanent part of your pipeline. Instead of forcing something to rebuild, you should determine the root cause of it being skipped and make sure the pipeline is appropriately set up.
The one example where it may really feel necessary is when you want to force a complete redo of downloads from a web service. You could use `tar_invalidate()` for this, but in these pipelines courses we've also introduced you to the idea of a `dummy` argument to data-downloading functions that you can manually edit to trigger a rebuild. This is especially handy if you use the current datetime as the contents of that dummy variable, because then you have a git-committed record of when you last pulled the data. In our example project you might want a dummy variable that's shared between the inventory data pull and the calls to `get_site_data()`.
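As a rough illustration of the dummy-argument idea (this assumes you add a `dummy` argument that `get_site_data()` simply ignores; the names and placement here are illustrative, not the course's exact code):

```r
# Manually edit this datetime string whenever you want to force fresh downloads;
# every target whose command uses dummy_date will be invalidated and rebuilt.
tar_target(dummy_date, "2021-06-01 10:00:00"),

tar_target(nwis_data,
           retry(get_site_data(nwis_inventory, nwis_inventory$state_cd, parameter,
                               dummy = dummy_date),
                 when = "Ugh, the internet data transfer failed!",
                 max_tries = 30),
           pattern = map(nwis_inventory))
```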
Fetching results from the targets database: We've already used these functions in this course, but I want to share them again here to help you remember. They're just so handy! To access the current value of a target from your pipeline, just call
tar_load('oldest_active_sites')
or for fetching a file target,
summary_state_timeseries <- readr::read_csv(tar_read('summary_state_timeseries_csv'))
The nice thing about these functions is that they don't take time to rebuild or even check the currentness of a target; they just load or pass the object to you.
Phew, what a lot you've learned in this course! Let's get your work onto GitHub.
[x] Commit your code changes for the temperature analysis, remembering to commit to the new branch ("temperature"). Push your changes to GitHub. You won't make a PR for this branch - it can just live on as an alternative to the "main" branch that documents the changes needed to analyze temperature instead of discharge.
[ ] Create a PR to merge the "scale-up" branch into "main". In the PR description, post your favorite figure produced during the course and any other observations you want to share.
Your pipeline is looking great, @slevin75! It's time to put it through its paces and experience the benefits of a well-plumbed pipeline. The larger your pipeline becomes, the more useful are the tools you've learned in this course.
In this issue you will:
:keyboard: Activity: Check for targets updates
Before you get started, make sure you have the most up-to-date version of targets:
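One quick way to check from your R console:

```r
packageVersion("targets")  # should report >= 0.5.0.9002
```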
You should have package version >= 0.5.0.9002. If you don't, reinstall with: