jordansread / ds-pipelines-2

https://lab.github.com/USGS-R/scipiper-tips-and-tricks

How to get past the gotchas without getting gotten again #2

Closed github-learning-lab[bot] closed 4 years ago

github-learning-lab[bot] commented 4 years ago

In this section, we're going to go one by one through a series of tips that will help you avoid common pitfalls (or gotchas!) in pipelines. These tips will help you in the next sections and in future work. A quick list of what's to come:

- How to inspect parts of the pipeline and variables within functions
- Visualizing and understanding the status of dependencies in a pipeline
- Using which_dirty() and why_dirty() to explore the status of pipeline targets
- What cyclical dependencies are and how to avoid them
- Creating side-effect targets or undocumented inputs
- How (not to) depend on a directory for changes
- What to do when you want to specify a non-build-object input to a function
- The target_name special variable

:keyboard: add a comment to this issue and the bot will respond with the next topic


I'll sit patiently until you comment

jordansread commented 4 years ago

jkl

github-learning-lab[bot] commented 4 years ago

How to inspect parts of the pipeline and variables within functions

If you've written your own functions or scripts before, you may have run into the red breakpoint dot :red_circle: on the left side of your script window:

[image: the red breakpoint dot in the script editor margin]

Breakpoints allow you to run a function (or script) up until the line of the breakpoint, and then the evaluation pauses. You are able to inspect all variables available at that point in the evaluation, and even step carefully forward one line at a time. It is out of scope of this exercise to go through exactly how to use debuggers, but they are powerful and helpful tools. It would be a good idea to read up on them if you haven't run into breakpoints yet.

In scipiper, you can't set a breakpoint in the "normal" way, which would be clicking next to the line number after you've sourced the script. Instead, you need to use the other method for debugging in R: adding a call to the browser() function on the line where you'd like evaluation to pause.
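Once browser() pauses evaluation, you interact with the halted function from the console. A few single-letter commands from R's standard debugger are worth knowing:

# At the Browse[1]> prompt:
#   n - evaluate the next line, then pause again
#   s - step into the function call on the next line
#   c - continue running until the next browser() call (or the end of the function)
#   Q - quit the browser and abort the current call
ls()    # lists every variable available at the paused point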


You have a working, albeit brittle, pipeline in your course repository. You can try it out with scipiper::scmake(). This pipeline has a number of things you'll work to fix later, but for now, it is a useful reference. The pipeline contains several functions which are defined in .R files.

So, if you wanted to look at which download_files were created within the download_nwis_data function, you could set a breakpoint by adding browser() to the "1_fetch/src/get_nwis_data.R" file:

browser()

Then, running scmake() will land you right in the middle of line 8. Give it a try on your own.
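To make this concrete, here's a hypothetical sketch of what the modified "1_fetch/src/get_nwis_data.R" could look like; the real body of download_nwis_data in your repo likely differs, and the site numbers are just borrowed from the pipeline:

download_nwis_data <- function(sites = c('01427207', '01435000')) {
  download_files <- c()
  for (site in sites) {
    fileout <- sprintf('1_fetch/out/nwis_%s_data.csv', site)
    download_nwis_site_data(fileout)
    download_files <- c(download_files, fileout)
  }
  browser()  # evaluation pauses here; inspect download_files, site, etc.
  return(download_files)
}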


:keyboard: comment on where you think you might find browser() handy in future pipelines.


I'll sit patiently until you comment

jordansread commented 4 years ago

f

github-learning-lab[bot] commented 4 years ago

Visualizing and understanding the status of dependencies in a pipeline

Seeing the structure of a pipeline as a visual is powerful. Viewing connections between targets and the direction data is flowing in can help you better understand the role of pipelines in data science work. Once you are more familiar with pipelines, using the same visuals can help you diagnose problems.

Below is a remakefile that is very similar to the one you have in your code repository (the packages and sources fields were removed for brevity, but they are unchanged):

targets:
  all:
    depends: 3_visualize/out/figure_1.png

  site_data:
    command: download_nwis_data()

  1_fetch/out/site_info.csv:
    command: nwis_site_info(fileout = '1_fetch/out/site_info.csv', site_data)

  1_fetch/out/nwis_01427207_data.csv:
    command: download_nwis_site_data('1_fetch/out/nwis_01427207_data.csv')

  1_fetch/out/nwis_01435000_data.csv:
    command: download_nwis_site_data('1_fetch/out/nwis_01435000_data.csv')

  site_data_clean:
    command: process_data(site_data)

  site_data_annotated:
    command: annotate_data(site_data_clean, site_filename = '1_fetch/out/site_info.csv')

  site_data_styled:
    command: style_data(site_data_annotated)    

  3_visualize/out/figure_1.png:
    command: plot_nwis_timeseries(fileout = '3_visualize/out/figure_1.png', 
      site_data_styled, width = 12, height = 7, units = I('in'))

Two file targets ("1_fetch/out/nwis_01427207_data.csv" and "1_fetch/out/nwis_01435000_data.csv") were added to this remakefile, but there were no changes to the functions, since download_nwis_site_data() already exists and is used to create a single file that contains water monitoring information for a single site.


remake::diagram()

The remake package has a nice function called diagram() that we haven't covered yet. It produces a dependency diagram for the target(s) you specify (remember that all is the default target). For this modified remakefile, calling that function with the default arguments produces:

remake::diagram()

[image: dependency diagram produced by remake::diagram()]

If you run the same command, you'll see something similar but the two new files won't be included.


Seeing this diagram helps develop a greater understanding of some of the earlier concepts from intro-to-pipelines. Here, you can clearly see the connection between all and "3_visualize/out/figure_1.png": the figure_1 plot needs to be created in order to complete all. The arrows communicate the connections (or "dependencies") between targets, and if a target doesn't have any arrows connected to it, it isn't depended on by any other target and it doesn't depend on any other targets. The two new .csv files are both examples of this, and in the image above they are floating around with no connections. Floater targets like these two won't be built by scmake() unless they are called directly (e.g., scmake("1_fetch/out/nwis_01427207_data.csv"); remember again, "skip the work you don't need").

The diagram also shows how the inputs of one function create connections to the output of that function. site_data is used to build site_data_clean (and is the only input to that function) and it is also used as an input to "1_fetch/out/site_info.csv", since the nwis_site_info() function needs to know what sites to get information from. These relationships result in a split in the dependency diagram where site_data is directly depended on by two other targets.

By modifying the recipe for the all target, it is possible to create a dependency link to one of the .csv files, which would then result in that file being included in a build as it becomes necessary in order to complete all:

targets:
  all:
    depends: ["3_visualize/out/figure_1.png", "1_fetch/out/nwis_01427207_data.csv"]

And after calling remake::diagram(), it is clear that now "1_fetch/out/nwis_01427207_data.csv" has found a home and is relevant to all!

[image: updated dependency diagram showing the .csv file connected to all]

With this update, the build of "1_fetch/out/nwis_01427207_data.csv" would no longer be skipped when scmake() is called.


:keyboard: comment on what you learned from exploring remake::diagram()


I'll sit patiently until you comment

jordansread commented 4 years ago

f

github-learning-lab[bot] commented 4 years ago

Using which_dirty() and why_dirty() to explore the status of pipeline targets

In the image contained within the previous comment, you may have noticed the faded fill color of the target shapes. That styling signifies that the targets are out of date ("dirty") or haven't been created yet.

We've put some fragile elements in the pipeline that will be addressed later, but if you were able to muscle through the failures with multiple calls to scmake(), you likely were able to build the figure near the end of the dependency diagram. For this example, we'll stop short of building the "3_visualize/out/figure_1.png" target by calling scmake('site_data_styled') instead to illustrate the concept of a dirty target.

Which targets are incomplete/dirty?

The updated remake::diagram() output looks like this:

[image: dependency diagram with completed targets shown in a darker fill]

Only the colors have changed from the last example, signifying that the darker targets are "complete" but that "3_visualize/out/figure_1.png" and the two nwis_*_data.csv files still don't exist.

The scipiper package has a useful function called which_dirty(), which will list the incomplete ("dirty") targets that need to be updated in order to satisfy the output (once again, the default for this function is to reference the all target).

which_dirty()
[1] "1_fetch/out/nwis_01427207_data.csv" "3_visualize/out/figure_1.png"       "all"                

This output tells us the same thing as the visual, namely that these three targets :point_up: are incomplete/dirty. No information is shared for the "1_fetch/out/nwis_01435000_data.csv" target, even though it is out of date (and would be considered "dirty" as well). This is because that target is not relevant to building all.


Why are these targets dirty?

Calling why_dirty() on a single target tells us a number of important things:

why_dirty("3_visualize/out/figure_1.png")
The target '3_visualize/out/figure_1.png' does not exist
# A tibble: 4 x 8
  type     name                         hash_old hash_new                         hash_mismatch dirty dirty_by_descent current
  <chr>    <chr>                        <chr>    <chr>                            <lgl>         <lgl> <lgl>            <lgl>  
1 target   3_visualize/out/figure_1.png none     none                             FALSE         TRUE  FALSE            FALSE  
2 depends  site_data_styled             NA       183e7990d33bbc76314aa48f04e58531 NA            FALSE FALSE            TRUE   
3 fixed    NA                           NA       96653adafc4622c2088c81ea947966af NA            FALSE FALSE            TRUE   
4 function plot_nwis_timeseries         NA       4592eea358bfd73d90bee824dda0e0c7 NA            FALSE FALSE            TRUE   

From this output, with a little help from the ?scipiper::why_dirty documentation, it is clear that "3_visualize/out/figure_1.png" is "dirty" chiefly because it has never been built: its hash_old and hash_new are both none, while its dependencies (site_data_styled, the fixed arguments, and the plot_nwis_timeseries() function) are all current.

A build of the figure with scipiper::scmake('3_visualize/out/figure_1.png') will update the target's dependencies, darken the fill color of "3_visualize/out/figure_1.png" in the remake::diagram() output, and cause a call to why_dirty("3_visualize/out/figure_1.png") to return a descriptive error letting the user know the target is not dirty.


The target will be out of date if there are any modifications to the upstream dependencies (follow the arrows in the diagram "upstream") or to the function plot_nwis_timeseries(). Additionally, a simple update to the value of one of the "fixed" arguments will cause the "3_visualize/out/figure_1.png" target to be "dirty". Here, the height argument was changed from 7 to 8:

why_dirty("3_visualize/out/figure_1.png")
Since the last build of the target '3_visualize/out/figure_1.png':
  * the fixed arguments (character, logical, or numeric) to the target's command have changed
# A tibble: 4 x 8
  type     name                         hash_old                         hash_new                         hash_mismatch dirty dirty_by_descent current
  <chr>    <chr>                        <chr>                            <chr>                            <lgl>         <lgl> <lgl>            <lgl>  
1 target   3_visualize/out/figure_1.png bfdee70b50f05636b06ad32ef3b11810 bfdee70b50f05636b06ad32ef3b11810 FALSE         TRUE  FALSE            FALSE  
2 depends  site_data_styled             183e7990d33bbc76314aa48f04e58531 183e7990d33bbc76314aa48f04e58531 FALSE         FALSE FALSE            TRUE   
3 fixed    NA                           96653adafc4622c2088c81ea947966af 82eb1fa6b001fe33b8af3c8a629421a8 TRUE          FALSE FALSE            TRUE   
4 function plot_nwis_timeseries         4592eea358bfd73d90bee824dda0e0c7 4592eea358bfd73d90bee824dda0e0c7 FALSE         FALSE FALSE            TRUE 

The "hash" of these three fixed input arguments was "96653adafc4622c2088c81ea947966af" and is now "82eb1fa6b001fe33b8af3c8a629421a8", resulting in a "hash_mismatch" and causing "3_visualize/out/figure_1.png" to be "dirty". You'll hear more about hashes in the future, but for now, think of a hash as a string that is unique for any unique data. In the case of fixed arguments, changing the argument names, values, or even the order they are specified will create a different hash value and cause the output target to be considered dirty.


:keyboard: using which_dirty() and why_dirty() can reveal unexpected connections between the target and the various dependencies. Comment on some of the different information you'd get from why_dirty() that wouldn't be available in the visual produced with remake::diagram().


I'll sit patiently until you comment

jordansread commented 4 years ago

f

github-learning-lab[bot] commented 4 years ago

What are cyclical dependencies and how do you avoid them?

Using remake::diagram() shows the dependency diagram of the pipeline. Look at the previous comments to remind yourself of these visuals.

As a reminder, the direction of the arrows captures the dependency flow, and site_data sits at the top, since it is the first target that needs to be built.

Also note that there are no backward-looking arrows. What if site_data relied on site_data_styled? In order to satisfy that relationship, an arrow would need to swing back up from site_data_styled and connect with site_data. Unfortunately, this creates a cyclical dependency, since changes to one target change the other target, and changes to that target feed back and change the original target...so on, and so forth...

This potentially infinite loop is confusing to think about, and it is also something that dependency managers can't support. We won't say much more about this issue here, but if you run into a cyclical dependency error in your early days of building pipelines, this is what's going on.
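As a contrived sketch (don't add this to your remakefile!), a cycle looks like this:

targets:
  site_data:
    command: download_nwis_data(site_data_styled)

  site_data_styled:
    command: style_data(site_data)

Each target names the other as an input, so neither can ever be built first; remake will refuse to build this and report a cyclical dependency error rather than looping forever.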


:keyboard: Add a comment when you are ready to move on.


I'll sit patiently until you comment

jordansread commented 4 years ago

f

github-learning-lab[bot] commented 4 years ago

Creating side-effect targets or undocumented inputs

Moving into a pipeline-way-of-thinking can reveal some surprising habits you created when working under a different paradigm. Moving the work of scripts into functions helps compartmentalize thinking and organize data and code relationships, but smart pipelines require even more attention to how functions are designed.

side-effect targets

It is tempting to build functions that do several things: perhaps a plotting function also writes a table, or a data munging function returns a data.frame but also writes a log file. You may have noticed that there is no easy way to specify two outputs from a single function in a pipeline recipe. We can have multiple files as inputs to a function/command, but only one output/target. If a function writes a file that is not explicitly connected in the recipe (i.e., it is a "side-effect" output), the file is untracked by the dependency manager and treated like an implicit file target (i.e., a target which has no "command" to specify how it is built; future issues will cover more details on this type of target). If the side-effect file is relied upon by a later target, changes to the side-effect target will indeed trigger a rebuild of the downstream target, but the dependency manager will have no way of knowing when the side-effect target itself should be rebuilt. :no_mobile_phones:

Maybe the above doesn't sound like a real issue, since the side-effect target would be updated every time the other explicit target it is paired with is rebuilt. But this becomes a scary problem (and our first real gotcha!) if the explicit target is not connected to the critical path of the final sets of targets you want to build, but the implicit side-effect target is. What this means is that even if the explicit target is out of date, it will not be rebuilt because building this target is unnecessary to completing the final targets (remember "skip the work you don't need" :arrow_right_hook:). The dependency manager doesn't know that there is a hidden rule for updating the side-effect target and that this update is necessary for assuring the final targets are up-to-date and correct. :twisted_rightwards_arrows:
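As a hypothetical sketch of the pattern to watch out for (the log file and the summarize_log() function are made up for illustration):

targets:
  site_data_clean:
    command: process_data(site_data)  # ALSO quietly writes 1_process/log/cleaning_log.csv

  log_summary:
    command: summarize_log('1_process/log/cleaning_log.csv')

If site_data_clean isn't otherwise needed to complete all, it can sit out of date indefinitely, leaving the log file (and everything downstream of log_summary) stale without warning.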

Side-effect targets can be used effectively, but doing so requires a good understanding of implications for tracking them and advanced strategies on how to specify rules and dependencies in a way that carries them along. :ballot_box_with_check:

undocumented inputs


Additionally, it is tempting to hard-code a filepath within a function when that file contains information the function needs in order to run. This seems harmless, since functions are tracked by the dependency manager and any changes to them will trigger rebuilds, right? Not quite. If a file like "1_fetch/in/my_metadata.csv" is read inside a function, but is not also an argument to the command in the remakefile recipe or listed as a depends of the resulting target, any changes to "1_fetch/in/my_metadata.csv" will go unnoticed by the dependency manager, since the string that specifies the file name remains unchanged. The system isn't smart enough to know that it needs to check whether that file has changed.

As a rule, unless you are purposefully trying to hide changes in a file from the dependency manager, do not read non-argument files in the body of a function. :end:
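Here's a sketch of the anti-pattern and its fix (the join column and the readr/dplyr calls are illustrative assumptions):

# Anti-pattern: the dependency manager cannot see this file
annotate_data <- function(site_data_clean) {
  metadata <- readr::read_csv('1_fetch/in/my_metadata.csv')  # hidden input!
  dplyr::left_join(site_data_clean, metadata, by = 'site_no')
}

# Better: expose the file as an argument and pass the path in the remakefile,
# so changes to the file can trigger a rebuild
annotate_data <- function(site_data_clean, metadata_file) {
  metadata <- readr::read_csv(metadata_file)
  dplyr::left_join(site_data_clean, metadata, by = 'site_no')
}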


:keyboard: Add a comment when you are ready to move on.


I'll sit patiently until you comment

jordansread commented 4 years ago

f

github-learning-lab[bot] commented 4 years ago

How (not to) depend on a directory for changes

You might have a project where there is a directory :file_folder: with a collection of files. To simplify the example, assume all of the files are .csv and have the same format. As part of the hypothetical project goals, these files need to be combined and formatted into a single plottable data.frame.

In a data pipeline, we'd want assurance that any time the number of files changes, we'd rebuild the resulting data.frame. Likewise, if at any point the contents of any one of the files changes, we'd also want to re-build the data.frame.

This hypothetical example could be coded as

sources:
  - combine_files.R

targets:
  all:
    depends: figure_1.png

  plot_data:
    command: combine_into_df('1_fetch/in/file1.csv','1_fetch/in/file2.csv','1_fetch/in/file3.csv')

  figure_1.png:
    command: my_plot(plot_data)

:point_up: This coding would work, as it tells the dependency manager which files to track for changes and where the files in the directory are. But this solution is less than ideal, both because it doesn't scale well to many files and because it doesn't adapt to new files coming into the 1_fetch/in directory :file_folder: (the pipeline coder needs to manually add each file name).


Alternatively, what about adding the directory as an input to the recipe, like this (you'd also need to modify your combine_into_df function to use dir(work_dir) to generate the file names)?

sources:
  - combine_files.R

targets:
  all:
    depends: figure_1.png

  plot_data:
    command: combine_into_df(work_dir = '1_fetch/in')

  figure_1.png:
    command: my_plot(plot_data)

After running scmake(), :point_up: this looks like it works, but it results in a rather cryptic warning message that might end up being ignored:

Warning messages:
1: In structure(.Call(C_Rmd5, files), names = files) :
  md5 failed on file '1_fetch/in'

An explanation of that warning message: in order to determine whether file contents have changed, remake uses md5sum() from the tools package to create a unique hash for each target. You can take a look at what this does:

tools::md5sum('3_visualize/out/figure_1.png')
      3_visualize/out/figure_1.png 
"500a71be79d1e45457cdb2c31a03be46" 

md5sum() fails when you point it toward a directory, since it doesn't know what to do with that kind of input.
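You can reproduce the failure directly in the console:

tools::md5sum('1_fetch/in')
# returns NA for the directory entry, with a warning much like the one above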


A third strategy might be to create a target that lists the contents of the directory, and then uses that list as an input:

sources:
  - combine_files.R

targets:
  all:
    depends: figure_1.png

  work_files:
    command: dir('1_fetch/in')

  plot_data:
    command: combine_into_df(work_files)

  figure_1.png:
    command: my_plot(plot_data)

:point_up: This approach is close, but it has a few flaws: 1) because '1_fetch/in' can't be tracked as a real target, changes within that directory aren't tracked (the same issue as the previous example), and 2) if the contents of the files change but the number and names of the files don't, the work_files target won't appear changed, and therefore the necessary rebuild of plot_data would be skipped (another example of "skip the work you don't need" in action).

Instead, we would advocate for a modified approach that combines the work of md5sum with the collection of files returned by dir. Instead of work_files being a vector of file names (in this case, c('1_fetch/in/file1.csv','1_fetch/in/file2.csv','1_fetch/in/file3.csv')), can we make it a data.frame of file names and their associated hashes?

# A tibble: 3 x 2
  filepath             hash                            
  <chr>                <chr>                           
1 1_fetch/in/file1.csv 530886e2b604d6d32df9ce2785c39672
2 1_fetch/in/file2.csv dc36b6dea28abc177b719015a18bccde
3 1_fetch/in/file3.csv c04f87fb9af74c4e2e2e77eed4ec80f3

Now, if combine_into_df can be modified to accept this type of input (i.e., simply grab the file names with $filepath), then we'll have an input that both reflects the total number of files in the directory and captures a reference hash that will reveal any future changes. Yay! :star2: This works because a change to any one of those hashes will result in a change in the hash of the modified work_files data.frame, which would then cause a rebuild of plot_data.
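A minimal sketch of a function that could build such a target (using the hash_dir_files() name that appears in the next topic; the tibble usage is an assumption):

hash_dir_files <- function(directory) {
  files <- dir(directory, full.names = TRUE)
  tibble::tibble(
    filepath = files,
    hash = unname(tools::md5sum(files))
  )
}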

But unfortunately, because work_files still relies on a directory and therefore can't be tracked, we're still missing a critical part of the solution. In the next topic, we'll reveal a more elegant way to trigger rebuilds, but for now, note that you'd need to manually run scmake('work_files', force = TRUE) to force scipiper to dig into that directory and hash all of the files. If the work_files result that comes back is the same as before, running scmake() won't build anything new, and we can be confident that figure_1.png is up to date.


:keyboard: Add a comment when you are ready to move on.


I'll sit patiently until you comment

jordansread commented 4 years ago

f

github-learning-lab[bot] commented 4 years ago

What to do when you want to specify a non-build-object input to a function?

Wow, we've gotten this far and haven't written a function that accepts anything other than an object target or a file target. I feel so constrained!

In reality, R functions have all kinds of other arguments, from logicals (TRUE/FALSE), to characters that specify which configurations to use.

The example in your working pipeline creates a figure, called 3_visualize/out/figure_1.png. Unless you've made a lot of modifications to the plot_nwis_timeseries() function, it has a few arguments that have default values, namely width = 12, height = 7, and units = 'in'. Nice, you can control your output plot size here!

But adding those to the remakefile like so


  3_visualize/out/figure_1.png:
    command: plot_nwis_timeseries(target_name, site_data_styled, width = 12, height = 7, units = 'in')

causes an immediate error

scmake()
 Error in .remake_add_targets_implied(obj) : 
  Implicitly created targets must all be files:
 - in: (in 3_visualize/out/figure_1.png) -- did you mean: site_data, 1_fetch/out/site_info.csv, 

Since 'in' is not an object target with a recipe in the pipeline, remake tries to find the file corresponding to "in": anything that isn't a number or a known object is assumed to be a file.

We know "in" is not a file, it is instead a simple argument we want to expose in the recipe, so we can make modification. To do this, wrapping the argument in the I() function tells remake to "treat this argument 'as is'", meaning don't try to infer anything fancy, just pass it along to the function.


  3_visualize/out/figure_1.png:
    command: plot_nwis_timeseries(target_name, site_data_styled, width = 12, height = 7, units = I('in'))

works! :star2:


Going back to our previous example, where we wanted to build a data.frame of file hashes from the '1_fetch/in' directory, I() comes in handy in two spots: 1) we can avoid the md5sum warning by wrapping the directory path in I(), which means '1_fetch/in' is used as a plain string instead of being treated as a file target, and 2) we can add a new argument that lets us force a refresh of the target and records when we last did so.

sources:
  - combine_files.R

targets:
  all:
    depends: figure_1.png

  work_files:
    command: hash_dir_files(directory = I('1_fetch/in'), dummy = I('2020-05-18'))

By adding this dummy argument to our own custom hash_dir_files() function, we can modify the dummy contents any time we want to force an update of work_files. Making a change to the text in the dummy argument has the same outcome as scmake('work_files', force = TRUE), but this way we keep an easy-to-read record of when we last manually refreshed the pipeline's view of what exists within '1_fetch/in'. You may see these dummy arguments in spots where there is no other trigger that would cause a rebuild, such as these directory challenges, or when pulling data from a remote web service or website (scipiper has no way of knowing that new data are available at the same service URL).
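For example, when you want to re-scan the directory at some later date, you'd just bump the string (the date shown is arbitrary; any change to the text works):

  work_files:
    command: hash_dir_files(directory = I('1_fetch/in'), dummy = I('2020-06-18'))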


:keyboard: Add a comment when you are ready to move on.


I'll sit patiently until you comment

jordansread commented 4 years ago

asdf

github-learning-lab[bot] commented 4 years ago

The target_name special variable

Simplifying target:command relationships and reducing duplication

In your repo, there is a remakefile that specifies relationships between targets, commands, and other targets.

For one of the targets, you'll see this:

  1_fetch/out/site_info.csv:
    command: nwis_site_info(fileout = '1_fetch/out/site_info.csv', site_data)

It seems odd to type the name of the target twice. First, it is used to declare the file target path as 1_fetch/out/site_info.csv. Then it is used a second time because nwis_site_info needs to know the name of the file to write the downloaded data to. Doesn't this seem duplicative and potentially dangerous (what if you make a typo in the fileout argument and it writes to a different file)?

The target_name variable is a useful way to avoid the duplication and danger of repeating the name of the target as one of the inputs. We can instead write the target recipe as

  1_fetch/out/site_info.csv:
    command: nwis_site_info(fileout = target_name, site_data)

Here, target_name will tell scipiper to use "1_fetch/out/site_info.csv" for the fileout argument in the nwis_site_info function when building the target. Using target_name not only saves us some time and makes the construction of pipelines less error-prone, but it also helps us create patterns that can generate multiple targets in a simpler fashion.

Imagine part of the target name is used to specify something within the function call, such as a site number.

  1_fetch/out/nwis_01427207_data.csv:
    command: download_nwis_site_data(target_name)

  1_fetch/out/nwis_01432160_data.csv:
    command: download_nwis_site_data(target_name)

  1_fetch/out/nwis_01436690_data.csv:
    command: download_nwis_site_data(target_name) 

The download_nwis_site_data function uses a regular expression to extract the site number from the input file name in order to make a web service request for data. Here, even though it seems goofy to create so many download targets, we reduce the places where mistakes can be made by using target_name instead of repeating the file name.
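A sketch of how that extraction might look inside the function (the regular expression pattern and the download/write details are illustrative assumptions):

download_nwis_site_data <- function(fileout) {
  # pull the site number out of a name like '1_fetch/out/nwis_01427207_data.csv'
  site_no <- gsub('^.*nwis_([0-9]+)_data\\.csv$', '\\1', fileout)
  data <- dataRetrieval::readNWISdv(siteNumbers = site_no, parameterCd = '00060')
  readr::write_csv(data, fileout)
}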


:keyboard: Add a comment when you are ready to move on.


I'll sit patiently until you comment

jordansread commented 4 years ago

asdf

github-learning-lab[bot] commented 4 years ago


When you are done poking around, check out the next issue.