ropensci / targets

Function-oriented Make-like declarative workflows for R
https://docs.ropensci.org/targets/

Debugging: possible to backtrack through multiple branched targets? #201

Closed mike-lawrence closed 4 years ago

mike-lawrence commented 4 years ago

Prework

Question

I have a pipeline of dynamic targets and a branch of one is throwing an error. Using the amazing tar_workspace(), I discovered that the data for that branch is not what I expected from the prior processing steps, so I need to work backwards through that branch's progression through the pipeline to see which processing step induced the corruption. On inspecting the tar_meta() tibble I guessed that I could use the depend and data columns for this backtracking:

#get the depend value of the first error
depend = 
    (
        tar_meta() 
        #filter to errors:
        %>% dplyr::filter(!is.na(error))
        #grab the first row:
        %>% dplyr::slice(1)
        #grab the value in the depend column:
        %>% dplyr::pull(depend)
    )

#look for matches in the data column
(
    tar_meta()
    %>% dplyr::filter(
        data == depend
    )
)

But that yields no matches, so my guess was wrong. Is this kind of backtracking possible and I'm missing something?

wlandau commented 4 years ago

depend is an overall dependency hash, and it does not help identify the names of individual upstream neighbors. I suggest running the pipeline with tar_option_set(error = "continue") so the metadata populates completely. Then, you can look for clues in the children column of tar_meta(). The branch names should give you an idea of branch order in patterns and allow you to backtrack.
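A minimal sketch of that lookup (targ2 is a placeholder pattern name; this assumes the pipeline was already run with tar_option_set(error = "continue") so all branches have metadata):

```r
library(targets)

meta = tar_meta()

# children is a list column: each pattern's row holds a character
# vector of its branch names, in branch order.
(
    meta
    %>% dplyr::filter(name == "targ2")
    %>% dplyr::pull(children)
)[[1]]
```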

mike-lawrence commented 4 years ago

Ah, ok, so for a target that has branches, I see that the value in the children column is a vector of character strings, each presumably naming a child branch, with an order that is consistent from target to target? So if a branch of targ2 has an error, all I need to do is look up its index in its parent's children vector, then go to the prior target and get the child at that index?

wlandau commented 4 years ago

The order should be consistent as long as the targets are up to date, or at least synchronized.

mike-lawrence commented 4 years ago

For posterity, here's code to work out the index of the first bad child:

#retrieve the metadata
tm = tar_meta()

#find the first bad child
bad_child =
    (
        tm
        # filter to errors:
        %>% dplyr::filter(!is.na(error))
        # grab the first row:
        %>% dplyr::slice(1)
    )

#find the index of the bad child
bad_child_index =
    (
        tm
        # filter to the parent
        %>% dplyr::filter(
            name == bad_child$parent
        )
        # unnest the children column (creates one row per child)
        %>% tidyr::unnest(
            children
        )
        # get index
        %>% dplyr::summarise(
            index = which(children == bad_child$name)
        )
        # pull the index value
        %>% dplyr::pull(index)
    )

Next I would take a look at the code in _targets.R or the graphs from tar_glimpse() to remind myself what the preceding parent is, then get its list of children and finally index into that list to get the name of the predecessor branch to the bad branch. To this end, it would be helpful to be able to automate the "remind myself" step by instead having a column in the meta tibble that lists the dependencies for parent targets (ideally just the dependencies that are themselves targets, as with tar_glimpse()).
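For example, continuing with the tm and bad_child_index objects from the code above (and assuming, hypothetically, that tar_glimpse() shows targ1 as the pattern upstream of targ2):

```r
# Get targ1's children, which are in the same branch order:
prior_children =
    (
        tm
        %>% dplyr::filter(name == "targ1")
        %>% dplyr::pull(children)
    )[[1]]

# Name of the predecessor branch to the bad branch:
predecessor = prior_children[bad_child_index]

# Load its stored value for inspection (tar_read_raw() takes
# the branch name as a character string):
targets::tar_read_raw(predecessor)
```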

(I'd open a FR issue for this, but I see that FRs should first be posted in issues related to the trouble that inspired the feature, a policy that makes sense.)

wlandau commented 4 years ago

To this end, it would be helpful to be able to automate the "remind myself" step by instead having a column in the meta tibble that lists the dependencies for parent targets (ideally just the dependencies that are themselves targets, as with tar_glimpse()).

I considered it, but this would considerably increase the size of the metadata, especially relative to how seldom people would likely use it. I would prefer to limit the metadata to just the essentials.

mike-lawrence commented 3 years ago

Hm, this is harder to do than I realized. If I run with tar_option_set(error='save'), then I can get the workspace for a child that throws an error, but if I then decide that the source of the error is a prior target, the parent target doesn't have an entry in the meta tibble, so I have to re-run with tar_option_set(error='continue') just to populate meta with the parent target info (to then get the index of the child and work backwards). Any suggestions for doing this more efficiently? Would it make any sense to always have two parent entries in the meta tibble, one added prior to any children, and one at the end, with a column differentiating the two? Or maybe a "save_and_continue" error behaviour option? Or both? I might be wrong, but I think this would be a pretty common debugging task that would warrant the extra complexity.

Edit 1: Oh wait, if the prior target(s) completed successfully, there'd be no workspace to load to work out the root cause of the data corruption. So this whole process is not possible, right?

Edit 2: Sorry, it still makes sense. I forgot that the idea was to use the known index of the bad child and then (assuming earlier targets use list iteration) grab the successful output of earlier targets at that index.

wlandau commented 3 years ago

I hesitate to allow both "save" and "continue" because workspace files can get quite large, and several hundred of them could explode the data store. Childproofing is a major goal of targets relative to drake. And an extra write to the metadata could incur extra slowness and room for error.

The whole workspace/debugging feature set leaves room for improvement: for example, a workspace file could be much smaller. In fact, it could just be a bunch of references to files already in the data store, which would also allow un-serialize()-able objects like Keras models to be part of the workspace. But that would also require a lot more bookkeeping and create room for error. So I backed away from these rabbit holes and decided to stick with 20% of the work that gets us 80% of the functionality. Long-term, this policy will help keep the infrastructure of targets clean and reliable for years to come.

On reflection, it seems like the best course of action in your case is good old-fashioned defensive programming. If you find yourself backtracking a lot, you can return values from targets that carry some sort of indexing system. For example, instead of returning a fitted model object, you could return a broom::tidy() data frame with the index of the replicate as a column.
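A sketch of that pattern (fit_one and the rep argument are hypothetical names; any per-branch identifier passed through the pattern works):

```r
# Instead of returning the fitted model itself, return a tidy
# coefficient summary tagged with the replicate index, so every
# branch's output is self-identifying downstream:
fit_one = function(dat, rep) {
    fit = lm(y ~ x, data = dat)
    (
        broom::tidy(fit)
        %>% dplyr::mutate(rep = rep)
    )
}
```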

mike-lawrence commented 3 years ago

I also just realized that I should really just be using tar_option_set(error = 'continue') by default, then checking the metadata for the earliest bad children. I lose the use of tar_workspace(), but now that I know how to retrieve the output of the prior target for that branch, it's pretty easy to achieve the same end. Sorry for all my unnecessary comments here, and thanks for all your work!
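For posterity: once the branch index is known, retrieving the prior target's output for just that branch is a one-liner (targ1 is a placeholder upstream pattern name, and bad_child_index is computed as in my earlier comment):

```r
# Read only the branch at bad_child_index from the upstream
# pattern, rather than aggregating all branches:
targets::tar_read(targ1, branches = bad_child_index)
```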

wlandau commented 3 years ago

Glad to help. You might consider pairing that with interactive debugging with tar_option_set(debug = "your_branch_name").