Batch File Processing Workflow

ropensci / unconf18

http://unconf18.ropensci.org/


Batch File Processing Workflow #47

Open laderast opened 6 years ago

laderast commented 6 years ago

Hi Everyone,

I've been kicking this idea around for a little bit. Our group does a lot of batch processing of input files when we run our pipeline for flow cytometry data. Sometimes the output of a step will fail, and we have to flag the files that fail so they aren't passed through further steps in the pipeline.

When I do this currently, I basically build file manifests (file locations with relevant metadata) and run some sort of processing in R. I was thinking that by incorporating data assertions (like with assertr), we could have a workflow that shows which files pass a step and flags the files that fail it. In the end, we can show users of the pipeline which files passed, which didn't, and at which steps they failed.
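As a rough sketch of this idea, using base R only: the manifest below gets one status column per step, and a file that fails a step is skipped by every later step. The file names, the `n_events` column, and the helpers `run_step()`, `check_nonempty()`, and `check_min_events()` are all hypothetical, made up for illustration; a package such as assertr could replace the hand-written checks.

```r
# Hypothetical manifest: one row per input file plus metadata.
manifest <- data.frame(
  file = c("a.fcs", "b.fcs", "c.fcs"),
  n_events = c(1200, 0, 560),
  stringsAsFactors = FALSE
)

# Run one step over the files that are still passing.
# `step_fun` takes a one-row manifest and signals an error on failure.
run_step <- function(manifest, step_name, step_fun) {
  if (!"ok" %in% names(manifest)) manifest$ok <- TRUE
  status <- rep(NA_character_, nrow(manifest))  # NA = step was skipped
  for (i in which(manifest$ok)) {
    status[i] <- tryCatch({
      step_fun(manifest[i, ])
      "pass"
    }, error = function(e) paste("fail:", conditionMessage(e)))
  }
  manifest[[step_name]] <- status
  manifest$ok <- manifest$ok & !is.na(status) & status == "pass"
  manifest
}

# Hypothetical checks standing in for real processing steps.
check_nonempty <- function(row) {
  if (row$n_events <= 0) stop("no events in file")
}
check_min_events <- function(row) {
  if (row$n_events < 1000) stop("fewer than 1000 events")
}

manifest <- run_step(manifest, "nonempty", check_nonempty)
manifest <- run_step(manifest, "min_events", check_min_events)
print(manifest)
```

The final manifest then doubles as the report: the `ok` column says which files made it through, and the per-step columns say where the others stopped and why.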

Maybe there's a little germ of an idea here that might work for the unconf. I'm not sure, so I'm putting it out there.

karthik commented 6 years ago

Drake can do a lot of what you are asking for (cc @wlandau-lilly). vis_drake_graph on your Drake plan should show you steps that failed (in red). Fixing those should only re-run those steps and not everything from scratch. You might take a look at https://github.com/ropensci/drake and see if it's something that matches your needs.

wlandau commented 6 years ago

+1 to that! Suppose we're working with this data analysis workflow and one of our functions does not work.

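library(ggplot2)

# Build the histogram for the `hist` target.
# BAD_LAYER is deliberately undefined, so this function fails when called.
create_plot <- function(data) {
  ggplot(data, aes(x = Petal.Width, fill = Species)) +
    geom_histogram(binwidth = 0.25) +
    theme_gray(20) +
    BAD_LAYER
}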

We run the workflow and see the failure.

> make(plan)
target raw_data
target data
target fit
target hist
fail hist
Error: Target `hist` failed. Call `diagnose(hist)` for details. Error message:
  object 'BAD_LAYER' not found

Diagnostics include warnings, errors, messages, and other context.

> diagnose(hist)
$target
[1] "hist"

$messages
NULL

$error
<simpleError in create_plot(data): object 'BAD_LAYER' not found>

You can list the failed targets programmatically.

> failed()
[1] "hist"

As Karthik mentioned, these failures are shown in the dependency graph.

config <- drake_config(plan)
vis_drake_graph(config, targets_only = TRUE, full_legend = FALSE)


laderast commented 6 years ago

Ah, very cool. I didn't know about Drake!

maurolepore commented 6 years ago

@laderast, would purrr::safely() and purrr::possibly() help you?

laderast commented 6 years ago

Thanks @maurolepore - purrr::safely() is a nice approach.
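For reference, here is a minimal sketch of how `purrr::safely()` and `purrr::possibly()` could flag failing files in a batch. The `process_file()` function and the named `inputs` vector are hypothetical stand-ins for a real processing step and a real file manifest.

```r
library(purrr)

# Hypothetical per-file step: fails on files with no events.
process_file <- function(n_events) {
  if (n_events <= 0) stop("no events in file")
  log10(n_events)
}

# Hypothetical inputs, named by file.
inputs <- c(a.fcs = 1000, b.fcs = 0, c.fcs = 100)

# safely() wraps the function so it returns list(result = , error = )
# instead of stopping the whole batch on the first failure.
safe_process <- safely(process_file)
out <- map(inputs, safe_process)

# Files whose error slot is non-NULL failed this step.
failed_files <- names(keep(out, ~ !is.null(.x$error)))
print(failed_files)

# possibly() substitutes a default value and discards the error.
values <- map_dbl(inputs, possibly(process_file, otherwise = NA_real_))
print(values)
```

`safely()` keeps the error object, which is useful for reporting why a file failed; `possibly()` is simpler when a placeholder value such as `NA` is enough.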