tudo-r / BatchJobs

BatchJobs: Batch computing with R

Graph of dependent jobs? #4

Open mllg opened 11 years ago

mllg commented 11 years ago

SRC: https://code.google.com/p/batchjobs/issues/detail?id=19

For some experiments it might be useful to be able to specify a graph of dependent jobs, similar to how targets are defined in a Makefile.

This means that before some jobs can start, others have to be fully completed. The solution is probably a simple topological sort with respect to these preconditions.
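The topological sort mentioned above could be sketched roughly as follows. This is a minimal illustration, not part of the BatchJobs API; the function and job names are hypothetical.

```r
# Hypothetical sketch: topologically sort a job-dependency graph.
# 'deps' maps each job name to the names of the jobs it depends on.
topo_sort <- function(deps) {
  sorted <- character(0)
  remaining <- deps
  while (length(remaining) > 0) {
    # a job is ready once all of its preconditions are already sorted
    ready <- names(remaining)[
      vapply(remaining, function(d) all(d %in% sorted), logical(1))]
    if (length(ready) == 0) stop("cycle detected in dependency graph")
    sorted <- c(sorted, ready)
    remaining <- remaining[!(names(remaining) %in% ready)]
  }
  sorted
}

deps <- list(fit = "clean", clean = character(0), report = c("fit", "clean"))
topo_sort(deps)
# "clean" "fit" "report"
```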

But I want to collect more use cases, before we look into this again.

seandavi commented 9 years ago

Is this still on the radar? I'm currently using the Python tool Snakemake for job submission and dependency management. There are now many such workflow systems available, but none in R that I know of.

ramanshah commented 9 years ago

I second this - I am working on bringing parallelism to the dscr project (https://github.com/stephens999/dscr). I hope to use BatchJobs to abstract away the serial/multicore/cluster contexts. In our dscr workflows, we cache objects at many stages, the costly parts of the computations can vary, and intermediate objects can often be re-used, so dependency management of some sort is looking crucial.

The cluster engines I've investigated (TORQUE, SGE, SLURM) all appear to allow the user to specify dependencies based on completion of previous jobs (specified by the scheduler's job ID). My hope is for BatchJobs to be able to receive dependencies from the user, encode them in the registry, and emit the appropriate dependency clauses to the cluster system.
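Emitting such a dependency clause could look something like the sketch below. The flag syntax follows common SLURM/TORQUE usage for `afterok` dependencies; the function itself is hypothetical and not part of BatchJobs.

```r
# Hypothetical sketch: translate stored parent job IDs into the
# dependency directive a scheduler expects at submission time.
dependency_clause <- function(parent_ids, scheduler = c("slurm", "torque")) {
  scheduler <- match.arg(scheduler)
  ids <- paste(parent_ids, collapse = ":")
  switch(scheduler,
    slurm  = sprintf("--dependency=afterok:%s", ids),  # passed to sbatch
    torque = sprintf("-W depend=afterok:%s", ids))     # passed to qsub
}

dependency_clause(c(1001, 1002), "slurm")
# "--dependency=afterok:1001:1002"
```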

If dependency management is considered to fit into the overall goals of BatchJobs, I'd be happy to look deeper into the implementation and work on a pull request.

seandavi commented 9 years ago

Leaving dependency management to the scheduler has some disadvantages, including the inability to test for error conditions on exit of dependent jobs.

ramanshah commented 9 years ago

Interesting point - is there a good alternative?

seandavi commented 9 years ago

Managing the dependencies in R is much more flexible. The first pass is to simply make a graph of the job dependencies and then track completed jobs. A second step might include hooks to check for the appropriate completion of jobs. A third might include automated dependency checking to determine if a job needs to be re-run (based on inputs changing, etc.).
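The first pass described above, tracking completed jobs against the graph, could be as simple as this sketch. Names are illustrative and not the BatchJobs API.

```r
# Hypothetical sketch: given a dependency graph and the set of jobs
# already finished, find the jobs that can be submitted now.
runnable_jobs <- function(deps, done) {
  pending <- setdiff(names(deps), done)
  pending[vapply(deps[pending], function(d) all(d %in% done), logical(1))]
}

deps <- list(clean = character(0), fit = "clean", report = c("fit", "clean"))
runnable_jobs(deps, done = "clean")
# "fit"
```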


krlmlr commented 8 years ago

:+1:

@seandavi: At least LSF (which I'm primarily interested in) can define the dependency conditional on exit status. Isn't this true for other schedulers?

If the scheduler knows about dependencies, this is by far the easiest approach. The workflow you suggested -- reimplementing this in R -- sounds a bit like reimplementing make or SCons or whatnot. It's more flexible, for sure, but also much more tedious and error-prone. Also, if we do our own scheduling, we need a constantly running process that busy-waits in order to schedule runnable jobs.

Checking if a job needs to be re-run can be done as part of the job itself:

if (digest::digest(input) == digest::digest(last_good_input)) quit(save = "no", status = 0)

@ramanshah: Do you have any updates?

krlmlr commented 8 years ago

@mllg: My use case is a web of data pipelines: Each stage processes data and creates artifacts, some of which are processed in subsequent stages. Currently I'm using make (with an autogenerated Makefile), but BatchJobs scales so much better :-)

@seandavi: Of course, for the "multicore" schedulers we'll need our own dependency handling. Which, again, could happen with an autogenerated Makefile.
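The autogenerated-Makefile idea could be sketched like this: emit one rule per job so that `make -j` handles the dependency ordering on a multicore machine. Job names and commands are hypothetical.

```r
# Hypothetical sketch: generate Makefile rules from a dependency graph.
# 'deps' maps job -> prerequisite jobs, 'cmds' maps job -> shell command.
make_rules <- function(deps, cmds) {
  rules <- vapply(names(deps), function(job) {
    sprintf("%s: %s\n\t%s", job,
            paste(deps[[job]], collapse = " "), cmds[[job]])
  }, character(1), USE.NAMES = FALSE)
  # 'all' target depends on every job, so 'make -j' runs the whole graph
  c(sprintf("all: %s", paste(names(deps), collapse = " ")), rules)
}

deps <- list(clean = character(0), fit = "clean")
cmds <- list(clean = "Rscript clean.R", fit = "Rscript fit.R")
writeLines(make_rules(deps, cmds))
```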

ramanshah commented 8 years ago

@krlmlr I left the position where I was working on this problem as part of my day job, so there likely won't be substantial news from me anymore. The group seems interested in building the benchmarking framework on top of a different foundation, possibly Snakemake or an implementation of the Common Workflow Language. Basically, the project involves executing a highly heterogeneous multi-step dependency graph, which is not really the kind of problem that BatchJobs excels at, so we started going in a different direction as of last fall. But @road2stat, who is in the driver's seat for this project now, may have other thoughts.

seandavi commented 8 years ago

@ramanshah, what I am interested in is what you describe. There are many frameworks for doing this kind of thing:

https://github.com/pditommaso/awesome-pipeline

It would be great to do something in R related to common-workflow-language. I'd definitely be interested in working with you and @road2stat on this.