Open mllg opened 11 years ago
Is this still on the radar? I'm currently using the Python tool Snakemake for job submission and dependency management. There are now many such workflow systems available, but none in R that I know of.
I second this - I am working on bringing parallelism to the dscr project (https://github.com/stephens999/dscr). I hope to use BatchJobs to abstract away the serial/multicore/cluster contexts. In our dscr workflows, we cache objects at many stages, the costly parts of the computations can vary, and intermediate objects can often be re-used, so dependency management of some sort is looking crucial.
The cluster engines I've investigated (TORQUE, SGE, SLURM) all appear to allow the user to specify dependencies based on completion of previous jobs (specified by the scheduler's job ID). My hope is for BatchJobs to be able to receive dependencies from the user, encode them in the registry, and emit the appropriate dependency clauses to the cluster system.
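As a hedged illustration of what "emitting the appropriate dependency clauses" might look like for SLURM, driven from R; the script names step1.sh and step2.sh are made up, and this is not an existing BatchJobs feature:

# Submit step1 and capture its job ID (--parsable prints only the ID)
jid <- system2("sbatch", c("--parsable", "step1.sh"), stdout = TRUE)
# step2 starts only after step1 has finished successfully (afterok)
system2("sbatch", c(paste0("--dependency=afterok:", jid), "step2.sh"))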
If dependency management is considered to fit into the overall goals of BatchJobs, I'd be happy to look deeper into the implementation and work on a pull request.
Leaving dependency management to the scheduler has some disadvantages, including the inability to test for error conditions on exit of dependent jobs.
Interesting point - is there a good alternative?
Managing the dependencies in R is much more flexible. The first pass is to simply make a graph of the job dependencies and then track completed jobs. A second step might include hooks to check for the appropriate completion of jobs. A third might include automated dependency checking to determine if a job needs to be re-run (based on inputs changing, etc.). A sketch of the first pass follows below.
Sean
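A rough sketch of that first pass — a dependency graph plus topological sorting — assuming nothing about the BatchJobs API; the job names are invented:

# each job maps to the jobs it depends on
deps <- list(fit = character(0), predict = "fit", report = c("fit", "predict"))

topo_sort <- function(deps) {
  sorted <- character(0)
  remaining <- deps
  while (length(remaining) > 0) {
    # a job is runnable once all of its prerequisites have been sorted
    runnable <- names(remaining)[
      vapply(remaining, function(d) all(d %in% sorted), logical(1))
    ]
    if (length(runnable) == 0) stop("cycle detected in dependency graph")
    sorted <- c(sorted, runnable)
    remaining <- remaining[setdiff(names(remaining), runnable)]
  }
  sorted
}

topo_sort(deps)  # "fit" "predict" "report"

Tracking completed jobs then amounts to submitting each job once everything earlier in this order has finished.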
:+1:
@seandavi: At least LSF (which I'm primarily interested in) can make a dependency conditional on the exit status. Isn't this true for other schedulers?
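For reference, a hedged sketch of how LSF's exit-status-aware dependencies could be driven from R; the job and script names are invented:

# run step2 only if step1 completed successfully; LSF's -w expressions
# also offer exit() to trigger on failure and ended() to trigger regardless
system2("bsub", c("-J", "step1", "Rscript", "step1.R"))
system2("bsub", c("-w", shQuote("done(step1)"), "Rscript", "step2.R"))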
If the scheduler knows about dependencies, this is by far the easiest approach. The workflow you suggested -- reimplementing this in R -- sounds a bit like reimplementing make or SCons or whatnot. It's more flexible, for sure, but also much more tedious and error-prone. Also, if we do our own scheduling, this requires a process that is running constantly and uses "busy waiting" to be able to schedule runnable jobs.
Checking if a job needs to be re-run can be done as part of the job itself:
# skip the computation if the input hash matches the last successful run
if (digest::digest(input) == digest::digest(last_good_input)) quit(save = "no", status = 0)
@ramanshah: Do you have any updates?
@mllg: My use case is a web of data pipelines: each stage processes data and creates artifacts, some of which are processed in subsequent stages. Currently I'm using make (with an autogenerated Makefile), but BatchJobs scales so much better :-)
@seandavi: Of course, for the "multicore" schedulers we'll need our own dependency handling, which, again, could happen with an autogenerated Makefile.
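A hedged sketch of that autogeneration, reusing the dependency list from the earlier example; run_job.R is a made-up worker script, and each rule touches a sentinel file so make can track completion:

deps <- list(fit = character(0), predict = "fit", report = c("fit", "predict"))
rules <- vapply(names(deps), function(job) {
  sprintf("%s: %s\n\tRscript run_job.R %s && touch %s",
          job, paste(deps[[job]], collapse = " "), job, job)
}, character(1))
writeLines(c("all: report", "", rules), "Makefile")
# `make -j 4` then runs independent jobs on up to four cores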
@krlmlr I left the position where I was working on this problem as part of my day job, so there won't likely be substantial news from me anymore. The group seems to have interest in building the benchmarking framework on top of a different foundation, possibly Snakemake or an implementation of the Common Workflow Language. Basically, the project involves executing a highly heterogeneous multi-step dependency graph, which is not really the kind of problem that BatchJobs excels at, so we started going in a different direction as of last fall. But @road2stat, who is in the driver's seat for this project now, may have other thoughts.
@ramanshah, what I am interested in is what you describe. There are many frameworks for doing this kind of thing:
https://github.com/pditommaso/awesome-pipeline
It would be great to do something in R related to the Common Workflow Language. I'd definitely be interested in working with you and @road2stat on this.
SRC: https://code.google.com/p/batchjobs/issues/detail?id=19
For some experiments it MIGHT be useful to be able to specify a graph of dependent jobs, similar to how targets are defined in a Makefile. This means that, for some jobs to start, the results of others have to be fully completed. The solution for this is probably simple topological sorting with respect to the preconditions. But I want to collect more use cases before we look into this again.
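To make the use case concrete, one purely hypothetical interface sketch — submitJobsWithDeps() does not exist in BatchJobs, and run_stage is an invented function:

library(BatchJobs)
reg <- makeRegistry(id = "deps_demo")
ids <- batchMap(reg, run_stage, stage = c("fit", "predict", "report"))
# start job i only after all jobs in deps[[i]] have completed successfully;
# the backend would either forward this to the scheduler as dependency
# clauses or resolve it itself via topological sorting
submitJobsWithDeps(reg, ids, deps = list(integer(0), 1L, c(1L, 2L)))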