ropensci / drake

An R-focused pipeline toolkit for reproducibility and high-performance computing
https://docs.ropensci.org/drake
GNU General Public License v3.0
1.34k stars 128 forks

a separate package for drake's job scheduling #285

Closed wlandau closed 6 years ago

wlandau commented 6 years ago

Some users have requested the option to have drake act like an ordinary job scheduler without worrying about reproducibility. A separate package would also be a great place to apply ideas from the knapsack problem to group jobs into workers.
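To make the idea of grouping jobs into workers concrete, here is a minimal sketch in base R. It is not drake's scheduler and not an exact knapsack solver: it uses a greedy "longest job first" heuristic for the related load-balancing problem, and the function name `assign_jobs()`, the job names, and the runtime estimates are all hypothetical.

```r
# Greedy "longest job first" grouping of jobs into workers.
# assign_jobs(), the job names, and the runtimes are hypothetical.
assign_jobs <- function(runtimes, n_workers) {
  stopifnot(n_workers >= 1, length(runtimes) >= 1, !is.null(names(runtimes)))
  loads <- numeric(n_workers)          # total estimated runtime per worker
  groups <- vector("list", n_workers)  # job names assigned to each worker
  # Visit jobs from longest to shortest estimated runtime.
  for (job in names(sort(runtimes, decreasing = TRUE))) {
    w <- which.min(loads)              # least-loaded worker so far
    groups[[w]] <- c(groups[[w]], job)
    loads[w] <- loads[w] + runtimes[[job]]
  }
  list(groups = groups, loads = loads)
}

# Five jobs with estimated runtimes (arbitrary units), split over two workers.
runtimes <- c(a = 8, b = 7, c = 6, d = 5, e = 4)
plan <- assign_jobs(runtimes, n_workers = 2)
```

The heuristic is fast but not optimal in general, and it ignores dependencies between jobs, which a real dependency-aware scheduler would need to respect.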

EDIT: 2018-03-03

I have started a private repo for a package called crew (Coordinated R Ensembles of Workers). See the comments later in the thread for details. I am really excited to work on this.

wlandau commented 6 years ago

Because of #289, I just started a package called rsched. Since it is a new repo, the code needs to stay closed-source until there is at least a minimal proof of concept, but I will open-source it ASAP. For now, we can write a design spec somewhere in the drake repo.

wlandau commented 6 years ago

I plan to create a bookdown document in a separate drake branch for the design specification of a drake scheduler. Will post updates on this thread.

wlandau commented 6 years ago

As discussed in #283, we should search the literature for good scheduling designs and algorithms. For the new package (maybe named rsched), I have a stub of a design spec at https://github.com/ropensci/drake/tree/scheduler. We should plan ahead.

wlandau commented 6 years ago

I have a private repo for a package called crew (Coordinated R Ensembles of Workers), and I am really excited to share this preliminary work. A proof of concept for persistent workers is fully fleshed out, but not actually working yet. I will open-source it ASAP. The main bottleneck is fixing the existing functionality: the master process (launched with callr::r_bg()) currently hangs instead of posting jobs for the workers, and I am struggling to fix it.

wlandau commented 6 years ago

Edit: changing the name to workers since it is actually available. Fixed the issues with callr and deadlock. Will open-source it as soon as I get permission.

krlmlr commented 6 years ago

The name crew seems to be available on CRAN, and I think it's a lovely name:

available::available("crew")
#> ── crew ─────────────────────────────────────────────────────────────────────────────────────────────────────────
#> Name valid: ✔
#> Available on CRAN: ✔ 
#> Available on Bioconductor: ✔
#> Available on GitHub:  ✔ 
#> Bad Words: ✔
#> Abbreviations: http://www.abbreviations.com/crew
#> Wikipedia: https://en.wikipedia.org/wiki/crew
#> Wiktionary: https://en.wiktionary.org/wiki/crew
#> Urban Dictionary:
#>   the sport of gods, requires constant physical exertion, perfect  poise, balance, timing, awareness, brute force, and a sensitive  touch.
#>   Tags: gang rowing group posse crews friends coxswain homies clique krew
#>   http://crew.urbanup.com/896326
#> Sentiment:???

Created on 2018-03-04 by the reprex package (v0.2.0).

wlandau commented 6 years ago

Thanks, Kirill! But on second thought, I think the name "workers" is better.

wlandau commented 6 years ago

FYI: the workers package is now out in the open: https://github.com/wlandau/workers. The current code is just a proof of concept. We should write a full design spec before any more serious work on the implementation.

wlandau commented 6 years ago

FYI: I just drafted an initial design spec for workers. I think it will help us either (1) develop the package, or (2) figure out if we should abandon it in favor of a solution already in progress. @krlmlr, you mentioned that @gaborcsardi and @lionel- might already be working on the problem. Maybe @HenrikBengtsson also has plans, I do not know.

wlandau commented 6 years ago

I think I see a way to move forward with the workers package: the custom message queue in #408. I plan to externalize this minimalist queue as a separate package and then build on top of it. Whether we offload drake functionality to these packages depends on how well they mature.
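For readers unfamiliar with the idea, a minimalist message queue can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the queue from #408: the helpers `queue_new()`, `queue_push()`, and `queue_pop()` are hypothetical, messages are assumed to be single lines of text, and there is no file locking, which a real multi-process queue would need.

```r
# Minimal file-backed message queue (hypothetical helpers, base R only).
# Assumes single-line messages and a single process at a time (no locking).
queue_new <- function(path = tempfile("queue_")) {
  dir.create(path)
  file.create(file.path(path, "messages"))   # append-only log of messages
  writeLines("0", file.path(path, "head"))   # count of messages already popped
  path
}

queue_push <- function(queue, messages) {
  # Terminate every message with a newline so later appends start on a new line.
  cat(paste0(messages, "\n"), file = file.path(queue, "messages"),
      sep = "", append = TRUE)
  invisible(queue)
}

queue_pop <- function(queue, n = 1) {
  head_file <- file.path(queue, "head")
  done <- as.integer(readLines(head_file))
  msgs <- readLines(file.path(queue, "messages"))
  pending <- msgs[seq_along(msgs) > done]    # messages not yet popped
  out <- head(pending, n)
  writeLines(as.character(done + length(out)), head_file)
  out
}

q <- queue_new()
queue_push(q, c("build target_a", "build target_b"))
queue_push(q, "build target_c")
first <- queue_pop(q)        # oldest message
rest <- queue_pop(q, n = 5)  # whatever is left, at most 5 messages
```

Keeping the queue on disk means any process that can see the directory can read or post messages, which is the property a master and its workers would rely on.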

drake has a ridiculous amount of code, but it also depends on a ridiculous number of packages. Decisions about offloading could shift the scales, but drake will still be an enormous package either way. My opinions about the relevant tradeoffs are not as strong as they once were.

wlandau commented 6 years ago

I'm having new doubts about this one because of the special precautions drake needs to take to account for the latency of sending newly built targets over a network. Whenever a remote persistent worker finishes a target, it sends a checksum to the master so the master can wait for the right data to arrive. This is only practical because drake already hashes all the targets. For a general-purpose dependency-aware job scheduler for R, imposing this hashing for its own sake may be unreliable and add too much delay. Plus, as I learn how thoroughly this problem is already solved in tools like dask, I am starting to think the next step is to try writing an R front-end to an established tool (which could in turn become another drake HPC backend). Ref: #417.
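The checksum handshake described above can be sketched as follows. This is an illustration only, not drake's code: `wait_for_checksum()` is a hypothetical helper, MD5 via `tools::md5sum()` stands in for whatever hash drake actually uses, and a local temporary file stands in for a target arriving over a network file system.

```r
# Hypothetical helper: poll until `path` exists and its MD5 matches `expected`,
# or give up after `timeout` seconds. MD5 is a stand-in for drake's own hashes.
wait_for_checksum <- function(path, expected, timeout = 10, interval = 0.1) {
  deadline <- Sys.time() + timeout
  repeat {
    if (file.exists(path) &&
        identical(unname(tools::md5sum(path)), expected)) {
      return(TRUE)    # the right data has arrived
    }
    if (Sys.time() > deadline) {
      return(FALSE)   # still missing or stale after the timeout
    }
    Sys.sleep(interval)
  }
}

# Worker side: write the target, then report its checksum to the master.
target <- tempfile(fileext = ".rds")
saveRDS(mtcars, target)
checksum <- unname(tools::md5sum(target))

# Master side: wait until the file on disk matches the reported checksum.
arrived <- wait_for_checksum(target, checksum)
stale <- wait_for_checksum(target, "not-the-right-hash", timeout = 0.3)
```

The cost mentioned in the comment is visible here: every target has to be hashed on the worker and re-hashed on each poll by the master, which is wasted work unless the hashes are needed anyway.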