statgen / pheweb

A tool to build a website to browse hundreds or thousands of GWAS.
MIT License

use SLURM for multiprocessing #91

Closed. pjvandehaar closed this issue 7 years ago

pjvandehaar commented 7 years ago

option 1: support pheweb qq --phenos=1-100 and run that via SLURM

I think /full/path/pheweb picks up the correct PYTHONPATH, so this should be correct for a local run:

    env PHEWEB_DATADIR=$(pwd) $(which pheweb) cmd --phenos=1-100 --local

and I can submit to SLURM with:

    sbatch --cpus-per-task=1 --mem=2048 --error=$(get_tmp_path()) --quiet --time=24:00:00 $(which pheweb) config data_dir={conf.data_dir} n_cpus=1 {cmd} --phenos=1-100
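
A rough sketch of how that submission could be assembled from Python, as an illustration only and not pheweb's actual code. build_sbatch_argv is a hypothetical helper; the pheweb config ... syntax and the --phenos flag are copied from the comment above and are unverified. The time limit is written as 24:00:00 because sbatch reads a bare integer as minutes.

```python
import shutil


def build_sbatch_argv(cmd, phenos, data_dir, stderr_path, pheweb_path=None):
    """Return the argv list for one single-core sbatch submission.

    Hypothetical helper: the pheweb arguments mirror the proposal in this
    thread and are assumptions, not a documented pheweb interface.
    """
    # Use the absolute path so the job does not depend on the node's PATH.
    pheweb_path = pheweb_path or shutil.which("pheweb")
    if pheweb_path is None:
        raise FileNotFoundError("pheweb executable not found on PATH")
    return [
        "sbatch",
        "--cpus-per-task=1",
        "--mem=2048",                # megabytes by default
        f"--error={stderr_path}",    # where SLURM writes the job's stderr
        "--quiet",
        "--time=24:00:00",           # hours:minutes:seconds
        pheweb_path,
        "config", f"data_dir={data_dir}", "n_cpus=1",
        cmd, f"--phenos={phenos}",   # assumed flag, as proposed above
    ]
```

Passing the list to subprocess.run without shell=True would avoid any shell quoting problems with data_dir or the temp path.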

using array jobs:

(docs)

pheweb slurm parse N would create a shell script that can be submitted with sbatch and that makes N single-core jobs with --time=0 (which SLURM treats as no time limit).
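
A minimal sketch of what generating such an array-job script might look like, assuming phenotypes are addressed by 0-based index and split into contiguous inclusive ranges. split_ranges and make_array_script are hypothetical names, and --phenos and --local are the flags proposed in this thread rather than confirmed pheweb options. #SBATCH --array and $SLURM_ARRAY_TASK_ID are standard SLURM.

```python
import shlex


def split_ranges(n_items, n_jobs):
    """Split indices 0..n_items-1 into contiguous inclusive (first, last) ranges."""
    if n_items < 1:
        raise ValueError("need at least one item")
    n_jobs = max(1, min(n_jobs, n_items))  # never create empty jobs
    base, extra = divmod(n_items, n_jobs)
    ranges, start = [], 0
    for i in range(n_jobs):
        size = base + (1 if i < extra else 0)  # the first `extra` jobs get one more
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def make_array_script(n_phenos, n_jobs, pheweb_path="pheweb", cmd="parse"):
    """Return the text of a bash script that runs one range per array task."""
    ranges = split_ranges(n_phenos, n_jobs)
    range_words = " ".join(f"{first}-{last}" for first, last in ranges)
    lines = [
        "#!/bin/bash",
        f"#SBATCH --array=0-{len(ranges) - 1}",
        "#SBATCH --cpus-per-task=1",
        "#SBATCH --time=0",
        # Each array task picks its own range by its task id.
        f"ranges=({range_words})",
        f"exec {shlex.quote(pheweb_path)} {shlex.quote(cmd)} "
        '--phenos="${ranges[$SLURM_ARRAY_TASK_ID]}" --local',
    ]
    return "\n".join(lines) + "\n"
```

Writing the result of make_array_script(n_phenos, N) to a file and passing that file to sbatch would then queue all N tasks in one submission.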

option 2: make a master-worker architecture with message-passing

  1. the master starts a server on some open port.
  2. the master submits pheweb worker --connect {ip}:{port} to SLURM a bunch of times.
  3. the worker connects back to the master's server, creating a bi-directional pipe
    • connection is http or ssh tunnel? an ssh tunnel is ~guaranteed to avoid firewalls, and it should be fine with a bare IP address, right? if this channel fails, drop debugging info in {conf.data_dir}/tmp/mp/$(hostname). the tunnel will require .ssh/authorized_keys, so assert that it exists.
    • communication over channel is just json-lines?
    • celery is built for this, and requires redis or RabbitMQ.
  4. the master and worker begin sending heartbeats once a minute. if the heartbeats stop, the worker exits.
  5. the master sends an RPC to the worker, like {cmd:'augment-phenos', phenos:[0,1,2,3]} or {cmd:'exit'}.
  6. the worker starts a thread (or process?) to handle the task, and leaves one thread to communicate with the master. workers send back exceptions and results, pickled/jsonified.
  7. when master needs more workers, it submits more to SLURM.
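
The worker side of steps 3 to 6 could be sketched roughly like this, assuming json-lines over a plain socket and JSON (not pickle) for results. Everything here is illustrative: run_worker, augment_phenos, the "heartbeat" message, the "id" field, and the two-minute timeout are assumptions, not anything pheweb defines. As a simplification, only the master initiates heartbeats; the worker echoes them and exits when none arrive in time.

```python
import json
import socket
import threading
from concurrent.futures import ThreadPoolExecutor

HEARTBEAT_TIMEOUT = 120.0  # assumed: give up after two missed minutely heartbeats


def augment_phenos(phenos):
    """Hypothetical stand-in for real work on a batch of phenotypes."""
    return {"n_done": len(phenos)}


TASKS = {"augment-phenos": augment_phenos}


def run_worker(sock, timeout=HEARTBEAT_TIMEOUT):
    """Serve json-lines RPCs from the master until 'exit' or heartbeat loss."""
    sock.settimeout(timeout)  # a silent master makes readline() time out
    reader = sock.makefile("r", encoding="utf-8")
    write_lock = threading.Lock()

    def send(msg):
        with write_lock:  # the task thread and the main loop share one socket
            sock.sendall((json.dumps(msg) + "\n").encode("utf-8"))

    def run_task(msg):
        try:
            result = TASKS[msg["cmd"]](msg["phenos"])
            send({"id": msg.get("id"), "result": result})
        except Exception as exc:  # report failures back to the master
            send({"id": msg.get("id"), "error": repr(exc)})

    # One thread runs tasks; this loop stays free to answer heartbeats.
    with ThreadPoolExecutor(max_workers=1) as pool:
        while True:
            try:
                line = reader.readline()
            except socket.timeout:
                return "heartbeat-lost"
            if not line:
                return "master-closed"
            msg = json.loads(line)
            if msg["cmd"] == "exit":
                return "exit"
            if msg["cmd"] == "heartbeat":
                send({"cmd": "heartbeat"})
            else:
                pool.submit(run_task, msg)
```

For a quick local try-out, socket.socketpair() gives two connected sockets, so one end can play the master while run_worker runs on the other in a thread.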

some packages to do a large portion of this:

pjvandehaar commented 7 years ago

Option 1 is finished for pheweb parse, using pheweb slurm-parse, in commit https://github.com/statgen/pheweb/commit/9db09299fb2b68c9c2e4452659a8ff4e6cb3aa51.

Next up:

pjvandehaar commented 7 years ago

option 3