use SLURM for multiprocessing

pjvandehaar commented 7 years ago

option 1: make subcommands accept a subset of phenotypes, e.g. `pheweb qq --phenos=1-100`, and run each chunk via SLURM
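A minimal sketch of how a `--phenos` option like this could be parsed with `argparse`. The range syntax (`1-100`, plus comma-separated pieces such as `3,7,10-12`), the inclusive endpoints, and the names `phenos_range`/`build_parser` are assumptions for illustration, not pheweb's actual CLI.

```python
import argparse
from typing import List


def phenos_range(spec: str) -> List[int]:
    """Parse a hypothetical spec like '1-100' or '3,7,10-12' into sorted, de-duplicated indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start_s, end_s = part.split("-", 1)
            start, end = int(start_s), int(end_s)  # a ValueError becomes an argparse usage error
            if start > end:
                raise argparse.ArgumentTypeError(f"bad range {part!r}: start > end")
            indices.update(range(start, end + 1))  # assumption: both endpoints included
        else:
            indices.add(int(part))
    return sorted(indices)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pheweb qq")
    parser.add_argument(
        "--phenos",
        type=phenos_range,
        default=None,  # None means "all phenotypes"
        help="only process these phenotype indices, e.g. 1-100",
    )
    return parser
```

For example, `build_parser().parse_args(["--phenos=1-3,7"]).phenos` gives `[1, 2, 3, 7]`.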

`/full/path/pheweb` uses the correct PYTHONPATH, I think. So `env PHEWEB_DATADIR=$(pwd) $(which pheweb) cmd --phenos=1-100 --local` should be correct, and I can use `sbatch --cpus-per-task=1 --mem=2048 --error=$(get_tmp_path()) --quiet --time=$((60*24)) $(which pheweb) config data_dir={conf.data_dir} n_cpus=1 {cmd} --phenos=1-100` (`--time` is in minutes, so that's 24 hours).
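The same submission can be built from Python without shell quoting by passing an argument list to `subprocess.run`. This is a sketch: `sbatch_argv` and `submit` are hypothetical helpers, the `config data_dir=... n_cpus=1 {cmd}` form is copied from the command above, and `error_path` stands in for `get_tmp_path()`.

```python
import shutil
import subprocess
from typing import List, Optional


def sbatch_argv(subcommand: str, phenos: str, data_dir: str, error_path: str,
                pheweb_path: Optional[str] = None) -> List[str]:
    """Build the argv for one single-core SLURM job running a pheweb subcommand."""
    # An absolute path to the pheweb entry point keeps the job on the same install.
    pheweb = pheweb_path or shutil.which("pheweb")
    if pheweb is None:
        raise FileNotFoundError("pheweb not found on PATH")
    return [
        "sbatch",
        "--cpus-per-task=1",
        "--mem=2048",            # no unit suffix: megabytes
        f"--error={error_path}",
        "--quiet",
        "--time=1-00:00:00",     # days-hours:minutes:seconds, i.e. 24 hours
        pheweb, "config", f"data_dir={data_dir}", "n_cpus=1",
        subcommand, f"--phenos={phenos}",
    ]


def submit(argv: List[str]) -> None:
    """Run sbatch; check=True raises CalledProcessError if submission fails."""
    subprocess.run(argv, check=True)
```

Calling `submit(sbatch_argv(...))` once per phenotype range would mirror the one-`sbatch`-per-chunk approach.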

using array jobs (docs):

`pheweb slurm parse N` would create a shell script that can be `sbatch`ed and will launch N single-core array jobs with `--time=0` (no time limit).
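A minimal sketch of what such a generated array script could look like, produced from Python. It assumes phenotypes are numbered from 0 and that `--phenos=START-END` is inclusive; `make_array_script` is a hypothetical name, and the `config ... {subcommand}` form is taken from the `sbatch` command earlier in this thread. Each array task reads `SLURM_ARRAY_TASK_ID` to pick its own chunk.

```python
import shlex


def make_array_script(n_jobs: int, n_phenos: int, pheweb_path: str,
                      data_dir: str, subcommand: str = "parse") -> str:
    """Return an sbatch script that splits phenotypes 0..n_phenos-1 over array tasks."""
    if n_jobs < 1 or n_phenos < 1:
        raise ValueError("n_jobs and n_phenos must be positive")
    chunk = (n_phenos + n_jobs - 1) // n_jobs  # ceiling division
    # Don't launch tasks that would get an empty chunk.
    n_tasks = (n_phenos + chunk - 1) // chunk
    lines = [
        "#!/bin/bash",
        f"#SBATCH --array=0-{n_tasks - 1}",
        "#SBATCH --cpus-per-task=1",
        "#SBATCH --mem=2048",
        "#SBATCH --time=0",  # 0 requests no time limit
        # Each task derives its own inclusive phenotype range from its array index.
        f"START=$((SLURM_ARRAY_TASK_ID * {chunk}))",
        f"END=$((START + {chunk} - 1))",
        f'if [ "$END" -ge {n_phenos} ]; then END={n_phenos - 1}; fi',
        f"{shlex.quote(pheweb_path)} config data_dir={shlex.quote(data_dir)} "
        f"n_cpus=1 {subcommand} --phenos=$START-$END",
    ]
    return "\n".join(lines) + "\n"
```

Writing the returned text to a file and running `sbatch` on it once would submit every chunk as a single array job, which is easier to monitor and cancel than N separate submissions.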

option 2: make a master-worker architecture with message-passing

the master starts a server on some open port.
the master submits `pheweb worker --connect {ip}:{port}` to SLURM a bunch of times.
the worker connects back to the master's server, creating a bi-directional pipe
- should the connection be http or an ssh tunnel? an ssh tunnel is ~guaranteed to get through firewalls. it's fine with a bare IP address, right? if this channel fails, drop debugging info in `{conf.data_dir}/tmp/mp/$(hostname)`. it'll require `~/.ssh/authorized_keys`, so assert that exists.
- is communication over the channel just JSON lines?
- celery is built for this, and requires a message broker such as redis or RabbitMQ.
the master and worker begin exchanging heartbeats every minute. if the heartbeats stop, the worker exits.
the master sends an RPC to the worker, like `{cmd:'augment-phenos', phenos:[0,1,2,3]}` or `{cmd:'exit'}`.
the worker starts a thread (or process?) to handle the task, and leaves one thread to communicate with the master. workers send back exceptions and results, pickled/jsonified.
when the master needs more workers, it submits more to SLURM.
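A minimal, single-machine sketch of the JSON-lines channel from the worker's side, assuming newline-delimited JSON messages with a `cmd` field as in the RPC examples above. It uses `socket.socketpair()` in place of a real TCP connection or ssh tunnel, runs each task inline rather than in a separate thread, and `handle` is a hypothetical stand-in for the real per-phenotype work. The heartbeat rule is approximated with a socket timeout: if nothing arrives from the master within `HEARTBEAT_TIMEOUT` seconds (an assumed value), the worker exits.

```python
import json
import socket
import threading
import traceback

# Assumed value: roughly two missed minutely heartbeats.
HEARTBEAT_TIMEOUT = 120.0


def send_msg(wfile, msg: dict) -> None:
    """Write one JSON object per line and flush so the peer sees it immediately."""
    wfile.write(json.dumps(msg) + "\n")
    wfile.flush()


def handle(task: dict):
    """Hypothetical dispatcher; real code would run pheweb's per-phenotype step."""
    if task["cmd"] == "augment-phenos":
        return {"done": task["phenos"]}
    raise ValueError(f"unknown cmd {task['cmd']!r}")


def worker_loop(sock: socket.socket) -> None:
    """Serve RPCs from the master until 'exit', EOF, or a heartbeat timeout."""
    sock.settimeout(HEARTBEAT_TIMEOUT)
    rfile = sock.makefile("r", encoding="utf-8")
    wfile = sock.makefile("w", encoding="utf-8")
    try:
        while True:
            try:
                line = rfile.readline()
            except socket.timeout:
                return  # master went silent: exit, as the plan above proposes
            if not line:
                return  # master closed the connection
            msg = json.loads(line)
            cmd = msg.get("cmd")
            if cmd == "exit":
                return
            if cmd == "heartbeat":
                send_msg(wfile, {"cmd": "heartbeat"})
                continue
            try:
                send_msg(wfile, {"ok": True, "result": handle(msg)})
            except Exception:
                # Ship the traceback back as text so the master can log or re-raise it.
                send_msg(wfile, {"ok": False, "error": traceback.format_exc()})
    finally:
        rfile.close()
        wfile.close()


def demo() -> list:
    """Drive one worker over a local socket pair and collect its replies."""
    master, worker = socket.socketpair()
    thread = threading.Thread(target=worker_loop, args=(worker,))
    thread.start()
    rfile = master.makefile("r", encoding="utf-8")
    wfile = master.makefile("w", encoding="utf-8")
    replies = []
    for msg in (
        {"cmd": "heartbeat"},
        {"cmd": "augment-phenos", "phenos": [0, 1, 2, 3]},
        {"cmd": "bogus"},
    ):
        send_msg(wfile, msg)
        replies.append(json.loads(rfile.readline()))
    send_msg(wfile, {"cmd": "exit"})
    thread.join()
    for f in (rfile, wfile, master, worker):
        f.close()
    return replies
```

Returning exceptions as traceback text keeps the channel pure JSON; pickling them instead would preserve the exception type but ties master and worker to the same Python environment.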

some packages to do a large portion of this:

dask.distributed does scheduling with a shared queue.
- dask-drmaa supports SLURM &c.
list of python parallelization packages
- jug looks reasonable.
- scoop supports map-reduce for SLURM.
- drmaa-python provides Python bindings to DRMAA for working with SLURM &c.

pjvandehaar commented 7 years ago

Option 1 is done for `pheweb parse`, via `pheweb slurm-parse`, as of commit https://github.com/statgen/pheweb/commit/9db09299fb2b68c9c2e4452659a8ff4e6cb3aa51.

Next up:

Option 1: make `pheweb slurm <subcommand>`, which plugs into the architecture those commands already use.
Option 2: make `pheweb slurm-<subcommand>` for each subcommand I want to SLURMify.
Option 3: don't do anything, and leave it as is.

pjvandehaar commented 7 years ago

option 3
