Documenting the core of my current setup using SLURM, which submits one worker job for each plate of raw data. It assumes a `raw_data_by_plate/plate*` style input tree.

First, the master script `go.sh`:
```bash
#!/bin/bash
# Array job workers each do a single plate (directory).
# Maps array task number (1 to N) to the directory name,
# where there are N directories.
set -euo pipefail
Q=medium
N=$(ls -1 raw_data_by_plate/ | grep '^plate' | wc -l)
NAME="thapbi-pict-pipeline-$N"  # include today's date?
sbatch -p "$Q" --job-name="$NAME" --array="1-$N" prepare_plate.sh
sbatch -p "$Q" --job-name="$NAME" --dependency=singleton go_part2.sh
echo "Array job for FASTA files started, stage two pending (overall reports)"
```
The script `prepare_plate.sh` is run on each plate. It can call `thapbi_pict prepare-reads` and `thapbi_pict classify`, or, if a per-plate report is useful, just call `thapbi_pict pipeline` with suitable output folder/name settings. The key trick here:
```bash
#!/bin/bash
#SBATCH --partition=short
#SBATCH --cpus-per-task=4
#SBATCH --mem=1G
# Array job worker to do a single plate (directory).
# Maps array task number (1 to N) to the directory name,
# where there are N directories.
# Turn the array job number into a plate name (directory):
PLATE=$(ls -1 raw_data_by_plate/ | grep '^plate' | sort | sed -n "${SLURM_ARRAY_TASK_ID}p")
# etc
```
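For example, with three plate directories the line selection works like this (an illustrative session, assuming directories `plate01` to `plate03`):

```
$ ls -1 raw_data_by_plate/ | grep '^plate' | sort | sed -n 2p
plate02
```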
Using the SLURM singleton dependency trick, stage two waits until all the plate-level jobs sharing the same job name are done. It can then just run `thapbi_pict pipeline` on the entire input tree, re-using the intermediate files already created, and produce project-level reports covering all the plates.
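For completeness, a minimal sketch of what `go_part2.sh` could look like; the `#SBATCH` resource settings and the `thapbi_pict pipeline` arguments are illustrative assumptions, not taken from my actual script:

```bash
#!/bin/bash
#SBATCH --partition=medium
#SBATCH --cpus-per-task=4
#SBATCH --mem=4G
# Stage two: thanks to --dependency=singleton this only starts once no
# other job with the same name (i.e. the plate array workers) is running.
set -euo pipefail
# Runs over the entire input tree; the per-plate intermediate files
# already exist, so only the overall reports remain to be built.
thapbi_pict pipeline -i raw_data_by_plate/ -o summary/  # arguments illustrative
```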
Would #219 (intermediate files in a DB) break either approach to running the pipeline split over multiple machines?
The following might be cluster agnostic, using an SGE array job or equivalent. Given a job with N slots, run `thapbi_pict pipeline --slot 1;N ...` through `thapbi_pict pipeline --slot i;N ...` to `thapbi_pict pipeline --slot N;N ...` on the cluster. Each worker job would scan the FASTQ files (and/or folders?), determine the pairs, sort them, and then act on share i of N to run the prepare-reads and classify steps on the allocated samples (a sketch follows).
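A sketch of how each worker could pick its share deterministically, even without the hypothetical `--slot` option, using only the environment variables the scheduler already provides (the FASTQ naming pattern here is an assumption):

```bash
#!/bin/bash
set -euo pipefail
I=${SLURM_ARRAY_TASK_ID}     # this worker's 1-based number (SGE: SGE_TASK_ID)
N=${SLURM_ARRAY_TASK_COUNT}  # total number of workers
# Every worker sorts the same file list, so the ordering is identical;
# each then keeps every N-th entry starting from its own offset I:
ls raw_data/*_R1.fastq.gz | sort | awk -v i="$I" -v n="$N" 'NR % n == i % n' > "share_$I.txt"
# ... run prepare-reads and classify on just the samples listed in share_$I.txt
```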
Workers 1 to N-1 would then stop.
Worker N (which will be one of the last to finish) would wait for all the intermediate files, and then run the reports.
(You could get fancy with multiple cleanup workers covering the different reports; the edit-graph, for example, is slow enough to be worth splitting out.)
The point of this idea is that there is no master script, and no cluster-specific job dependency system. The cleanup worker just needs to poll the file system for all the intermediate files (e.g. for each expected intermediate FASTA+TSV file X, sleep until file X exists), as in the sketch below.
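A minimal sketch of that polling loop, assuming a hypothetical `samples.txt` listing the expected sample names and an `intermediate/` folder for the per-sample FASTA and TSV files (a real version would also need to guard against half-written files):

```bash
#!/bin/bash
set -euo pipefail
# Hypothetical cleanup worker: block until every expected intermediate
# file exists and is non-empty, then build the project-level reports.
while read -r SAMPLE; do
    until [ -s "intermediate/$SAMPLE.fasta" ] && [ -s "intermediate/$SAMPLE.tsv" ]; do
        sleep 60
    done
done < samples.txt
# All intermediates present, safe to build the overall reports:
thapbi_pict pipeline ...  # same arguments as the worker jobs
```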