peterjc / thapbi-pict

Tree Health and Plant Biosecurity Initiative - Phytophthora ITS1 Classifier Tool
https://thapbi-pict.readthedocs.io/
MIT License

Cluster support via job arrays? #250

Open peterjc opened 4 years ago

peterjc commented 4 years ago

The following might be cluster agnostic, using an SGE array job or the equivalent.

Given a job with N slots, run thapbi_pict pipeline --slot 1;N ... through thapbi_pict pipeline --slot N;N ... on the cluster (i.e. thapbi_pict pipeline --slot i;N ... for each slot i).
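As a rough sketch under SGE (the --slot option is the hypothetical new flag proposed here, not something the tool currently has; the remaining arguments are elided as above):

#!/bin/bash
# Hypothetical worker script for the proposed --slot flag.
# Submit N copies with: qsub -t 1-N slot_worker.sh
# SGE sets SGE_TASK_ID to the array task number (1 to N).
N=96  # must match the -t 1-N range used when submitting
thapbi_pict pipeline --slot "${SGE_TASK_ID};${N}" ...  # quote the value, ";" is special to the shell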

Each worker job would scan the FASTQ files (and/or folders?), determine the pairs, sort them, and then act on share i of N to run the prepare-reads and classify steps on the allocated samples.
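For illustration, here is one way a worker could deterministically pick its share of the sorted FASTQ pairs (a rough shell sketch, assuming raw_data/*_R1*.fastq.gz style naming; in practice this logic would live inside thapbi_pict):

#!/bin/bash
# Sketch only: deterministic share i of N over the sorted sample list.
# Every worker computes the same sorted list, so the shares never overlap.
i=$1   # this worker's slot, 1 to N
N=$2   # total number of workers
ls raw_data/*_R1*.fastq.gz | sort | awk -v i="$i" -v n="$N" 'NR % n == i % n' |
while read -r R1; do
    R2=${R1/_R1/_R2}   # assumed naming convention for the paired file
    echo "Worker $i/$N would process $R1 and $R2"
done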

Workers 1 to N-1 would then stop.

Worker N (which will be one of the last to finish) would wait for all the intermediate files, and then run the reports.

(You could get fancy with multiple cleanup workers, covering the different reports - the edit-graph for example is slow enough to split out)

The point of this idea is that there is no master script, and no cluster-specific job dependency system. The cleanup worker job just needs to poll the file system for all the intermediate files (e.g. for each expected intermediate FASTA/TSV file X, sleep until X exists).
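A minimal sketch of that polling loop (the intermediate/ folder, file names, and expected_samples.txt list are placeholders):

#!/bin/bash
# Sketch only: the cleanup worker blocks until every expected
# intermediate file exists, then it is safe to run the reports.
for SAMPLE in $(cat expected_samples.txt); do   # hypothetical list of sample names
    while [ ! -f "intermediate/$SAMPLE.fasta" ]; do
        sleep 60
    done
done
echo "All intermediate files present, running the report steps..."
# e.g. the edit-graph and other report commands mentioned above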

peterjc commented 3 years ago

Documenting the core of my current setup using SLURM, which submits one worker job for each plate of raw data. It assumes a raw_data_by_plate/plate* style input tree.

First, the master script go.sh:

#!/bin/bash

# Array job workers each do a single plate (directory)
# Maps array task number (1 to N) to the directory name
# where there are N directories.

set -euo pipefail

Q=medium
N=$(ls -1 raw_data_by_plate/ | grep ^plate | wc -l)
NAME="thapbi-pict-pipeline-$N"  # include today's date?

sbatch -p $Q --job-name=$NAME --array="1-$N" prepare_plate.sh
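# The second job shares the same --job-name; with --dependency=singleton
# SLURM holds it until all the array tasks above have finished.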
sbatch -p $Q --job-name=$NAME --dependency=singleton go_part2.sh

echo "Array job for FASTA files started, stage two pending (overall reports)"

Script prepare_plate.sh is run on each plate (it can call thapbi_pict prepare-reads and thapbi_pict classify, or, if a per-plate report is useful, just call thapbi_pict pipeline with suitable output folder / name settings). The key trick here is mapping the array task number to the plate directory:

#!/bin/bash

#SBATCH --partition=short
#SBATCH --cpus-per-task=4
#SBATCH --mem=1G

# Array job worker thread to do a single plate (directory)
# Maps array task number (1 to N) to the directory name
# where there are N directories.

# Turn array job number into plate name (directory):
PLATE=$(ls -1 raw_data_by_plate/ | grep ^plate | sort | sed -n "${SLURM_ARRAY_TASK_ID}p")

# etc
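The elided part then just runs THAPBI PICT on that one plate, roughly as follows (the -i/-o options and folder names are assumptions, check the documentation for the exact usage):

# Sketch only: per-plate work writing into a shared intermediate folder,
# so that stage two can re-use the files (options and paths are assumptions).
mkdir -p intermediate/
thapbi_pict prepare-reads -i "raw_data_by_plate/$PLATE/" -o intermediate/
thapbi_pict classify -i intermediate/ -o intermediate/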

Using the SLURM singleton dependency trick, go_part2.sh waits until all the plate-level jobs are done, then moves on to stage two, which can simply run thapbi_pict pipeline on the entire input tree, re-use the intermediate files already created, and produce project-level reports covering all the plates.
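A rough sketch of what go_part2.sh might contain (again, the thapbi_pict options and folder names here are assumptions to be matched to the documented pipeline usage):

#!/bin/bash

# (#SBATCH resource lines as appropriate)

# Stage two: only starts once the singleton dependency clears, i.e.
# after every per-plate array task has finished. Running the pipeline
# over the whole input tree re-uses the existing intermediate files and
# produces the project-level reports (options and paths are assumptions).
set -euo pipefail

thapbi_pict pipeline -i raw_data_by_plate/ -o summary/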

peterjc commented 3 years ago

Would #219 (intermediate files in a DB) break either approach to running the pipeline split up over multiple machines?