swvanderlaan / MetaGWASToolKit

A ToolKit to perform a Meta-analysis of Genome-Wide Association Studies
https://swvanderlaan.github.io/MetaGWASToolKit/
MIT License

(Plotter) scripts that rely on generating data before queuing slurm command could be sped up #35

Closed MVPuijk closed 1 year ago

MVPuijk commented 1 year ago

Any part of a script that first generates data and then writes ".sh" files to queue with sbatch could instead write the data-generation step into the ".sh" files themselves. For example:

    echo "- producing normal QQ-plots..." # P-value
    zcat ${PROJECTDIR}/${COHORTNAME}.${DATAEXT} | ${SCRIPTS}/parseTable.pl --col P | tail -n +2 > ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.txt

    printf "#!/bin/bash\nRscript ${SCRIPTS}/plotter.qq.R --projectdir ${PROJECTDIR} --resultfile ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.txt --outputdir ${PROJECTDIR} --stattype ${STATTYPE} --imageformat ${IMAGEFORMAT}" > ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.sh
    ## qsub -S /bin/bash -N ${COHORTNAME}.${DATAPLOTID}.QQ -o ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.log -e ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.errors -l h_vmem=${QMEMPLOTTER} -l h_rt=${QRUNTIMEPLOTTER} -wd ${PROJECTDIR} ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.sh
    QQ_ID=$(sbatch --parsable --job-name=${COHORTNAME}.${DATAPLOTID}.QQ -o ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.log --error ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.errors --time=${QRUNTIMEPLOTTER} --mem=${QMEMPLOTTER} --mail-user=${QMAIL} --mail-type=${QMAILOPTIONS} --chdir=${PROJECTDIR}/ ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.sh)

Would become something like:

    echo "- producing normal QQ-plots..." # P-value
    printf "#!/bin/bash\nzcat ${PROJECTDIR}/${COHORTNAME}.${DATAEXT} | ${SCRIPTS}/parseTable.pl --col P | tail -n +2 > ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.txt\n" > ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.sh
    printf "Rscript ${SCRIPTS}/plotter.qq.R --projectdir ${PROJECTDIR} --resultfile ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.txt --outputdir ${PROJECTDIR} --stattype ${STATTYPE} --imageformat ${IMAGEFORMAT}" >> ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.sh
    ## qsub -S /bin/bash -N ${COHORTNAME}.${DATAPLOTID}.QQ -o ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.log -e ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.errors -l h_vmem=${QMEMPLOTTER} -l h_rt=${QRUNTIMEPLOTTER} -wd ${PROJECTDIR} ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.sh
    QQ_ID=$(sbatch --parsable --job-name=${COHORTNAME}.${DATAPLOTID}.QQ -o ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.log --error ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.errors --time=${QRUNTIMEPLOTTER} --mem=${QMEMPLOTTER} --mail-user=${QMAIL} --mail-type=${QMAILOPTIONS} --chdir=${PROJECTDIR}/ ${PROJECTDIR}/${COHORTNAME}.${DATAPLOTID}.QQ.sh)

This should speed things up: generating the necessary data becomes part of the sbatch job, so it can run at the same time as other similar jobs instead of blocking the script until the data is generated.

WARNING: This approach only works if the generated data is used by a single sbatch job, like the QQ plot in "gwas.plotter.sh". If the data is used by multiple jobs, like the Manhattan plots generated in "gwas.plotter.sh", then the data generation needs to stay separate (although it could still be turned into its own sbatch job and made a dependency of the Manhattan plot jobs).
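
A rough sketch of that dependency variant, in case it helps. The helper names `write_job` and `submit_chain` are made up for illustration and are not part of MetaGWASToolKit. It assumes the shared data generation is written to its own job script, submitted once, and that each plot job is then submitted with `--dependency=afterok:<jobid>` so it only starts once the data job has finished successfully:

```shell
#!/bin/bash
# Sketch only: write_job and submit_chain are hypothetical helpers,
# not functions that exist in MetaGWASToolKit.

# Write a minimal job script: a shebang line followed by one command line.
# $1 = path of the job script to create, $2 = command line to put in it.
write_job() {
    local script="$1" command="$2"
    printf '#!/bin/bash\n%s\n' "${command}" > "${script}"
}

# Submit the data-generation job once, then submit every plot job so that
# it waits for the data job.
# $1 = data-generation job script, remaining arguments = plot job scripts.
submit_chain() {
    local data_script="$1"; shift
    local data_id plot_script
    # --parsable makes sbatch print only the job ID (optionally ";cluster").
    data_id=$(sbatch --parsable "${data_script}") || return 1
    data_id="${data_id%%;*}"   # drop the optional ";cluster" suffix
    for plot_script in "$@"; do
        # afterok: start only after the data job ended with exit code 0.
        sbatch --parsable --dependency=afterok:"${data_id}" "${plot_script}"
    done
}
```

With this, something like `submit_chain data.sh manhattan.full.sh manhattan.qc.sh` (file names are placeholders) would queue everything at once without blocking the calling script. The real calls would also need the `--job-name`, `--time`, `--mem` and mail options used above.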

MVPuijk commented 1 year ago

This improvement was added in the plotting for the prep phase.