hoelzer / pocp

Calculation of the Percentage of Conserved Proteins following Qin, Xie et al. 2014 but using DIAMOND instead of BLASTP for alignments.
GNU General Public License v3.0

Avoid existing calculations #4

Closed · sgarciah12 closed this 1 year ago

sgarciah12 commented 1 year ago

Hello,

I am running pocp for 164 genomes (26896 pairs), but I had to cancel the run before it finished analyzing all the pairs. Now I am running it again to obtain the remaining comparisons, and I have the feeling that the program is running the same comparisons again. Is there a way to avoid that, to prevent the calculations from taking forever when comparing large numbers of genomes?

In addition, it would be very useful to have the output file written even if the run is cancelled before finishing; this way I could at least have some preliminary data.

Thank you for your time and help with this tool!

Best regards

Samuel

hoelzer commented 1 year ago

Hey Samuel,

Thanks for using the script! I get your point, but I have only implemented the fundamental approach. To support a feature like the one you are requesting, it would be best to embed the script in a proper workflow management system, e.g., Nextflow. Then you could -resume a run etc...

I will consider that; it sounds like a fun task for when there is some time. But right now I am unfortunately quite full w/ other ToDos, so I can't promise anything soon.

sgarciah12 commented 1 year ago

Hello @hoelzer,

Thank you for your quick answer!

I agree; ideally, a workflow management system would be the solution, but I understand it might take time.

I have been thinking, and a temporary solution could be to use the option you implemented for issue https://github.com/hoelzer/pocp/issues/2 and then loop pocp over all genomes in the directory, writing a separate results file for each genome that could later be merged into one. This way, if the run is stopped, you can quickly resume from the last genome analyzed by filtering out those that were already processed.

Would that work?

The implementation I am thinking of would be something like this (in bash):

for genome in "$directory"/*.fasta; do   # $directory holds your genome FASTA files; adjust the extension as needed

    name=$(basename "$genome")           # pass the filename, not the full path, to pocp.rb

    echo "Genome $name being analyzed..."

    ./pocp.rb "$directory" "${name}_results" 2 "$name"

    echo "Genome $name done, moving to next genome"

done

hoelzer commented 1 year ago

Hi @sgarciah12, you are right! That would do the trick! And when you stop it or it breaks, you could resume from a subdirectory containing only the files that still need processing.

In the end, you merge everything into one table.
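
If useful, here is a minimal sketch of that resume step, assuming the per-genome output directories are named <genome>_results as in the loop above and that the genomes are .fasta files:

# Hedged sketch: copy only the genomes that do not yet have a results
# directory into a fresh folder, then rerun the loop on that folder.
mkdir -p remaining
for genome in "$directory"/*.fasta; do
    name=$(basename "$genome")
    if [ ! -d "${name}_results" ]; then
        cp "$genome" remaining/
    fi
done
# Afterwards, point the loop at directory=remaining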

sgarciah12 commented 1 year ago

Hello,

I forgot to mention that this strategy worked perfectly well, especially when run on a cluster where you can parallelize the jobs. You can compare thousands of pairs of genomes in just a few hours.

I implemented my code for the SLURM job scheduler, so in case someone finds it useful, here is how I did it:

#!/bin/bash

# One array task per genome; %164 caps the number of concurrent tasks.
#SBATCH --array=1-164%164
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --mem=100G
#SBATCH --job-name=POCP

module purge
module load ruby
module load blast+

echo -e "RUNNING \n"

# Pick the genome for this array task: line N of species_list.txt for task N.
species=$(head -n ${SLURM_ARRAY_TASK_ID} species_list.txt | tail -n 1)

echo -e "Genome $species being analyzed \n"

srun ./pocp.rb spiro_database/ ${species}_results/ 10 $species

echo -e "Genome $species done! \n"

Here, species_list.txt is just a list of the filenames of your genomes, one per line, and 164 is the number of genomes I had.
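
In case it helps, the list itself can be generated straight from the genome folder; a minimal sketch, assuming the genomes in spiro_database/ are .fasta files:

# Write the genome filenames (without paths) to species_list.txt, one per line.
basename -a spiro_database/*.fasta > species_list.txt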

Afterwards, you just have to merge all the .csv files into a single one and that's it!
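
As a hedged sketch of that merge step, assuming each run writes its CSV into the <species>_results/ directory and that every CSV starts with the same header line (check the actual pocp.rb output first):

# Concatenate all result CSVs, keeping only the first file's header.
awk 'FNR==1 && NR!=1 {next} {print}' *_results/*.csv > pocp_all.csv

If the files have no header line, a plain cat of the CSVs does the same job.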

Thanks for your feedback @hoelzer !

hoelzer commented 1 year ago

Oh great, thx for the feedback @sgarciah12! I might add your approach to the README until I find the time for a Nextflow implementation ;)