Hey Samuel,
Thanks for using the script! I get your point, but I have only implemented the fundamental approach. For a feature like the one you are requesting, it would be best to embed the script in a proper Workflow Management System, e.g., Nextflow. Then you could -resume a run, etc.
I will consider it; it sounds like a fun task for when there is some time. But right now I am unfortunately quite full with other ToDos, so I can't promise anything soon.
Hello @hoelzer,
Thank you for your quick answer!
I agree, ideally, a workflow management system would be the solution, but I understand it might take time.
I have been thinking, and a temporary solution could be to use the option that you implemented for issue https://github.com/hoelzer/pocp/issues/2 and then loop pocp over all genomes in the directory, reporting a separate results file for each genome that could then be fused into one. This way, if the run is stopped, you can quickly resume from the last genome analyzed by filtering out those that were already processed.
Would that work?
The implementation I am thinking of would be something like this (in bash):
for genome in "$directory"/*; do
  genome=$(basename "$genome")
  echo "Genome $genome being analyzed..."
  ./pocp.rb "$directory" "${genome}_results" 2 "$genome"
  echo "Genome $genome done, moving to next genome"
done
Hi @sgarciah12, you are right! That would do the trick! And when you stop it or it breaks, you could resume from a subdirectory that only has the files that need processing.
In the end, you merge everything into one table.
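A minimal sketch of that resume logic in bash, assuming each finished genome leaves behind a ${genome}_results directory (the directory names are illustrative, not part of pocp itself):
directory=genomes   # hypothetical input directory holding the genome files
for genome in "$directory"/*; do
  genome=$(basename "$genome")
  # skip genomes that already produced results in a previous run
  if [ -d "${genome}_results" ]; then
    echo "Skipping $genome, results already present"
    continue
  fi
  ./pocp.rb "$directory" "${genome}_results" 2 "$genome"
done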
Hello,
I forgot to mention that this strategy worked perfectly well, especially when implemented on a cluster where you can parallelize the jobs. You can compare thousands of pairs of genomes in just a few hours.
I implemented my code for a SLURM-based job manager, so in case someone finds it useful, here is how I did it:
#!/bin/sh
#SBATCH --array=1-164%164
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --mem=100G
#SBATCH --job-name=POCP
module purge
module load ruby
module load blast+
echo -e "RUNNING \n"
species=$(head -n ${SLURM_ARRAY_TASK_ID} species_list.txt | tail -n 1)
echo -e "Genome $species being analyzed \n"
srun ./pocp.rb spiro_database/ ${species}_results/ 10 $species
echo -e "Genome $species done! \n"
Here, species_list.txt is just a list of the filenames of your genomes, one genome per line, and 164 is the number of genomes that I had.
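For reference, a hedged sketch of how such a list could be built and the array job submitted (pocp_array.sh is an assumed name for the script above):
# one genome filename per line, matching what pocp.rb expects as its last argument
ls spiro_database/ > species_list.txt
# the --array range (here 1-164) should equal the number of lines
wc -l species_list.txt
sbatch pocp_array.sh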
Afterwards, you just have to fuse all .csv files into a single one and that's it!
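One way to do that fusing in bash, assuming each run wrote a single .csv with a header row into its ${species}_results/ directory (the file pattern is an assumption):
# take the header row from the first results file only
first=$(ls *_results/*.csv | head -n 1)
head -n 1 "$first" > pocp_all.csv
# append the data rows (skipping headers) from every results file
for f in *_results/*.csv; do
  tail -n +2 "$f" >> pocp_all.csv
done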
Thanks for your feedback @hoelzer !
Oh great, thx for the feedback @sgarciah12 ! I might add your approach to the README until I find the time for a Nextflow implementation ;)
Hello,
I am running pocp for 164 genomes (26896 pairs), but I had to cancel the run before it finished analyzing all the pairs. Now I am running it again to obtain the remaining comparisons, and I have the feeling that the program is running the same comparisons again. Is there a way to avoid that, to prevent the calculations from taking forever when comparing large numbers of genomes?
In addition, it would be very useful to have the output file reported even if the run is cancelled before finishing; this way I could at least have some preliminary data.
Thank you for your time and help with this tool!
Best regards
Samuel