lmrodriguezr / enveomics

Scripts and libraries for Environmental Genomics
http://enve-omics.ce.gatech.edu/enveomics

aai-matrix is really slow #25

Open sbridel opened 7 years ago

sbridel commented 7 years ago

I can't re-open my issue about multithreading, so I am reposting my message here.

Unfortunately, I can't read the MiGA documentation: I only see the titles in the docs.

I was wondering how BLAST is called within aai.rb. I see two options: add mpiBLAST (http://www.mpiblast.org/) to aai.rb so that BLAST itself runs on several CPUs, or add parallel / xargs to aai-matrix.sh to parallelize the loop. For the second option, I don't know whether it is better to launch the main loop or the nested loop as several jobs. Either way, aai.rb would be launched as several jobs, each job calling BLAST. This does not make each pairwise BLAST comparison faster, but it runs aai.rb on several pairs of genomes at the same time, so it may reduce the total computing time. However, I haven't done anything like this before and I don't really know how to do it.

for i in "${SEQS[@]}" ; do
  for j in "${SEQS[@]}" ; do
    echo -n " o $i vs $j: "
    AAI=$(aai.rb -1 "$i" -2 "$j" -S "$OUT.db" -t "$THR" \
      --no-save-rbm --auto --quiet)
    echo ${AAI:-Below detection}
    [[ "$i" == "$j" ]] && break
  done
done

I want to add that I have launched an AAI comparison using aai-matrix on 35 genomes. From 11:30 a.m. to 5 p.m., it has only compared 2 species against the 35 species (so 70 comparisons). I wonder how long the full analysis will take.

I'm pretty sure that parallel could perform well by parallelizing the loops. I tried to make it work, but I have never used parallel before, so I am hoping for some help.
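
To make the idea concrete, here is a rough sketch of the xargs variant: it lists the same genome pairs as the nested loop and hands them to several aai.rb processes at once. The names JOBS, AAI_BIN, list_pairs, run_pair and parallel_aai are made up for illustration and are not part of enveomics. The `-S "$OUT.db"` option from the original loop is left out on purpose, because I don't know whether several processes can safely write to the same database file at once. I have not run this against real genomes, so take it as a starting point only.

```shell
#!/usr/bin/env bash
# Sketch only: JOBS, AAI_BIN, list_pairs, run_pair and parallel_aai are
# hypothetical names, not part of enveomics.

JOBS="${JOBS:-4}"              # how many aai.rb processes run at once
AAI_BIN="${AAI_BIN:-aai.rb}"   # command used for each pairwise comparison

# Emit the same pairs as the nested loop (each pair once, self-pairs
# included), as NUL-separated fields so odd file names survive.
list_pairs() {
  local i j
  for i in "$@"; do
    for j in "$@"; do
      printf '%s\0%s\0' "$i" "$j"
      [[ "$i" == "$j" ]] && break
    done
  done
}

# Run one comparison and print a single result line.
# -S is omitted here (assumption: a shared database may not be safe
# for concurrent writers).
run_pair() {
  local i="$1" j="$2" aai
  aai=$("$AAI_BIN" -1 "$i" -2 "$j" -t "${THR:-1}" \
    --no-save-rbm --auto --quiet)
  echo " o $i vs $j: ${aai:-Below detection}"
}

# Make the function and settings visible to the bash processes
# that xargs starts.
export -f run_pair
export AAI_BIN THR

# Feed two arguments at a time to up to $JOBS concurrent workers.
parallel_aai() {
  list_pairs "$@" | xargs -0 -n 2 -P "$JOBS" bash -c 'run_pair "$1" "$2"' _
}
```

It would be called as `parallel_aai "${SEQS[@]}"`. The result lines come out in whatever order the jobs finish, so they would need to be sorted or parsed afterwards. The total CPU load is roughly JOBS times THR, so THR defaults to 1 in this sketch.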

This would be a great feature for your tool.