oushujun / LTR_retriever

LTR_retriever is a highly accurate and sensitive program for identification of LTR retrotransposons; The LTR Assembly Index (LAI) is also included in this package.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5813529/
GNU General Public License v3.0
176 stars 40 forks source link

Suggestions for optimization #97

Closed andreaswallberg closed 1 year ago

andreaswallberg commented 3 years ago

Dear @oushujun,

Thanks for a wonderful tool!

I have run this on a very large LTR_FINDER dataset from a large genome and poked around a little in the code. I have two suggestions for optimization:

  1. Check your Perl regexps in all scripts where you match or substitute the sequence header tags, i.e. ">" in FASTA files or "@" in the $tandem TRF output stream in cleanup.pl (and possibly elsewhere), to match only the first character on the line rather than possibly looking for the character across the whole line. So instead of /\>/ do /^\>/ or instead of /\@(.*)\n?// in cleanup.pl, do /^\@(.*)\n?//. I believe this is what the scripts are intended to do in most cases anyway, so do this avoids scanning for example very long DNA sequences for these symbols.
  2. Split TRF runs into multiple subsets so that they can be run in parallel from cleanup.pl. I have written a drop-in replacement bash script (that executes parallel TRF jobs on chunks, waits for them to finish and the cats output backup to cleanup.pl) and a Perl script that first splits the genome into those nearly equally sized chunks. In cleanup.pl, I set $trf_path="$script_path/run_trf_in_parallel.sh". Your flags are passed exactly as before and nothing else in cleanup.pl is changed. However, because you currently do not pass a variable for the number of threads/jobs to the trf process, I have just hard coded it to 40 for now.

Let me know if you are interested in these two files for point 2. They are very simple and I could pretty much paste them as code here.

oushujun commented 3 years ago

Dear Andreas,

Thank you so much for these wonderful suggestions. I will modify the regex as suggested in the next version. Please PR the parallel scripts to the repo and I will combine them to main.

Best, Shujun

On Mon, Jun 21, 2021 at 4:12 AM Andreas Wallberg @.***> wrote:

Dear @oushujun https://github.com/oushujun,

Thanks for a wonderful tool!

I have run this on a very large LTR_FINDER dataset from a large genome and poked around a little in the code. I have two suggestions for optimization:

  1. Check your Perl regexps in all scripts where you match or substitute the sequence header tags, i.e. ">" in FASTA files or "@" in the $tandem TRF output stream in cleanup.pl (and possibly elsewhere), to match only the first character on the line rather than possibly looking for the character across the whole line. So instead of />/ do /^>/ or instead of /\@(.)\n?// in cleanup.pl, do /^\@(.)\n?//. I believe this is what the scripts are intended to do in most cases anyway, so do this avoids scanning for example very long DNA sequences for these symbols.
  2. Split TRF runs into multiple subsets so that they can be run in parallel from cleanup.pl. I have written a drop-in replacement bash script (that executes parallel TRF jobs on chunks, waits for them to finish and the cats output backup to cleanup.pl) and a Perl script that first splits the genome into those nearly equally sized chunks. In cleanup.pl, I set $trf_path="$script_path/run_trf_in_parallel.sh". Your flags are passed exactly as before and nothing else in cleanup.pl is changed. However, because you currently do not pass a variable for the number of threads/jobs to the trf process, I have just hard coded it to 40 for now.

Let me know if you are interested in these two files for point 2. They are very simple and I could pretty much paste them as code here.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/oushujun/LTR_retriever/issues/97, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNX4NH2LRYJU2I5377DH73TTZDSRANCNFSM47AMGSAA .

xiekunwhy commented 2 years ago

Replace TRF by Look4TRs may be a good idea to speed up running.

oushujun commented 2 years ago

Hi @xiekunwhy,

Thank you for the suggestion. Do you know if Look4TRs generates the same result as TRF? If not it will need to be recalibrated with benchmarks to guarantee the downstream results.

Best, Shujun