oushujun / LTR_retriever

LTR_retriever is a highly accurate and sensitive program for identification of LTR retrotransposons. The LTR Assembly Index (LAI) is also included in this package.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5813529/
GNU General Public License v3.0

LTR_FINDER_parallel and salvage mode #95

Closed anandksrao closed 3 years ago

anandksrao commented 3 years ago

Dear Shujun,

I seek your help with understanding how exactly to use salvage mode of your LTR_FINDER_parallel, and also how to avoid using the salvage mode itself, if possible. Before my questions, some context.

Generic syntax I am using:

$ LTR_FINDER_parallel -seq $genome -threads 10 -harvest_out -size 1000000 -time 6000

Dependency check results:

$ LTR_FINDER_parallel -check_dependencies
Using this LTR_FINDER: /share/apps/LTR_FINDER_parallel-1.1/bin/LTR_FINDER.x86_64-1.0.7/
Pass!

Example snippet of STDOUT indicating LTR_FINDER_parallel works OK for the most part:

Tue Jun  1 15:05:30 2021 CPU1: running on Ahypog_1_sub1
Tue Jun  1 15:05:30 2021 CPU2: running on Ahypog_1_sub2
Tue Jun  1 15:05:30 2021 CPU3: running on Ahypog_1_sub3
Tue Jun  1 15:05:30 2021 CPU4: running on Ahypog_1_sub4
Tue Jun  1 15:05:30 2021 CPU5: running on Ahypog_1_sub5
Tue Jun  1 15:05:30 2021 CPU6: running on Ahypog_1_sub6
Tue Jun  1 15:05:30 2021 CPU7: running on Ahypog_1_sub7
Tue Jun  1 15:05:30 2021 CPU8: running on Ahypog_1_sub8
Tue Jun  1 15:05:30 2021 CPU9: running on Ahypog_1_sub9
Tue Jun  1 15:05:30 2021 CPU10: running on Ahypog_1_sub10

But a few of these parallel threads gave timeout messages in the same STDOUT:

Tue Jun  1 17:32:07 2021 CPU8: Ahypog_7_sub23 timeout, process it with the salvage mode
Tue Jun  1 19:49:29 2021 CPU3: Ahypog_19_sub64 timeout, process it with the salvage mode
Tue Jun  1 19:50:34 2021 CPU7: Ahypog_19_sub69 timeout, process it with the salvage mode
Tue Jun  1 20:18:37 2021 CPU8: Ahypog_24_sub1 timeout, process it with the salvage mode
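If the run's STDOUT has been saved to a file, the timed-out regions can be listed with a short filter. This is a minimal sketch that assumes the message format shown in the log lines above; the helper name `list_timeouts` and the log filename are hypothetical and not part of LTR_FINDER_parallel.

```shell
# List regions that the run reported as timed out.
# Assumes each message looks like:
#   <date> CPU<n>: <region> timeout, process it with the salvage mode
# list_timeouts is a hypothetical helper, not part of LTR_FINDER_parallel.
list_timeouts() {
    # $1: path to a file holding the saved STDOUT of the run
    awk '/ timeout, process it with the salvage mode/ {
        # The region name is the field right before the word "timeout,"
        for (i = 1; i <= NF; i++) {
            if ($i == "timeout,") { print $(i - 1); break }
        }
    }' "$1"
}

# Example (run.log is an assumed filename):
#   list_timeouts run.log | sort -u
```

For the four messages above, this would print the four region names, one per line.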

So my questions to you about your LTR_FINDER_parallel and its salvage mode are as follows, please:

Q1. Do I need to explicitly include the '-try1' flag or as the help menu indicates, is this already default?

Q2. To process the failed parts of my run in salvage mode, what is the syntax I should use?

Q3. Will the salvage mode take shorter time by recognizing failed parts of the run and attempt to repeat just for those genomic regions?

Q4. Is it theoretically possible for the salvage mode itself to fail? In that case, is the only option to use '-try1 0', i.e. discard that entire genomic region, OR are there other workarounds?

Q5. Is one such workaround just increasing the -time flag to a much larger value? As you can see, I am already using 6000; maybe I should simply bump it up to 12K or 24K, or could this create any other problems?

Q6. Could another workaround be reducing the -size flag to smaller genomic sizes? As it is I am already using 1MB windows, rather than default 5MB windows, but could I reduce it further to 0.5MB window, perhaps?

I could try all these ideas but my univ HPCC is super busy these days, and starting a job is a long wait, so there's not much opportunity to try different syntax! - And so I am reaching out to you :) Thank you in advance!

Cheers, Anand

Help menu for the installation on my university HPC cluster:

LTR_FINDER_parallel -h

~ ~ ~ Run LTR_FINDER in parallel ~ ~ ~

Author: Shujun Ou (shujun.ou.1@gmail.com)
Date: 09/19/2018
Update: 01/28/2020
Version: v1.1

Usage: perl LTR_FINDER_parallel -seq [file] -size [int] -threads [int]
Options:    -seq    [file]  Specify the sequence file.
        -size   [int]   Specify the size you want to split the genome sequence.
                Please make it large enough to avoid spliting too many LTR elements. Default 5000000 (bp)
        -time   [int]   Specify the maximum time to run a subregion (a thread).
                This helps to skip simple repeat regions that take a substantial of time to run. Default: 1500 (seconds).
                Suggestion: 300 for -size 1000000. Increase -time when -size increased.
        -try1   [0|1]   If a region requires more time than the specified -time (timeout), decide:
                    0, discard the entire region.
                    1, further split to 50 Kb regions to salvage LTR candidates (default);
        -harvest_out    Output LTRharvest format if specified. Default: output LTR_FINDER table format.
        -next       Only summarize the results for previous jobs without rerunning LTR_FINDER (for -v).
        -verbose|-v Retain LTR_FINDER outputs for each sequence piece.
        -finder [file]  The path to the program LTR_FINDER (default v1.0.7, included in this package).
        -threads|-t [int]   Indicate how many CPU/threads you want to run LTR_FINDER.
        -check_dependencies Check if dependencies are fullfiled and quit
        -help|-h    Display this help information.
oushujun commented 3 years ago

Hi Anand,

Here are my answers:

Q1. Do I need to explicitly include the '-try1' flag or as the help menu indicates, is this already default?
Q2. To process the failed parts of my run in salvage mode, what is the syntax I should use?

No, you don't need to. The default is -try1 1, as stated in the help info, so you don't need to do anything else to enter the salvage mode, unless you don't want it (-try1 0).

Q3. Will the salvage mode take shorter time by recognizing failed parts of the run and attempt to repeat just for those genomic regions?

Yes, and again, it's an automatic step.

Q4. Is it theoretically possible for the salvage mode itself to fail? In that case, is the only option to use '-try1 0', i.e. discard that entire genomic region, OR are there other workarounds?

The underlying logic is pretty simple. If a window takes too long to finish, that means it has a pretty complex (or simple) structure, such as tandem repeats. There are two ways to solve this. The first is to provide more -time so that LTR_FINDER can finish all possible candidates. The second is to make the window shorter so that the number of candidates is significantly reduced. The purpose of this wrapper is fast execution, so I opted for the second solution, which further chops the original window (5 Mb by default) into much shorter regions (50 kb), aka the salvage mode. The pitfall of splitting sequences is also obvious: if you split too much or make a window too small, LTRs can be split across different windows and lost. In our benchmark, this loss is not big, and splitting sometimes even gives a gain (see the paper).
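The trade-off described above can be made concrete with a little arithmetic. The sketch below assumes windows are cut at fixed, non-overlapping multiples of the window length, which is an assumption for illustration only; the wrapper's actual splitting may differ. The helper names `n_pieces` and `crosses_boundary` are hypothetical.

```shell
# Number of pieces of length $2 needed to cover a window of length $1
# (ceiling division). Hypothetical helper for illustration.
n_pieces() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Succeeds (exit status 0) if an element spanning 1-based coordinates $1..$2
# would be cut by a boundary between consecutive windows of length $3.
crosses_boundary() {
    [ $(( ($1 - 1) / $3 )) -ne $(( ($2 - 1) / $3 )) ]
}

n_pieces 5000000 50000   # 5 Mb window in 50 kb pieces: prints 100
n_pieces 1000000 50000   # 1 Mb window in 50 kb pieces: prints 20
```

Under these assumptions, `crosses_boundary 45000 56000 50000` succeeds: an 11 kb element at that position straddles the 50 kb boundary, which is how a candidate can be lost when windows get small.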

Q5. Is one such workaround just increasing the -time flag to a much larger value? As you can see, I am already using 6000; maybe I should simply bump it up to 12K or 24K, or could this create any other problems?

No problem as far as I know. For "difficult" windows, you just need to wait longer (i.e., up to 6000s per difficult window with your current setting).

Q6. Could another workaround be reducing the -size flag to smaller genomic sizes? As it is I am already using 1MB windows, rather than default 5MB windows, but could I reduce it further to 0.5MB window, perhaps?

Yes you can. See discussions under Q4.
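As a sketch of what such a rerun could look like, the helper below only prints the invocation so it can be checked before submitting a job. It uses just the flags listed in the help menu above; the helper name `ltr_finder_cmd`, the genome filename, and the chosen values are illustrative assumptions, not recommendations.

```shell
# Print (rather than run) an LTR_FINDER_parallel invocation for inspection.
# ltr_finder_cmd is a hypothetical helper; the flags come from the help menu.
ltr_finder_cmd() {
    # $1: genome FASTA, $2: window size in bp, $3: per-window time limit in seconds
    echo "LTR_FINDER_parallel -seq $1 -threads 10 -harvest_out -size $2 -time $3"
}

# 0.5 Mb windows with the same time limit as before (genome.fa is an assumed name)
ltr_finder_cmd genome.fa 500000 6000
```

Keeping the window size and time limit as arguments makes it easy to compare a few settings on paper when cluster time is scarce.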

LTR_FINDER_parallel is pretty quick and requires very little memory. You may request fewer CPUs and a longer wall time so that your job lands in a shorter queue.

Let me know if you have more questions.

Shujun