oushujun / LTR_retriever

LTR_retriever is a highly accurate and sensitive program for identification of LTR retrotransposons; The LTR Assembly Index (LAI) is also included in this package.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5813529/
GNU General Public License v3.0

Output gff parameters #20

Closed xinshuaiqi closed 5 years ago

xinshuaiqi commented 5 years ago

Hi Shujun, I have a few questions while running LAI. I am new to LTR analysis, so I hope you can help me clarify a few things:

gff out file:

1) What's the meaning of Diversity(%) and SW_score in the gff output?
2) If the sequence-region has 'INT' as a suffix, does that mean "intact LTR"? It would be better to have an explanation of the output file in the manual.

LAI output:

1) When you calculate the whole-genome raw LAI and LAI, do you use the mean value of all the scaffolds? What if my input file is a contig file and many short contigs have no LTR (thus raw LAI and LAI = 0)? Will that affect the whole-genome raw LAI and LAI?
2) I do not fully understand the correction on raw LAI that you mentioned in the paper: "LAI score is correlated with the activities of LTR-RT ...". Does this correction have a bias between autopolyploid and allopolyploid plants?
3) What do the decimal numbers for Intact and Total mean in the LAI output? The percentage of this LTR in this sequence?

Repeatability:

When I run the analysis on the same genome more than once, I may get different LAI scores. Is that normal? Is that because you use a random seed in the script?

I do get both ".mod.out" and '.out' as output, what's the difference?

About finder and harvest

Is there a way to estimate the False positive rate in these two outputs based on your retriever output?

oushujun commented 5 years ago

Hi Xinshuai,

Thank you for your questions, which may help to clarify things for other users. Please refer to the following information.

gff out file:

  1. These two values are derived from RepeatMasker. Basically, Diversity(%) = 100 - identity(%), and it may also be adjusted depending on the nucleotide substitution model. SW_score is an alignment confidence score: the higher, the better. RepeatMasker hits with SW_score < 300 were filtered out.
  2. INT means internal region. Intact LTR-RTs were split into three pieces (LTR-INT-LTR), then clustered using cd-hit or blastclust.
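As a small illustration of the two gff values, here is a minimal Python sketch of the uncorrected relation Diversity(%) = 100 - identity(%) and of the SW_score < 300 filter. The hit names and numbers are made up, and the sketch ignores any substitution-model adjustment that RepeatMasker may apply.

```python
SW_SCORE_CUTOFF = 300  # cutoff quoted in the reply above


def diversity_from_identity(identity_pct):
    """Uncorrected diversity in percent: 100 - identity (no substitution model)."""
    return 100.0 - identity_pct


def keep_hit(sw_score, cutoff=SW_SCORE_CUTOFF):
    """Hits with a score below the cutoff are filtered out; the rest are kept."""
    return sw_score >= cutoff


# Hypothetical hits: (name, identity in percent, SW score)
hits = [("hitA", 92.5, 1250), ("hitB", 81.0, 280), ("hitC", 99.0, 300)]

kept = [
    (name, diversity_from_identity(identity))
    for name, identity, sw_score in hits
    if keep_hit(sw_score)
]
print(kept)
```

Here hitB is dropped because its score is below 300, while hitC sits exactly at the cutoff and is kept.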

LAI output:

  1. Whole-genome LAI/raw LAI is the mean of LAI/raw LAI over all input sequences, including all scaffolds and contigs, whether or not they contain LTRs. A substantial number of single contigs with no LTRs will decrease both values, since these contigs are not assembled into larger sequences and thus reflect low assembly continuity (low LAI).
  2. The correction is needed because LTR-RT activity differs between species, which matters especially for between-species assembly comparisons. So far we have not identified substantial bias in this correction process. For polyploid species, the correction cannot be done properly because of the whole-genome duplication, so you need to calculate LAI on one subgenome at a time by providing a list of the chromosome/scaffold/contig names of that subgenome with the -mono parameter.
  3. These two numbers are meant to be read in percentage format; the output just leaves out the "%" sign.
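To see how contigs without LTRs pull the whole-genome value down, here is a minimal Python sketch. It assumes "mean" is a plain unweighted average over sequences, which is only one reading of the reply; the LAI script itself may aggregate differently. The sequence names and scores are hypothetical.

```python
def mean_lai(per_sequence_lai):
    """Unweighted mean of per-sequence LAI values (an assumed reading of 'mean')."""
    if not per_sequence_lai:
        raise ValueError("no sequences given")
    return sum(per_sequence_lai.values()) / len(per_sequence_lai)


# Hypothetical per-sequence LAI values; two short contigs carry no LTRs.
lai = {"chr1": 12.0, "chr2": 10.0, "contig_7": 0.0, "contig_8": 0.0}

with_empty = mean_lai(lai)
without_empty = mean_lai({name: v for name, v in lai.items() if v > 0})

print(f"all sequences: {with_empty}, LTR-containing only: {without_empty}")
```

In this toy case the two LTR-free contigs halve the average, which is the effect described in point 1.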

Repeatability:

You should expect a slight difference (<0.1) between runs. Unfortunately, because of multithreading, there is currently no way to control this slight variation. For LAI estimation, the correction procedure is set to quick mode (-q) in LTR_retriever by default. Quick mode applies a linear estimation of whole-genome LTR identity to accelerate the procedure, instead of performing a whole-genome all-versus-all blast of LTR sequences. You may try the latter method to control this potential approximation error.

genome.mod.out is the RepeatMasker result obtained by masking the genome.mod file with the library produced by LTR_retriever. genome.mod is created when sequence names are longer than 15 characters. Name length is limited by RepeatMasker, so I did the trimming here. The sequences themselves are untouched.
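For readers who want a feel for the name trimming behind genome.mod, here is a minimal Python sketch that shortens FASTA header names to 15 characters and leaves the sequence lines untouched. It is only an illustration of the idea, not the code LTR_retriever uses; the function name and the duplicate-name check are assumptions.

```python
MAX_NAME_LEN = 15  # name length limit mentioned in the reply above


def shorten_headers(fasta_lines, max_len=MAX_NAME_LEN):
    """Trim each FASTA sequence name to max_len characters; keep sequences as-is."""
    seen = set()
    out = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            # Use the first whitespace-delimited token as the sequence name.
            name = fields[0][:max_len] if fields else ""
            if name in seen:
                # Trimming can make two names collide; stop rather than guess.
                raise ValueError(f"duplicate name after trimming: {name!r}")
            seen.add(name)
            out.append(">" + name)
        else:
            out.append(line)
    return out


# Hypothetical input records
records = [">scaffold_0001_length_120034 extra", "ACGTACGT", ">chr1", "GGCC"]
shortened = shorten_headers(records)
print(shortened)
```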

About finder and harvest:

False positive rate is the percentage of falsely reported cases. To determine whether an item is false, you need a standard set, or true answer, to compare it with. For de novo LTR annotations, there is no standard set unless there is human curation, so you cannot know the false positive rate. You may curate a subset of the intact LTR-RTs to get some insight. Alternatively, we provide some curated examples in the supplementary data of our paper to compare different combinations of inputs. You can calculate the false positive rate using the equations in the methods section.
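As a rough illustration of checking calls against a curated set, here is a minimal Python sketch. It follows the wording of the reply (the percentage of reported cases that are false) and may not match the exact equations in the paper's methods section. The element IDs are hypothetical.

```python
def false_positive_rate_pct(reported, curated_true):
    """Percentage of reported items that are absent from the curated true set."""
    reported = set(reported)
    if not reported:
        raise ValueError("no reported items")
    false_calls = reported - set(curated_true)
    return 100.0 * len(false_calls) / len(reported)


# Hypothetical IDs: four reported candidates, three of them confirmed by curation.
reported = {"LTR_1", "LTR_2", "LTR_3", "LTR_4"}
curated = {"LTR_1", "LTR_2", "LTR_3", "LTR_9"}

fpr = false_positive_rate_pct(reported, curated)
print(f"{fpr:.1f}% of reported candidates are not in the curated set")
```

The same function can be applied to each input combination in turn, as long as all are compared against the same curated set.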

Please let me know if you have further questions.

Thanks, Shujun

xinshuaiqi commented 5 years ago

Thanks for clarifying these questions! Best, Xinshuai