oushujun / LTR_retriever

LTR_retriever is a highly accurate and sensitive program for identification of LTR retrotransposons; The LTR Assembly Index (LAI) is also included in this package.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5813529/
GNU General Public License v3.0

difference LAI index quick run and full run #47

Closed KristinaGagalova closed 5 years ago

KristinaGagalova commented 5 years ago

Hi, I am running the LAI index on large plant genomes (~20 Gbp) with very fragmented genome assemblies. I have the following results for the quick (-q) run and the standard-mode run.

Complete

Chr From    To  Intact  Total   raw_LAI LAI
whole_genome    1   24626904232 0.0037  0.5263  0.70    1.55

Quick

Chr From    To  Intact  Total   raw_LAI LAI
whole_genome    1   24626904232 0.0037  0.5263  0.70    6.33

The difference in the final LAI is quite large, while the raw LAI is the same in both runs. Is that normal behavior?

For another run I get the following warnings:

Warning: [blastn] lcl|Query_272220 s0165314:245..410|s0165314:245..410: Warning: Could not calculate ungapped Karlin-Altschul parameters due to an invalid query sequence or its translation. Please verify the query sequence(s) and/or filtering options 
Warning: [blastn] lcl|Query_1541219 s1185854:917..1100|s1185854:917..1100: Warning: Could not calculate ungapped Karlin-Altschul parameters due to an invalid query sequence or its translation. Please verify the query sequence(s) and/or filtering options

Is there something wrong with the sequence?

Thank you in advance for the reply!

oushujun commented 5 years ago

Hi @KristinaGagalova,

Since different genomes have different LTR dynamics (boom and burst), these dynamics need to be controlled for in interspecific comparisons. I use the whole-genome LTR identity, which is estimated by a whole-genome all-versus-all BLAST, to account for species-level differences. However, an all-versus-all BLAST can be very slow for big genomes. The quick mode is a linear extrapolation from 3 small-sample estimates, which reduces the time compared with a full estimation. The two modes therefore share the same raw LAI but can arrive at different identity estimates, which is why you see the same raw LAI but a different LAI. If you get the chance to run the full whole-genome estimation, you should use that result.
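To make the relationship concrete, here is a minimal sketch that reproduces the raw LAI from the tables above and back-calculates the identity estimate each run must have used. It assumes the correction formula given in the LTR_retriever documentation, LAI = raw LAI + 2.8138 × (94 − whole-genome LTR identity); the constants and all function names are assumptions for illustration and are not stated in this thread.

```python
# Assumed constants from the documented LAI correction formula
# (LAI = raw LAI + 2.8138 * (94 - whole-genome LTR identity)).
SLOPE = 2.8138
REFERENCE_IDENTITY = 94.0


def raw_lai(intact_fraction: float, total_fraction: float) -> float:
    """Raw LAI: intact LTR-RT content as a percentage of total LTR content."""
    return intact_fraction / total_fraction * 100


def adjusted_lai(raw: float, identity: float) -> float:
    """Apply the assumed identity correction to a raw LAI value."""
    return raw + SLOPE * (REFERENCE_IDENTITY - identity)


def implied_identity(raw: float, adjusted: float) -> float:
    """Invert the correction to recover the identity estimate a run used."""
    return REFERENCE_IDENTITY - (adjusted - raw) / SLOPE


# Values taken from the "Complete" and "Quick" tables in this issue.
raw = raw_lai(0.0037, 0.5263)
full_identity = implied_identity(0.70, 1.55)
quick_identity = implied_identity(0.70, 6.33)

print(f"raw LAI: {raw:.2f}")
print(f"implied identity, full run:  {full_identity:.2f}%")
print(f"implied identity, quick run: {quick_identity:.2f}%")
```

Under that assumption, the full run corresponds to an identity estimate of roughly 93.7% and the quick run to roughly 92.0%, so a difference of less than two percentage points in the estimated identity is enough to move the LAI from 1.55 to 6.33.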

The warnings do not matter. They come from BLAST and are probably due to some random sequence errors.

Please let me know if you have further questions.

Best, Shujun

KristinaGagalova commented 5 years ago

Thank you for the explanation @oushujun