oushujun / LTR_retriever

LTR_retriever is a highly accurate and sensitive program for identification of LTR retrotransposons; The LTR Assembly Index (LAI) is also included in this package.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5813529/
GNU General Public License v3.0

How to resolve the ambiguity between the whole-genome annotation and the all LTR-RTs files? #32

Closed yuanhelianyi closed 5 years ago

yuanhelianyi commented 5 years ago

Hi Shujun,

When I compare the whole-genome annotation with the list of all intact LTR-RTs, the two results disagree, which confuses me. I think this is a result of using the uniq lib for annotation. Is that right? Is it necessary to remove the entries in the intact LTR-RTs file that differ from the annotation file? Examples are as follows: only 4 of 11 intact LTR-RTs are annotated precisely.


yuanhelianyi commented 5 years ago

intact LTR_RTs file

LTR_loc Category Motif TSD 5_TSD 3_TSD Internal Identity Strand SuperFamily TE_type Insertion_Time
chr3B:67330..76355 pass motif:TGCA TSD:GTGGT 67325..67329 76356..76360 IN:67801..75884 0.9597 ? Gypsy LTR 1593198
chr3B:285364..287351 pass motif:TGCA TSD:TTTAG 285359..285363 287352..287356 IN:285900..286815 1.0000 - Copia LTR 0
chr3B:532514..541248 pass motif:TGCA TSD:GTAAG 532509..532513 541249..541253 IN:533004..540758 0.9531 ? Gypsy LTR 1862714
chr3B:998081..1006756 pass motif:TGCA TSD:GATAC 998076..998080 1006757..1006761 IN:999858..1004977 0.9747 - Copia LTR 989868
chr3B:1039812..1048675 pass motif:TGCA TSD:ACGAC 1039807..1039811 1048676..1048680 IN:1040293..1048195 0.9688 ? Gypsy LTR 1225675
chr3B:1384461..1392963 pass motif:TGCA TSD:ATA 1384458..1384460 1392964..1392966 IN:1386153..1391269 0.9539 - Copia LTR 1829911
chr3B:1464448..1478224 pass motif:TGCA TSD:ACTTG 1464443..1464447 1478225..1478229 IN:1466199..1476473 0.9926 - Copia LTR 286029
chr3B:1557619..1567700 pass motif:TGCA TSD:ACCAC 1557614..1557618 1567701..1567705 IN:1558136..1567183 0.9729 ? Gypsy LTR 1061605
chr3B:1634407..1648820 pass motif:TGCA TSD:CCATC 1634402..1634406 1648821..1648825 IN:1635948..1647278 0.9760 + Copia LTR 938169
chr3B:1698438..1712708 pass motif:TGCA TSD:CCGTT 1698433..1698437 1712709..1712713 IN:1702565..1708578 0.9920 + Gypsy LTR 309345
chr3B:2243109..2253093 pass motif:TGCA TSD:CCGCT 2243104..2243108 2253094..2253098 IN:2243580..2252622 0.9873 ? Gypsy LTR 492644
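
For anyone who wants to compare these rows against an annotation programmatically, here is a minimal Python sketch that parses one row of the table above into a record. It assumes the 12 whitespace-separated columns shown in the header; the names `IntactLTR`, `parse_span`, and `parse_pass_row` are hypothetical helpers made up for this illustration and are not part of LTR_retriever.

```python
from dataclasses import dataclass


@dataclass
class IntactLTR:
    """One row of the intact LTR-RT list (illustrative record, not an official type)."""
    chrom: str
    start: int
    end: int
    internal_start: int
    internal_end: int
    identity: float
    strand: str
    superfamily: str
    insertion_time: int


def parse_span(text):
    """Turn 'START..END' into a pair of integers."""
    start, end = text.split("..")
    return int(start), int(end)


def parse_pass_row(line):
    """Parse one data row, assuming the 12 columns shown in the header above."""
    fields = line.split()
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, got {len(fields)}")
    chrom, span = fields[0].rsplit(":", 1)        # 'chr3B:67330..76355'
    start, end = parse_span(span)
    in_start, in_end = parse_span(fields[6].split(":", 1)[1])  # 'IN:67801..75884'
    return IntactLTR(
        chrom=chrom,
        start=start,
        end=end,
        internal_start=in_start,
        internal_end=in_end,
        identity=float(fields[7]),
        strand=fields[8],
        superfamily=fields[9],
        insertion_time=int(fields[11]),
    )


# First row of the table above.
EXAMPLE_ROW = (
    "chr3B:67330..76355 pass motif:TGCA TSD:GTGGT 67325..67329 76356..76360 "
    "IN:67801..75884 0.9597 ? Gypsy LTR 1593198"
)
rec = parse_pass_row(EXAMPLE_ROW)
# Flanking LTR lengths follow from the element and internal-region coordinates.
left_ltr_len = rec.internal_start - rec.start
right_ltr_len = rec.end - rec.internal_end
```
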

oushujun commented 5 years ago

Hello,

Yes, you are right. The mixed annotation is due to the use of the uniq library. RepeatMasker (or rmblastn) simply picks the library entry that aligns most closely to the query sequence. The structure of the intact LTR element does not guide this process. Because TE sequences are repetitive, their annotations are not as precise as gene annotations.

Best, Shujun

yuanhelianyi commented 5 years ago

Hi Shujun, can I conclude that intact LTR-RTs with a mixed annotation are inaccurate? Should they be removed from the intact LTR-RT results? Zhao Jing

oushujun commented 5 years ago

Hi Jing,

Not necessarily. You can verify the LTR structure to confirm. In many cases, the LTR element has other sequences nested inside it, or is itself nested inside other sequences.
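
One rough way to start that check is to tally how many bases of an intact element are covered by each annotation label; a second label occupying a contiguous block inside the element would be consistent with nesting. The Python sketch below is only an illustration under stated assumptions: annotation hits are supplied as `(start, end, label)` tuples with 1-based inclusive coordinates on the same chromosome, and the function name `coverage_by_label` and the labels `"A"`, `"B"`, `"C"` are hypothetical. It does not read RepeatMasker or LTR_retriever output directly.

```python
from collections import defaultdict


def coverage_by_label(elem_start, elem_end, hits):
    """Return {label: bases covered} inside [elem_start, elem_end].

    hits is an iterable of (start, end, label) tuples, 1-based inclusive
    (an assumed input shape for this sketch). Overlapping hits with the
    same label are merged so that no base is counted twice.
    """
    clipped = defaultdict(list)
    for start, end, label in hits:
        lo, hi = max(start, elem_start), min(end, elem_end)
        if lo <= hi:  # keep only hits that overlap the element
            clipped[label].append((lo, hi))

    covered = {}
    for label, spans in clipped.items():
        spans.sort()
        total = 0
        cur_lo, cur_hi = spans[0]
        for lo, hi in spans[1:]:
            if lo <= cur_hi + 1:          # overlapping or adjacent: extend
                cur_hi = max(cur_hi, hi)
            else:                         # gap: close the current span
                total += cur_hi - cur_lo + 1
                cur_lo, cur_hi = lo, hi
        total += cur_hi - cur_lo + 1
        covered[label] = total
    return covered


# Toy example with made-up coordinates and labels.
toy_hits = [(90, 129, "A"), (120, 149, "A"), (150, 179, "B"), (300, 400, "C")]
toy_result = coverage_by_label(100, 199, toy_hits)
```

In this toy case, label A covers 50 bases and label B covers 30 bases of the 100-base element, while C lies outside it. On real data, a pattern like that would only be a hint; the LTR structure itself still needs to be inspected.
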

Thanks, Shujun