oushujun / LTR_retriever

LTR_retriever is a highly accurate and sensitive program for identification of LTR retrotransposons; The LTR Assembly Index (LAI) is also included in this package.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5813529/
GNU General Public License v3.0

Help with interpretation of the output (.out.gff) #75

Closed ValentinaBoP closed 4 years ago

ValentinaBoP commented 4 years ago

Dear Shujun,

First, thanks for developing this powerful tool :)

I am a little confused about how to interpret the last column of the .mod.out.gff file, which should contain the whole-genome LTR-RT annotation produced with the non-redundant library. For example, I wanted to look specifically at the LTRs annotated on the sequence CM000121.5.

CM000121.5  RepeatMasker  LTR/Gypsy  2866227  2866508  10.9  -  1800  CM000093.5:20437004..20442923_INT
CM000121.5  RepeatMasker  LTR/Gypsy  2866604  2866710  15.9  +  692   CM000106.5:2218171..2222316_INT
CM000121.5  RepeatMasker  LTR/Gypsy  2866693  2867024  22.9  +  1069  CM000121.5:1767470..1767798_LTR
CM000121.5  RepeatMasker  LTR/Gypsy  2867121  2867711  30.1  -  2048  CM000110.5:18883..20848_INT
CM000121.5  RepeatMasker  LTR/Gypsy  2867703  2868069  11.2  -  2484  CM000121.5:2374463..2381798_INT
CM000121.5  RepeatMasker  LTR/Gypsy  2868985  2869492  8.5   -  3458  CM000122.5:30302383..30302880_LTR

Columns 4 and 5 contain the genomic coordinates of the LTR elements, but what does the last column mean? Does it mean that the sequence annotated on CM000121.5 is also found at (first row) CM000093.5:20437004..20442923_INT?

Can you please briefly explain this output? Also, what are column 6 and 8?

Thank you for your help and time!

Valentina

oushujun commented 4 years ago

Hi Valentina,

Thank you for using LTR_retriever. It looks like you are using a version earlier than 2.8.7, in which the GFF3 header is:

Chromosome Annotator Repeat_class/superfamily Start End Diversity(%) Strand SW_score Repeat_famliy

So column 9, Repeat_famliy, is the repeat family name. I used the location where the repeat was originally found as the family name, sorry for the confusion. Column 8, SW_score, is the RepeatMasker Smith–Waterman score: the higher the score, the more confident the alignment between the library sequence and the annotated region. Column 6, Diversity(%), is the divergence between the library sequence and the annotated region.
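To make the column meanings above concrete, here is a minimal Python sketch that splits one row of the pre-2.8.7 layout into named fields. The `OldRow` type and `parse_old_row` function are hypothetical names made up for illustration and are not part of LTR_retriever. The sketch also assumes that no field contains whitespace, which holds for the rows quoted in this thread.

```python
from typing import NamedTuple


class OldRow(NamedTuple):
    """One annotation row in the pre-2.8.7 column order (hypothetical helper type)."""
    chromosome: str
    annotator: str
    repeat_class: str      # Repeat_class/superfamily, e.g. LTR/Gypsy
    start: int
    end: int
    diversity_pct: float   # column 6: divergence from the library sequence
    strand: str
    sw_score: int          # column 8: RepeatMasker Smith-Waterman score
    repeat_family: str     # column 9: named after where the repeat was first found


def parse_old_row(line: str) -> OldRow:
    # Assumption: fields are separated by tabs or spaces and contain no whitespace.
    f = line.split()
    if len(f) != 9:
        raise ValueError(f"expected 9 columns, got {len(f)}")
    return OldRow(f[0], f[1], f[2], int(f[3]), int(f[4]),
                  float(f[5]), f[6], int(f[7]), f[8])


# First row from the question above.
row = parse_old_row(
    "CM000121.5 RepeatMasker LTR/Gypsy 2866227 2866508 10.9 - 1800 "
    "CM000093.5:20437004..20442923_INT"
)
print(row.repeat_family, row.sw_score, row.diversity_pct)
```

Read this way, the first row says that the region 2866227..2866508 on CM000121.5 matches the library entry named CM000093.5:20437004..20442923_INT, with a score of 1800 and 10.9% divergence.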

For v2.8.7 and later, I reordered the column information:

seqid source repeat_class/superfamily start end sw_score strand phase attributes

The 6th column is now sw_score, and the divergence info has moved to the 9th column, leaving the 8th column blank to comply with the standard GFF3 format.
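For files written with the newer column order, a reader might look like the Python sketch below. It only relies on the column positions listed above. `parse_new_row` is a hypothetical helper, and the example line is constructed for illustration, not real tool output. Since the exact layout of the 9th column is not described in this thread, the sketch keeps it as a raw string and does not try to extract the divergence from it.

```python
def parse_new_row(line: str) -> dict:
    """Split one tab-separated row in the v2.8.7+ column order (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqid, source, repeat_class, start, end, sw_score, strand, phase, attributes = fields
    return {
        "seqid": seqid,
        "source": source,
        "repeat_class": repeat_class,
        "start": int(start),
        "end": int(end),
        "sw_score": int(sw_score),   # column 6 now holds the Smith-Waterman score
        "strand": strand,
        "phase": phase,              # column 8 carries no information for repeats
        "attributes": attributes,    # column 9, kept verbatim (format not assumed)
    }


# Constructed example line; the last field is a placeholder, not real output.
example = "\t".join([
    "CM000121.5", "RepeatMasker", "LTR/Gypsy", "2866227", "2866508",
    "1800", "-", ".", "PLACEHOLDER_ATTRIBUTES",
])
record = parse_new_row(example)
print(record["sw_score"], record["phase"])
```

The practical consequence of the reordering is that a script which reads column 6 as divergence on old files would silently pick up the score on new files, so it is worth checking the header line of a file before parsing it.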

Best, Shujun

ValentinaBoP commented 4 years ago

Thanks for the clarification!!