oushujun / LTR_retriever

LTR_retriever is a highly accurate and sensitive program for identification of LTR retrotransposons. The LTR Assembly Index (LAI) is also included in this package.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5813529/
GNU General Public License v3.0

get flanking regions of all LTRs #38

Closed: reubwn closed this issue 5 years ago

reubwn commented 5 years ago

Hello,

First, thanks for the really useful program :)

I would like to generate a library of all the upstream and downstream flanking regions adjacent to each LTR insertion in my reference genome. So, the downstream flanking region from the 5' LTR and the upstream flanking region from the 3' LTR (if that is clear?). I can then look for reads that span these genome-LTR boundaries in other sequencing datasets, to test for presence/absence of each LTR insertion.

What would be your recommended approach, based on the output of LTR_retriever? For example, the file "*.pass.list.gff3" has very clear structural components for intact LTR-RTs, including both the 5' and 3' LTRs themselves, but I think this file only contains intact LTR-RTs. The whole-genome annotation file "*.out.gff" has many more candidates, but it is a bit unclear what exactly is in this file: e.g., just the LTRs themselves, or possibly other parts of the LTR-RT, including internal CDS and/or the whole element? Also, this file might contain internal and/or nested LTR-RTs, which might confuse things. Maybe the file "*.LTRlib.nonredundant.fa" is a better place to start, using BLAST to get the genomic coordinates for each LTR entry in this file. In this case, am I correct to say that the "*.LTRlib.fa" files contain only the LTR regions themselves?

Any advice would be much appreciated, and please let me know if I've misinterpreted some of the output files discussed above.

Many thanks for your time, reubwn

oushujun commented 5 years ago

Hello @reubwn,

Thank you for using LTR_retriever. I would like to first clarify the purpose of the files you mentioned. Your understanding of the "*.pass.list.gff3" file is correct: it contains the 5' LTR, 3' LTR, and internal regions. This file describes the LTR information in the "*.pass.list" file, which includes all intact LTR elements identified by LTR_retriever.

The "*.out.gff" file contains the LTR annotation of the entire genome, which is produced by RepeatMasker using the library generated by LTR_retriever. If your goal is to study whole-genome LTR sequences, including degraded fragments, internal coding sequences, and all other LTR-related components, you may use this file.

The "*.LTRlib.fa" file is a non-redundant version of the "*.LTRlib.redundant.fa" file. It contains both LTR and internal regions as separate sequences. You may use the non-redundant library for annotations.

To approach your goal, there is a script that fits your exact need: /LTR_retriever/bin/call_seq_by_list.pl. For usage, run perl /LTR_retriever/bin/call_seq_by_list.pl -h and also read Manual.pdf. You may prepare the list of LTR sequences based on the "*.pass.list" file for intact LTR elements only, or on the "*.out.gff" file for all LTR-related sequences. The list format is:

any_name chr_name:start..end (for a positive-strand sequence)
any_name chr_name:end..start (for a negative-strand sequence)
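As a rough illustration of preparing such a list for flanking regions rather than for the elements themselves, here is a minimal Python sketch. It is not part of LTR_retriever. It assumes that the first whitespace-separated column of each "*.pass.list" line is the element locus written as chr_name:start..end, that coordinates are 1-based and inclusive, and that header lines start with "#". The function names, the 500 bp default, and the "_left"/"_right" naming are all hypothetical choices made for the example.

```python
import re

# Assumed locus format: chr_name:start..end (1-based, inclusive).
LOCUS_RE = re.compile(r"^(?P<chrom>.+):(?P<start>\d+)\.\.(?P<end>\d+)$")


def flank_entries(locus, flank=500):
    """Return list-format lines for the genomic flanks of one element.

    "left" and "right" refer to positive-strand genome coordinates,
    not to the orientation of the element itself.
    """
    match = LOCUS_RE.match(locus)
    if match is None:
        raise ValueError(f"unrecognised locus: {locus!r}")
    chrom = match["chrom"]
    # Sort so that a locus written as end..start is handled the same way.
    start, end = sorted((int(match["start"]), int(match["end"])))
    name = f"{chrom}_{start}_{end}"
    entries = []
    if start > 1:  # no left flank exists if the element starts at base 1
        left = max(1, start - flank)
        entries.append(f"{name}_left {chrom}:{left}..{start - 1}")
    # The right flank is not clamped: chromosome lengths are unknown here.
    entries.append(f"{name}_right {chrom}:{end + 1}..{end + flank}")
    return entries


def flank_list(pass_list_lines, flank=500):
    """Build list-format lines for every element in an iterable of lines."""
    entries = []
    for line in pass_list_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and assumed '#' header lines
        entries.extend(flank_entries(line.split()[0], flank))
    return entries
```

For example, flank_entries("Chr1:1000..6000") yields the two lines "Chr1_1000_6000_left Chr1:500..999" and "Chr1_1000_6000_right Chr1:6001..6500". Because the sketch does not know the chromosome lengths, a right flank near the end of a sequence may run past it and would need to be trimmed before the list is used.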

Please let me know if you need further help.

Best, Shujun

reubwn commented 5 years ago

Hi Shujun,

Many thanks for your clarifications of the output files, this is very helpful.

Cheers, reubwn

oushujun commented 5 years ago

Hi reubwn,

I am glad it helps! I will close this thread for now. If you have further questions, please reopen it or comment directly below.

Best, Shujun