oushujun / LTR_retriever

LTR_retriever is a highly accurate and sensitive program for identification of LTR retrotransposons; The LTR Assembly Index (LAI) is also included in this package.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5813529/
GNU General Public License v3.0

Reg. Copia, Gypsy, and Unknown or total LTR density calculation #94

Closed Yedomon closed 3 years ago

Yedomon commented 3 years ago

Dear Dr. Shu Jun Ou, I have predicted the Copia, Gypsy, and Unknown LTRs. However, when I tried to map the identified repeats along with the CDS using RIdeogram with a 50 kb window size, the repeats show very low density. For example:

kb        Gene  LTR
1-50      3     0
51-100    8     0
101-150   10    0
151-200   9     0
201-250   8     0
251-300   11    0
301-350   9     0
351-400   9     1
........

I think we need a different strategy for calculating LTR density. Though this is not directly related to LTR_retriever, could you kindly help me figure out how to use the LTR_retriever data to calculate Copia, Gypsy, and Unknown density in sliding windows of 50 kb with a 25 kb step size?

Thanking you!

Regards, Prabhu, S

oushujun commented 3 years ago

You can use the gff3 file and R to do so. Please make sure you are summarizing the base pairs but not pieces of fragments in your window.

Shujun
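
A minimal Python sketch of the "summarize base pairs, not fragments" advice for 50 kb windows with a 25 kb step. The function name `window_coverage` and the example coordinates are hypothetical, and the sketch assumes the input features do not overlap each other (overlapping annotations would need to be merged first). It is an illustration, not code from LTR_retriever or RIdeogram.

```python
def window_coverage(features, chrom_len, window=50_000, step=25_000):
    """Return (win_start, win_end, covered_bp) for each sliding window.

    features: list of (start, end) tuples, 1-based inclusive as in GFF3.
    Assumes the features do not overlap each other.
    """
    rows = []
    win_start = 1
    while win_start <= chrom_len:
        win_end = min(win_start + window - 1, chrom_len)
        covered = 0
        for start, end in features:
            # Length of the part of this feature that lies inside the window.
            overlap = min(end, win_end) - max(start, win_start) + 1
            if overlap > 0:
                covered += overlap
        rows.append((win_start, win_end, covered))
        win_start += step
    return rows


# Hypothetical coordinates, for illustration only.
example_features = [(10_001, 20_000), (45_001, 60_000)]
for win_start, win_end, covered in window_coverage(example_features, 100_000):
    percent = 100.0 * covered / (win_end - win_start + 1)
    print(win_start, win_end, covered, round(percent, 1))
```

A feature that spans a window boundary contributes only its clipped length to each window, so a long element is not counted as one "hit" in every window it touches.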

Yedomon commented 3 years ago

Yes, I used the gff3 file with the keywords "Copia", "Gypsy", and "unknown" to measure the density. It seems that counting by keyword is the wrong method; as you said, the base pairs should be summarized instead. The number of LTRs should be in balance with the gene density. Could you kindly tell me how to "summarize the base pairs not pieces of fragments"?

Thanking you!

Regards, Prabhu, S

Prabhu89-code commented 3 years ago

Dear Dr. Shu Jun, I am using LTR_retriever to calculate the following:

  1. LTR location along the chromosome length (for density mapping),
  2. Insertion time of identified LTRs,
  3. Percentage of LTRs in the genome and percentage of Copia, Gypsy in the identified LTRs.

For the 1st query I used the xxxx.LTR.gff3 file. For the 2nd I used xxxx.pass.list.gff3 (intact) and nmtf.pass.list (non-TGCA). Which one should I use to calculate the percentage? I believe xxxx.LTR.gff3 is correct, but it does not agree with the intact and non-TGCA files: it lists more locations than they do, and both the numbers and the locations differ between the files.

Also, the percentage reported in xxx.out.superfam.size does not match what we calculate manually as bps / total bp of the genome. Does "bps" mean the total base pairs of Copia, Gypsy, and unknown in the given genome?

Could you kindly clarify my doubts above on the usage of LTR_retriever results? I hope I am not burdening you.

Thanking you!

Regards, Prabhu, S

oushujun commented 3 years ago

Hi Prabhu,

The gff3 file is for genome visualization, not for summary: it contains nested elements and overlapping annotations, so simply adding things up will result in overestimation.

For 2, are you sure you just want to look at non-TGCA LTRs? These are quite rare (~1% frequency).

For 3, you may need to write your own code for such a summary. You can check out RIdeogram or similar, and modify their code for your purpose.

Shujun
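
To illustrate why simply adding up gff3 records overestimates, here is a small Python sketch that merges nested and overlapping intervals before summing base pairs. The helper names (`merge_intervals`, `total_bp`) and the coordinates are hypothetical, and this is not necessarily how LTR_retriever itself computes xxx.out.superfam.size; it only shows the effect of double counting.

```python
def merge_intervals(intervals):
    """Merge overlapping or nested 1-based inclusive (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps, or is nested in, the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def total_bp(intervals):
    """Sum the lengths of 1-based inclusive intervals."""
    return sum(end - start + 1 for start, end in intervals)


# Hypothetical annotations on one chromosome: the second is nested in the
# first, and the third overlaps the end of the first.
annotations = [(1_000, 9_000), (3_000, 5_000), (8_500, 12_000)]
naive_bp = total_bp(annotations)                         # double counts shared bases
nonredundant_bp = total_bp(merge_intervals(annotations)) # each base counted once
print(naive_bp, nonredundant_bp)
```

Merging everything gives the non-redundant total for all LTR annotations together. Splitting that total by superfamily additionally requires deciding which element "owns" a nested region, which this sketch does not attempt.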

Prabhu89-code commented 3 years ago

Dear Dr. Shu Jun, Thank you for your reply. I understand the first and second answers. For calculating the percentage of Copia, Gypsy, and Unknown in the whole genome, which file should I use? Is xxx.LTR.gff3 enough? I ask because xxx.out.superfam.size does not match the manual calculation.

Thanking you!

Regards, Prabhu, S

oushujun commented 3 years ago

xxx.LTR.gff3 is enough. As I mentioned in the previous reply, xxx.out.superfam.size accounts for nested insertions and overlaps. Manual calculation tends to overestimate.

Prabhu89-code commented 3 years ago

[image: LTR_retriever_eg]

Dear Dr. Shu Jun, Here I have attached an image of the file from which I am planning to calculate Copia and Gypsy over the chromosome.

  1. In this case, am I right to map as follows? Copia 1-50 kb: 5; Gypsy 1-50 kb: 1; Copia 51-100 kb: 9; Gypsy 51-100 kb: 7

  2. What does column 6 stand for (which I have highlighted in yellow)?

  3. Is there any way to calculate the insertion period of LTRs in xxx.LTR.GFF3, or is it only possible for intact and non-TGCA LTRs?

Thanking you!

Regards, Prabhu, S

Prabhu89-code commented 3 years ago

Dr. Shu Jun, Could you please also clarify the difference between "intact TGCA LTR" in xxxx.pass.LIST and "LTR" in the xxx.LTR.GFF3 file? It will be very helpful for preparing the manuscript.

Thanking you!

Regards, Prabhu, S

oushujun commented 3 years ago

The gff3 file contains both structurally intact and fragmented LTRs. The last tag in the last column tells that info: "homology" means annotated using RepeatMasker. The pass.list file only contains intact LTRs, which could be used to calculate insertion time.

Column 6 is alignment score if I remember correctly.

The number of fragmented LTRs cannot represent the number of LTR insertions. Please see other discussions.

Shujun
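
As a rough Python sketch of reading "the last tag in the last column", the code below takes a GFF3 line, pulls the final `key=value` pair from column 9, and checks whether its value is "homology". The helper names, the example lines, the `Method` key, and the "structural" value are all assumptions made for illustration; the exact attribute layout may differ between LTR_retriever versions, so check your own file before relying on it.

```python
def last_tag(gff3_line):
    """Return (key, value) of the last attribute in column 9 of a GFF3 line."""
    columns = gff3_line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("expected 9 tab-separated GFF3 columns")
    # Drop empty pieces so a trailing ';' does not hide the last tag.
    attributes = [a for a in columns[8].split(";") if a]
    key, _, value = attributes[-1].partition("=")
    return key, value


def is_homology_based(gff3_line):
    """True if the last tag says the record was annotated by homology."""
    return last_tag(gff3_line)[1].lower() == "homology"


# Hypothetical records; key names and the 'structural' value are assumed.
intact_example = "\t".join(
    ["Chr1", "LTR_retriever", "LTR_retrotransposon", "1000", "6000",
     ".", "+", ".", "ID=example_1;Method=structural"])
fragment_example = "\t".join(
    ["Chr1", "LTR_retriever", "LTR_retrotransposon", "9000", "9400",
     "1200", "-", ".", "ID=example_2;Method=homology"])

print(is_homology_based(intact_example), is_homology_based(fragment_example))
```

Filtering this way separates homology-based records from the rest; it does not by itself make the fragments suitable for insertion-time estimates, which, as noted above, rely on the intact elements in the pass.list file.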

Prabhu89-code commented 3 years ago

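[image: Gypsy_Copia_Landscape] https://user-images.githubusercontent.com/54341159/123928302-c19e5d80-d9c8-11eb-9c27-493260e35d1b.jpg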

My goal is to make a figure like the one above. So I would like to choose the correct result to plot the Copia and Gypsy landscape on the chromosome. Could you kindly let me know which file I need to choose to measure the density of Copia and Gypsy?

Thanking you!

Regards, Prabhu, S

oushujun commented 3 years ago

Hi Prabhu,

It depends on your research goal. If you want to visualize the abundance of intact LTR elements, then use the intact.list; if you want to visualize all LTR abundance, then you should use the whole-genome LTR annotation (gff3). You probably want to calculate the percentage rather than the count, because the elements have very different lengths.

Best, Shujun
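
The count-versus-percentage point can be seen in a small Python sketch. For one window it reports, per superfamily, both the number of elements and the percentage of the window they cover. The function name `summarize_window` and the coordinates are hypothetical, and elements within a superfamily are assumed not to overlap.

```python
from collections import defaultdict


def summarize_window(elements, win_start, win_end):
    """Return {superfamily: (count, percent_of_window)} for one window.

    elements: list of (superfamily, start, end), 1-based inclusive,
    assumed not to overlap each other within a superfamily.
    """
    window_len = win_end - win_start + 1
    counts = defaultdict(int)
    bases = defaultdict(int)
    for superfamily, start, end in elements:
        # Clip the element to the window before measuring its length.
        overlap = min(end, win_end) - max(start, win_start) + 1
        if overlap > 0:
            counts[superfamily] += 1
            bases[superfamily] += overlap
    return {sf: (counts[sf], 100.0 * bases[sf] / window_len) for sf in counts}


# Hypothetical window: three short Copia fragments and one long Gypsy element.
elements = [
    ("Copia", 1_001, 1_500),
    ("Copia", 5_001, 5_400),
    ("Copia", 9_001, 9_600),
    ("Gypsy", 20_001, 32_000),
]
summary = summarize_window(elements, 1, 50_000)
print(summary)
```

In this made-up window Copia has three times as many records as Gypsy but covers far less sequence, which is why a landscape plot based on counts can look very different from one based on covered percentage.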

Prabhu89-code commented 3 years ago

Dear Dr. Shu Jun, Thank you very much for your answer. I got the clarification. I will use the .gff3 result for the landscape of all Copia and Gypsy elements in the chromosome.

Thanking you!

Regards, Prabhu, S