oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0
346 stars 73 forks source link

Draw Repeat Landscapes #92

Closed CraigMichell closed 4 years ago

CraigMichell commented 4 years ago

Hi,

This is not a complaint about the program at all. I think that it is great and easy to use. I have been able to identify repeat elements in two genomes now with relative ease. I am a relative newbie to the area of repeat genomics and I would like to continue the analysis with the repeats identified from the EDTA pipeline.

Would you be able to advise on the best way to use the information from EDTA to create a repeat landscape type plot? I tried taking the repeat library created and masking my genome using that with RepeatMasker and following the next few steps for creating a repeat landscape plot. But the landscape plot is empty, I have a feeling it is because the classification of repeats is different? But as I said, I am a bit naive when it comes to analysis of repeat elements.

Any help or suggestions would be greatly appreciated. Thanks!

oushujun commented 4 years ago

Hello,

Thank you for using EDTA. There are many ways you can approach your goal, but most, if not all, of them, require some coding skills in one kind or the other. I have an old code that was developed to summarize LTR density in the LTR_retriever package: https://github.com/oushujun/LTR_retriever/blob/master/bin/LTR_sum.pl.

Alternatively, you may try karyoploteR, which is made to fulfill similar tasks. But again, this time you need some R coding skills. Similarly, you can also try a simpler version in ggplot: https://www.biostars.org/p/69748/

Hope these are helpful and feel free to share your codes here.

Best, Shujun

KristinaGagalova commented 4 years ago

I don't know if you figured that out but I've made my own visualization with R using the results from RepeatMakser - createRepeatLandscape.pl + the histogram information at the bottom of the output file. You can create a stacked barplot with ggplot2 which works perfectly for a publication. Let me know if you still need hep with that!

oushujun commented 4 years ago

@KristinaGagalova Can you share the codes and usage here so that other uses may benefit from your hard work? Thanks!

KristinaGagalova commented 4 years ago

Yes, of course! Should I just upload my script here with some examples data or anywere you like better?

oushujun commented 4 years ago

@KristinaGagalova Please upload them here. I have pinned this issue to the top so that everyone looking for this kind of illustration will see it. Thanks Kristina!

KristinaGagalova commented 4 years ago

Code is below

#############################
###Plot Kimura distance######
#############################

library(reshape)
library(ggplot2)
library(viridis)
library(hrbrthemes)
library(tidyverse)
library(gridExtra)

sessionInfo()
#R version 3.6.1 (2019-07-05)
#Platform: x86_64-conda_cos6-linux-gnu (64-bit)
#Running under: Ubuntu 20.04.1 LTS

#attached base packages:
#  [1] stats     graphics  grDevices utils     datasets  methods   base     
#other attached packages:
#  [1] gridExtra_2.3     forcats_0.5.0     stringr_1.4.0     dplyr_0.8.5       purrr_0.3.4      
#  [6] readr_1.3.1       tidyr_1.1.0       tibble_3.0.1      tidyverse_1.3.0   hrbrthemes_0.8.0 
#  [11] viridis_0.5.1     viridisLite_0.3.0 ggplot2_3.3.0     reshape_0.8.8    

KimuraDistance <- read.csv("/path/to/the/tbl/file",sep=" ")

#add here the genome size in bp
genomes_size=230000000

kd_melt = melt(KimuraDistance,id="Div")
kd_melt$norm = kd_melt$value/genomes_size * 100

ggplot(kd_melt, aes(fill=variable, y=norm, x=Div)) + 
  geom_bar(position="stack", stat="identity",color="black") +
  scale_fill_viridis(discrete = T) +
  theme_classic() +
  xlab("Kimura substitution level") +
  ylab("Percent of the genome") + 
  labs(fill = "") +
  coord_cartesian(xlim = c(0, 55)) +
  theme(axis.text=element_text(size=11),axis.title =element_text(size=12))

Below a the expected input (generated by createRepeatLandscape.pl from RepeatMasker) Test_createRepeatLandscapePlot.txt

Expected output Rplot04

This plots all the classes of repeats, we may want to create a more efficient grouping of those, using the output from EDTA and including that into few classes of repeats.

oushujun commented 4 years ago

@KristinaGagalova Thanks for sharing the code. I notice that Gypsy is missing the input and the figure, is it something wrong or you want to leave it out?

KristinaGagalova commented 4 years ago

@oushujun Just for double checking, below the summary from EDTA

Repeat Classes
==============
Total Sequences: 5497
Total Length: 201821862 bp
Class                  Count        bpMasked    %masked
=====                  =====        ========     =======
DNA                    --           --           --   
    DTA                10262        2182169      1.08% 
    DTC                15205        3015590      1.49% 
    DTH                4129         831309       0.41% 
    DTM                24634        5549959      2.75% 
    DTT                2161         400460       0.20% 
    Helitron           16470        3333789      1.65% 
LTR                    --           --           --   
    Copia              23           7180         0.00% 
    unknown            6364         1346226      0.67% 
MITE                   --           --           --   
    DTA                1946         352337       0.17% 
    DTC                353          73131        0.04% 
    DTH                315          60310        0.03% 
    DTM                1804         297687       0.15% 
    DTT                4            448          0.00% 
                      ---------------------------------
    total interspersed 83670        17450595     8.65%

---------------------------------------------------------
Total                  83670        17450595     8.65%

Looks like the pipelinen did not identify any of LTR Gypsy, I am not sure if that's unusual. I have also added the repeats from Repbase from a closer species that has several LTR but looks like RepeatMasker did not identify also those This is hower a small genomes and with low % of repeats. If you have any feedback on that may be helpful, but the image reflects the repeats annotation.

oushujun commented 4 years ago

@KristinaGagalova I see, that makes sense! Thanks for sharing your data and the code.

aaronphillips7493 commented 3 years ago

Hello,

I am interested in producing a similar plot for my genome of interest. I have recently run EDTA on the genome. Is the Kimura substitution plot possible to make from the output of EDTA (i.e. is the output in one of the RepeatMasker directories)? r do I need to run RepeatMasker separately using the TE library generated by EDTA?

Thank you in advance, Aaron :)

oushujun commented 3 years ago

Hi Aaron,

If you check the above conversations, you will see the input is obtained from createRepeatLandscape.pl from RepeatMasker. While I never tried myself, you may check it out and see what the script needs.

Best, Shujun

On Fri, Feb 26, 2021 at 12:20 PM aaronphillips7493 notifications@github.com wrote:

Hello,

I am interested in producing a similar plot for my genome of interest. I have recently run EDTA on the genome. Is the Kimura substitution plot possible to make from the output of EDTA (i.e. is the output in one of the RepeatMasker directories)? r do I need to run RepeatMasker separately using the TE library generated by EDTA?

Thank you in advance, Aaron :)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/oushujun/EDTA/issues/92#issuecomment-786400960, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNX4NCE6GX4WEZRDOB53F3TA4ORHANCNFSM4OTDDQUA .

aaronphillips7493 commented 3 years ago

Hey Shujun,

Thank you for your reply. I apologise, I got confused by the "RM" files generated by EDTA. I forgot that they come from Repeat Modeller not Repeat Masker. It seems I will have to run Repeat Masker separately and then follow createRepeatLandscape.pl from RepeatMasker.

Aaron :)

oushujun commented 3 years ago

Sorry this is confusing. RM could be both in different context. It means RepeatModeler in the final folder and RepeatMasker in all other contexts. So you can get a masker-like out file in the anno folder, but I am not sure if RepeatMasker can recognize it. -Shujun

On Fri, Feb 26, 2021 at 2:24 PM aaronphillips7493 notifications@github.com wrote:

Hey Shujun,

Thank you for your reply. I apologise, I got confused by the "RM" files generated by EDTA. I forgot that they come from Repeat Modeller not Repeat Masker. It seems I will have to run Repeat Masker separately and then follow createRepeatLandscape.pl from RepeatMasker.

Aaron :)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/oushujun/EDTA/issues/92#issuecomment-786440931, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNX4NG7FT7I55WLYT3R3BLTA45BPANCNFSM4OTDDQUA .

CraigMichell commented 3 years ago

Now that I have finally been able to come back to analysing the repeats almost a year later. I just wanted to thank everyone for their suggestions and help with this.

I thought I would outline the commands that I used in the end for generating a repeat landscape from the EDTA library.

The Kimura distance table is made up of the last 72 lines of the divsum file and can be extracted using

This then provides table which can be read into the R script from @KristinaGagalova above.

A compilation of the landscapes that I plotted are attached.

Repeat_Landscapes

I am interested in figuring out what is up with the large peak with a Kimura distance of 0 in each of the landscapes. Could it be some artifact of the process?

oushujun commented 3 years ago

Thanks for sharing your route! You may want to check the LTR Assembly Index to make sure these genomes are in good Quality for fair comparison. It seems to me that the lappo genome has higher quality than the aestiva genome.

-Shujun

On Fri, May 28, 2021 at 6:51 PM CraigMichell @.***> wrote:

Now that I have finally been able to come back to analysing the repeats almost a year later. I just wanted to thank everyone for their suggestions and help with this.

I thought I would outline the commands that I used in the end for generating a repeat landscape from the EDTA library.

  • singularity_wrapper exec EDTA.pl --genome genome.fasta --species others --step all --anno 1 --threads 16 --overwrite 1
  • RepeatMasker -pa 2 -s -a -inv -dir ./RepMask -no_is -norna -xsmall -nolow -div 40 -lib EDTA.TElib.fa -cutoff 225 genome.fasta
  • calcDivergenceFromAlign.pl -s genome.divsum genome.fasta.align

The Kimura distance table is made up of the last 72 lines of the divsum file and can be extracted using

  • tail -n 72 genome.divsum > genome.Kimura.distance

This then provides table which can be read into the R script from @KristinaGagalova https://github.com/KristinaGagalova above.

A compilation of the landscapes that I plotted are attached.

[image: Repeat_Landscapes] https://user-images.githubusercontent.com/39945819/119971780-1578e980-bfba-11eb-8148-17308e076cfc.jpg

I am interested in figuring out what is up with the large peak with a Kimura distance of 0 in each of the landscapes. Could it be some artifact of the process?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/oushujun/EDTA/issues/92#issuecomment-850332587, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNX4ND67MAOSJIMU6UDMP3TP5YT7ANCNFSM4OTDDQUA .

KristinaGagalova commented 3 years ago

This looks like an interesting package - https://github.com/dwinter/repeatR

BenAawf commented 1 year ago

First of all, thank you all. I would appreciate it if any could help me understand what those values (numbers) mean in simple words : Div DNA/DTA DNA/DTC DNA/DTH DNA/DTM DNA/DTT DNA/Helitron LTR/Copia LTR/unknown MITE/DTA MITE/DTC MITE/DTH MITE/DTM MITE/DTT Simple_repeat 0 56501 195150 74881 366703 31760 265453 1597 8915 15679 6790 3767 19197 448 0 1 4129 18025 1400 25603 1085 26781 0 55 999 251 70 10725 0 0 2 7610 30725 3424 32479 761 18890 0 2067 3992 154 87 9642 0 0 3 13590 51664 6968 50551 1026 25304 627 1250 6389 1838 125 9066 0 0 4 21745 57150 10975 64491 3487 35745 1091 3717 11783 2957 310 9757 0 0 5 35444 65082 17207 80812 2897 48271 770 5976 13448 2987 840 10382 0 0

the above data is a head file from @KristinaGagalova dataset

Thank you

KristinaGagalova commented 1 year ago

Hi, these are the type of repeats that are annotated (broad definition) and their corresponding counts in the genome. It's basically a quantification per class. The div column shows the % divergence and it's the x-axis in Kimura plot

joannarifkin commented 10 months ago

https://github.com/jtlovell/GENESPACE might be helpful here too!

oushujun commented 7 months ago

By default now EDTA (v2.2.1) will draw the repeat landscape directly as one output. Here's what it looks like:

image

I am unpinning this thread for its retirement. Thank you everyone for providing codes and ideas for this topic!!

Shujun

ltswx commented 3 weeks ago

@KristinaGagalova

Thanks for sharing! but I'm confused ,I added up all the numbers in the divsum file except for the first column and divided by the size of the genome,;There is a huge difference between the results of the two, the result of the *EDTA file is 68.17%, and the result of my calculation is 98%!, I would like to ask where I misunderstood,Because I'm trying to draw a pie chart that shows the percentage of these transposons QAQ

*mod.EDTA.TEanno.sum: Repeat Classes

Total Sequences: 3487 Total Length: 1322312127 bp Class Count bpMasked %masked ===== ===== ======== ======= LTR -- -- --
Copia 105891 108126703 8.18% Gypsy 292804 391191137 29.58% unknown 182543 139936847 10.58% TIR -- -- --
CACTA 61380 26705216 2.02% Mutator 107653 114277135 8.64% PIF_Harbinger 28574 8112964 0.61% Tc1_Mariner 125476 35791242 2.71% hAT 46252 20984760 1.59% nonTIR -- -- --
helitron 148352 56359955 4.26%

total interspersed 1098925      901485959    68.17%

Total 1098925 901485959 68.17% —————————————————————————— tbl file form repeatmasker: total length: 1333551035 bp (1322312127 bp excl N/X-runs) GC level: 36.87 % bases masked: 946989405 bp ( 71.01 %) —————————————————————————— divsum file: Div DNA/DTA DNA/DTC DNA/DTH DNA/DTM DNA/DTT DNA/Helitron LTR/Copia LTR/Gypsy LTR/unknown MITE/DTA MITE/DTC MITE/DTH MITE/DTM MITE/DTT 0 2372488 1566743 448260 1399704 1862233 3080962 4090922 14477189 2989812 125802 12497 41037 54060 84787 1 1070198 565642 179819 699623 1098895 2032132 2420914 10119770 2200132 10879 422 4182 3855 8880 2 715860 855177 218688 527116 470136 824655 1995255 11309963 2177667 10787 217 13666 6496 22494 3 1035157 842382 208109 521912 349325 986025 3128677 13492012 2677114 34770 2122 23326 24197 34870 4 781227 794300 166625 727909 508392 1212696 5397677 12978613 3680115 38307 642 28485 42707 45816 5 1406210 769048 173994 906731 624432 1341063 4343808 15285779 4647044 52326 0 27940 63166 54184 6 770654 831030 197007 713560 703585 1545371 4483437 17083630 5650726 92590 2812 42051 40711 60726 7 726316 858708 154043 691578 791870 1792639 4613580 20229613 8154480 112670 1532 42463 33844 86128 8 681396 938817 168054 743208 952169 1854848 5054305 21196254 9894821 140268 2550 51597 43002 107128 9 604952 978478 167630 809106 1004107 2419183 4790173 20091691 11198972 180950 1613 55319 55831 127289 10 639736 990794 224385 941584 1166862 2545176 5273654 19311001 10555231 148188 1665 74280 64342 160882 11 620142 995990 229171 1005233 1330302 2645505 6078967 19495570 10611831 151829 1756 94766 81253 168221 12 596736 979397 262376 954856 1551576 3140240 6313904 18661570 10322872 151098 2935 100498 82606 191351 13 728465 910078 260739 934970 1566219 2568548 5546549 16320060 9657407 161395 1732 87144 69791 205104 14 696420 827986 224541 956190 1570759 2588673 4749201 15004961 9009895 148651 728 90783 81420 198661 15 669376 833180 231964 999698 1651570 2570283 5255747 13781616 7720243 150840 996 97759 90209 182999 16 626601 849532 220414 1187371 1608941 2571378 4553700 12602258 6722391 143630 982 87894 84362 168931 17 606426 820565 211516 1639720 1620009 2476461 4492894 11664936 4912886 141461 1416 95705 73054 169030 18 498113 818626 206386 3778260 1699336 2249800 4037018 10565373 3775558 126250 2077 95020 67744 157855 19 440143 853099 215911 13342788 1875211 2275643 3471026 9478961 3253924 109898 1334 89650 68048 141827 20 392077 921635 193810 42271944 1871490 2216555 2826372 9014891 2963606 106505 1689 75025 63812 136338 21 335578 906159 207703 74051399 1808891 2058996 2201190 8725154 2666056 93689 3071 77575 59535 137876 22 337415 814555 199042 77079780 1669412 2297319 2156443 9062914 2592126 96448 2413 68705 47452 127404 23 323405 841229 220502 50788912 1512184 1951457 2198345 8346377 2718104 81280 2700 61415 41378 118290 24 291419 945761 195468 22965594 1384568 1826849 1968953 7620308 2271787 83316 1625 59250 42139 119858 25 282724 969595 185307 8630729 1259230 1717695 2008782 7822866 1933346 75091 1906 63850 45515 112735 26 258653 1137225 187122 3661803 1227592 1717284 2137466 8457821 1745714 71195 3165 58693 39017 106227 27 219445 1161779 195278 2189846 1125775 1608901 2016681 10016293 1756664 65509 2889 57048 40999 96926 28 231243 899960 220493 1680100 1043300 1454815 1593123 10659534 1419714 61426 1396 51211 35632 90255 29 196252 945695 179355 1347855 944943 1367978 1354270 10214950 931468 54643 1544 45752 35158 79617 30 166707 1050287 151917 1082359 857792 1246436 1140561 10337721 719851 46964 3272 42227 30448 67622 31 148241 1082000 121590 942409 751039 1241058 1172321 11160153 684410 40465 1249 36663 29725 60376 32 120702 1228329 107612 798325 688518 1184030 1591327 11458913 522652 31872 2041 28835 25363 50892 33 93286 1346042 102054 808583 607963 1215827 3500491 15885122 398462 30841 2728 26558 18758 33776 34 88289 1300309 92674 864415 532755 940712 3111923 12654879 318864 23740 1587 25872 13842 34598 35 65974 1239054 88656 717085 457126 895603 1933532 6521390 253536 17706 1642 23493 14958 22788 36 62436 1009345 72722 596948 378460 871168 1322333 4951858 186387 14138 1756 17894 14374 16911 37 59819 964493 61552 598119 342657 909615 1271040 4205225 152514 12546 1540 12958 7619 11667 38 71842 827383 58303 510491 281851 876939 816970 3088871 114682 10054 2013 9517 6057 7412 39 80441 600820 43224 446275 242803 844172 475190 2727504 83851 5793 945 6229 4485 7358 40 85481 532881 31149 452050 199443 785084 411575 2318517 64842 7028 1331 5331 4348 5314 41 71160 394552 23060 450610 197408 745368 383154 1716829 57607 5927 621 4419 3544 5126 42 66245 301210 29843 423033 170501 642513 474986 1313448 40232 2742 1318 3570 3048 2752 43 45227 246635 15450 367600 153636 548760 446599 1091995 28303 2838 1858 1500 2678 970 44 44203 188813 12042 292453 135443 469329 258269 822704 18425 1827 442 993 1723 755 45 35052 125046 8460 275099 70085 369097 165378 600995 19951 974 1237 2323 1396 372 46 26973 83028 4980 176374 57059 247696 88752 456999 14717 923 1121 906 357 618 47 32331 56993 2486 107917 49920 170429 43761 281656 8232 397 629 344 144 90 48 20182 33233 2104 64963 31659 104414 21335 209202 5157 479 455 256 138 0 49 7346 19459 1424 51345 15897 66061 8467 147324 3541 35 0 0 0 241 50 2521 10815 714 18920 6571 46435 4718 76488 2311 0 0 0 0 108 51 2478 5220 401 13105 6192 24298 1617 33519 1250 0 0 0 0 0 52 791 4161 300 9694 5093 19305 1071 15578 798 0 0 170 0 0 53 886 2615 128 2635 984 10098 256 7948 988 168 0 0 0 0 54 440 391 0 2457 925 2453 849 3052 1432 166 0 0 0 0 55 0 217 0 632 0 2198 39 2675 777 0 0 0 0 0 56 0 517 0 245 0 900 0 161 631 0 0 0 0 0 57 0 230 0 0 0 507 78 38 390 0 0 0 0 0 58 0 0 0 0 0 154 151 0 1169 0 0 0 84 0 59 0 0 0 0 0 0 107 0 87 0 0 0 0 0 60 0 0 0 0 0 259 0 338 216 0 0 0 0 0 61 0 0 0 0 0 0 137 159 148 0 0 0 0 0 62 0 0 0 123 0 0 0 17 276 0 0 0 0 0 63 0 0 0 0 0 0 9 0 586 0 0 0 0 0 64 0 0 0 0 0 0 0 0 179 0 0 0 0 0 65 0 0 0 0 0 0 0 0 0 0 0 0 0 0 66 0 0 0 0 0 0 0 0 144 0 0 0 0 0 67 0 0 0 0 0 0 0 0 0 0 0 0 0 0 68 0 0 0 0 0 0 38 0 125 0 0 0 0 0 69 0 0 0 0 0 0 11 0 0 0 0 0 0 0 70 0 0 0 0 0 0 18 0 0 0 0 0 0 0