oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

Nomenclature discrepancy? #22

Closed ghost closed 5 years ago

ghost commented 5 years ago

Hi, here are the counts per classification in the TE library genome.FLYE.sixLongest.fa.EDTA.TElib.fa:

DNA/DTA 52
DNA/DTC 50
DNA/DTH 476
DNA/DTM 654
DNA/DTT 2722
DNA/Helitron    15
LTR/Gypsy   38
LTR/unknown 20
MITE/DTA    75
MITE/DTC    10
MITE/DTH    88
MITE/DTM    104
MITE/DTT    570
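For reference, a per-class tally like the one above can be reproduced with a few lines of Python. This is only a sketch: it assumes the library headers follow the `>name#Class/Superfamily` convention used for RepeatMasker custom libraries, and the three records in the example are made up, not taken from the actual library.

```python
from collections import Counter
import io

def count_classes(fasta_lines):
    """Count library sequences per classification (the text after '#')."""
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        header = fields[0] if fields else ""  # drop any free-text description
        # Assumed header shape: name#Class/Superfamily
        _, _, classification = header.partition("#")
        counts[classification or "unclassified"] += 1
    return counts

# Made-up records, for illustration only
example = io.StringIO(
    ">TE_00000001#DNA/DTA\nACGT\n"
    ">TE_00000002#DNA/DTA\nACGT\n"
    ">TE_00000003#LTR/Gypsy\nACGT\n"
)
example_counts = count_classes(example)
for classification, n in sorted(example_counts.items()):
    print(classification, n)
```

To run it on a real library, pass an open file handle instead of the `io.StringIO` object.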

Then I ran RepeatMasker:

RepeatMasker genome.FLYE.sixLongest.fa -no_is -pa 8 -lib genome.FLYE.sixLongest.fa.EDTA.TElib.fa

Here is the summary:

==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements         1333       187637 bp    0.16 %
   SINEs:               20         1160 bp    0.00 %
   Penelope             63         3689 bp    0.00 %
   LINEs:              487        62803 bp    0.05 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex         12          561 bp    0.00 %
     R1/LOA/Jockey      23         2819 bp    0.00 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B          50        23094 bp    0.02 %
     L1/CIN4           177        20812 bp    0.02 %
   LTR elements:       826       123674 bp    0.11 %
     BEL/Pao           105         7431 bp    0.01 %
     Ty1/Copia           2          131 bp    0.00 %
     Gypsy/DIRS1       256        55114 bp    0.05 %
       Retroviral      179        10844 bp    0.01 %

DNA transposons       2314       176348 bp    0.15 %
   hobo-Activator      689        43072 bp    0.04 %
   Tc1-IS630-Pogo      167        54954 bp    0.05 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac             18         2279 bp    0.00 %
   Tourist/Harbinger   249        12509 bp    0.01 %
   Other (Mirage,       24         1231 bp    0.00 %
    P-element, Transib)

Rolling-circles         77         8371 bp    0.01 %

Unclassified:           51         3907 bp    0.00 %

Total interspersed repeats:      367892 bp    0.32 %

Small RNA:             431       137483 bp    0.12 %

Satellites:            130         7935 bp    0.01 %
Simple repeats:      48930      1869437 bp    1.61 %
Low complexity:       9266       432567 bp    0.37 %
==================================================

The numbers for the DNA transposons do not seem to match. For example, the non-redundant EDTA library contains more DNA elements than RepeatMasker reports, but I would expect the opposite, since RepeatMasker should count every occurrence of each element in the genome. Or am I missing something?
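One way to check the per-class occurrences independently of the summary table is to tally the RepeatMasker `.out` file directly, using whatever class/family labels the library supplied. The sketch below assumes the standard whitespace-separated `.out` layout (query begin and end in columns 6 and 7, class/family in column 11); the rows in the example are made up. It counts individual hits, so fragmented or overlapping annotations are each counted and the numbers will not necessarily equal the "elements" column of the summary.

```python
from collections import defaultdict

def tally_out(lines):
    """Sum hits and masked bp per class/family from RepeatMasker .out lines."""
    totals = defaultdict(lambda: [0, 0])  # class/family -> [hits, bp]
    for line in lines:
        fields = line.split()
        # Data rows start with an integer score; header and blank lines do not
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        begin, end = int(fields[5]), int(fields[6])
        entry = totals[fields[10]]
        entry[0] += 1
        entry[1] += end - begin + 1
    return totals

# Made-up rows in .out layout, for illustration only
example_rows = [
    "   SW   perc perc perc  query     position in query    matching  repeat",
    "score   div. del. ins.  sequence  begin end   (left)   repeat    class/family",
    "",
    " 1234  5.0  0.0  0.0  contig_1  100  299  (9701)  +  TE_00000001  DNA/DTA  1  200  (0)  1",
    "  800  8.1  1.0  0.5  contig_1  500  549  (9451)  C  TE_00000002  DNA/DTA  (10)  50  1  2",
    "  950  3.2  0.0  0.0  contig_2   10  109   (900)  +  TE_00000003  LTR/Gypsy  1  100  (0)  3",
]
example_totals = tally_out(example_rows)
for class_family, (hits, bp) in sorted(example_totals.items()):
    print(class_family, hits, bp)
```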

oushujun commented 5 years ago

The discrepancy is due to RepeatMasker's default assumption that this is a human genome. Please see the other RepeatMasker-related issues in this repo; some of them include solutions.

On Sat, Sep 21, 2019, 6:33 AM aderzelle wrote: (original post quoted in full; see above)

oushujun commented 5 years ago

Solutions can be found in #8.
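As a reading aid for the nomenclature question in the title, the three-letter codes in the library (DTA, DTC, and so on) are the Wicker et al. superfamily codes, while the RepeatMasker summary table uses its own row names. The sketch below only translates between the two so the lists can be compared side by side. It is not the fix described in #8, and the pairing with summary-table rows is an assumption based on the usual superfamily synonyms.

```python
# Wicker three-letter code -> (superfamily name, assumed RepeatMasker summary row)
WICKER_CODES = {
    "DTA": ("hAT", "hobo-Activator"),
    "DTC": ("CACTA", "En-Spm"),
    "DTH": ("PIF-Harbinger", "Tourist/Harbinger"),
    "DTM": ("Mutator", "MuDR-IS905"),
    "DTT": ("Tc1-Mariner", "Tc1-IS630-Pogo"),
}

def describe(classification):
    """Expand e.g. 'DNA/DTT' to 'DNA/DTT (Tc1-Mariner)'; leave other labels as-is."""
    _, _, code = classification.partition("/")
    if code in WICKER_CODES:
        return f"{classification} ({WICKER_CODES[code][0]})"
    return classification

# Library counts as posted at the top of this thread
library_counts = {"DNA/DTA": 52, "DNA/DTC": 50, "DNA/DTH": 476,
                  "DNA/DTM": 654, "DNA/DTT": 2722, "LTR/Gypsy": 38}
for label, n in library_counts.items():
    print(describe(label), n)
```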