oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

No nonLTR elements detected by repeatmodeler ? #58

Closed Tkastylevsky closed 4 years ago

Tkastylevsky commented 4 years ago

Hello, it's me again! EDTA finished its run on my chicken chr1, but I ran into an issue: no non-LTR elements were detected by the analysis, even when I used the --sensitive 1 setting (I checked, and my EDTA run folder contains a RepeatModeler folder with 6 rounds of library construction). However, avian genomes are known to have been invaded by LINE CR1 elements in particular:

Repeat Classes
==============
Total Sequences: 1
Total Length: 197097634 bp
Class                  Count        bpMasked    %masked
=====                  =====        ========     =======
DNA                    --           --           --   
    DTA                2180         413133       0.21% 
    DTC                36660        8762086      4.45% 
    DTH                3253         1154128      0.59% 
    DTM                16167        4728318      2.40% 
    DTT                4112         738084       0.37% 
    Helitron           821          302918       0.15% 
LTR                    --           --           --   
    Gypsy              2522         1340228      0.68% 
    unknown            9655         3943204      2.00% 
MITE                   --           --           --   
    DTA                7            1314         0.00% 
    DTC                158          25521        0.01% 
    DTH                131          12684        0.01% 
    DTM                107          19953        0.01% 
    DTT                9            1783         0.00% 
                      ---------------------------------
    total interspersed 75782        21443354     10.88%

---------------------------------------------------------
Total                  75782        21443354     10.88%

As I am testing several annotation methods, I also ran RepeatModeler alone on this chromosome, and this is what I got after RepeatMasking (I am merging the RepeatModeler library with Dfam):

==================================================
file name: chr1.fa                  
sequences:             1
total length:  197608386 bp  (197097786 bp excl N/X-runs)
GC level:         40.39 %
bases masked:   24048722 bp ( 12.17 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements        54270     18652277 bp    9.44 %
   SINEs:             1082       137415 bp    0.07 %
   Penelope            253        18722 bp    0.01 %
   LINEs:            50760     17981073 bp    9.10 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex      50010     17913555 bp    9.07 %
     R1/LOA/Jockey      17         1051 bp    0.00 %
     R2/R4/NeSL          3          175 bp    0.00 %
     RTE/Bov-B          88        23376 bp    0.01 %
     L1/CIN4           317        19155 bp    0.01 %
   LTR elements:      2428       533789 bp    0.27 %
     BEL/Pao           106         7057 bp    0.00 %
     Ty1/Copia           7          380 bp    0.00 %
     Gypsy/DIRS1       300        20536 bp    0.01 %
       Retroviral     1869       492771 bp    0.25 %

DNA transposons       5853       767141 bp    0.39 %
   hobo-Activator     2068       393668 bp    0.20 %
   Tc1-IS630-Pogo      516        79589 bp    0.04 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac             26         1303 bp    0.00 %
   Tourist/Harbinger   463        53674 bp    0.03 %
   Other (Mirage,       54         2441 bp    0.00 %
    P-element, Transib)

Rolling-circles        171        10064 bp    0.01 %

Unclassified:         3069      1028338 bp    0.52 %

Total interspersed repeats:    20447756 bp   10.35 %

Small RNA:             395        52152 bp    0.03 %

Satellites:            242        26450 bp    0.01 %
Simple repeats:      64150      2958427 bp    1.50 %
Low complexity:      10768       593273 bp    0.30 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element

RepeatMasker Combined Database: Dfam_3.1

run with rmblastn version 2.10.0+
The query was compared to classified sequences in ".../lib_gal_chr1_rmsk_combined.fasta"

So, why do you think there is such a huge difference between the two annotations? Am I looking at the right files for the EDTA results?

PS: the script I used to run EDTA after the EDTA_raw.pl step:

#!/bin/bash 
#SBATCH --job-name=EDTA_galgal6_chr1.fa
#SBATCH --partition=normal
#SBATCH --output=/beegfs/data/tkastylevsky/scripts/subscripts/std_out_EDTA_galgal6_chr1.fa
#SBATCH --error=/beegfs/data/tkastylevsky/scripts/subscripts/std_err_EDTA_galgal6_chr1.fa
#SBATCH --cpus-per-task=32
#SBATCH --time=100:00:00
#SBATCH --mem=10G
source /beegfs/home/tkastylevsky/.bashrc
conda activate EDTA
cd /beegfs/data/tkastylevsky/genomes/EDTA/gallus_gallus/chr1/
EDTA.pl --genome galgal6_chr1.fa --anno 1 --sensitive 1 --threads 32

Timothee

EDIT : (I removed some of the results from EDTA for readability purposes since this thread is getting attention)

oushujun commented 4 years ago

Dear Timothee,

Thanks for the information. You need to do your own curation in this case. From the two annotations, the major difference is the amount of DTC, DTM and L2 elements. You can check a couple of DTC and DTM candidates as well as some of the L2 elements.

You can also rerun EDTA with --step anno --anno 1 --evaluate 1 and get the annotation consistency report. You may also want to compare the RepeatModeler library using the same evaluation strategy. I made the code available for you: /util/evaluate.pl. Please update to v1.8.1 and give it a try.

Best, Shujun
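
A minimal sketch of the rerun suggested above, assuming the genome file name and thread count from the original job script. The helper name `edta_rerun_cmd` is hypothetical; it only prints the command so that it can be reviewed before being submitted.

```shell
#!/bin/bash
# Hypothetical helper: print the EDTA rerun command without executing it.
# The flags are the ones quoted in the reply (--step anno --anno 1 --evaluate 1).
edta_rerun_cmd() {
    local genome="$1"
    local threads="${2:-32}"   # default mirrors --cpus-per-task=32 in the job script
    printf 'EDTA.pl --genome %s --step anno --anno 1 --evaluate 1 --threads %s\n' \
        "$genome" "$threads"
}

# Dry run: show the command for the chicken chr1 assembly from the first post.
edta_rerun_cmd galgal6_chr1.fa 32
```

The printed line is meant to be run in the same folder as the original run (the `cd` target in the job script), on the assumption that EDTA picks up its earlier intermediate results from there.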

weedcentipede commented 4 years ago

Hey Shujun, I ran EDTA with --sensitive 1 --anno 1 --evaluate 1 --species others --step all on a few fungal genomes and an insect species, and I got no nonLTR elements.

These are the summaries for some of the fungal genomes:

Drechmeria sp

Repeat Classes
==============
Total Sequences: 18
Total Length: 32193095 bp
Class                  Count        bpMasked    %masked
=====                  =====        ========     =======
DNA                    --           --           --
    DTA                6            9527         0.03%
    DTC                76           71867        0.22%
    DTM                985          583110       1.81%
    DTT                167          100394       0.31%
    Helitron           22           53063        0.16%
LTR                    --           --           --
    Copia              874          587266       1.82%
    Gypsy              291          212725       0.66%
    unknown            1549         1206621      3.75%
MITE                   --           --           --
    DTC                2            401          0.00%
    DTM                104          40727        0.13%
    DTT                15           2003         0.01%
                      ---------------------------------
    total interspersed 4091         2867704      8.91%

---------------------------------------------------------
Total                  4091         2867704      8.91%

Hirsutella sp

Repeat Classes
==============
Total Sequences: 666
Total Length: 49866516 bp
Class                  Count        bpMasked    %masked
=====                  =====        ========     =======
DNA                    --           --           --
    DTA                634          359005       0.72%
    DTC                580          214154       0.43%
    DTH                780          425971       0.85%
    DTM                3249         1740285      3.49%
    DTT                94           42968        0.09%
    Helitron           862          327494       0.66%
LTR                    --           --           --
    Copia              1219         677988       1.36%
    Gypsy              2630         1166054      2.34%
    unknown            871          320474       0.64%
MITE                   --           --           --
    DTA                5            564          0.00%
    DTC                83           18513        0.04%
    DTH                36           10855        0.02%
    DTM                280          41327        0.08%
                      ---------------------------------
    total interspersed 11323        5345652      10.72%

---------------------------------------------------------
Total                  11323        5345652      10.72%

Purpureocillium sp


Repeat Classes
==============
Total Sequences: 51
Total Length: 36205685 bp
Class                  Count        bpMasked    %masked
=====                  =====        ========     =======
DNA                    --           --           --
    DTA                6            9359         0.03%
    DTC                33           104602       0.29%
    DTH                1            3356         0.01%
    DTM                43           114082       0.32%
    DTT                1            1907         0.01%
    Helitron           3            35300        0.10%
LTR                    --           --           --
    Copia              2            5059         0.01%
MITE                   --           --           --
    DTA                1            442          0.00%
    DTC                6            2505         0.01%
    DTM                19           6602         0.02%
                      ---------------------------------
    total interspersed 115          283214       0.78%

---------------------------------------------------------
Total                  115          283214       0.78%

And this is for the insect genome (Sitophilus oryzae)

Repeat Classes
==============
Total Sequences: 2015
Total Length: 757919329 bp
Class                  Count        bpMasked    %masked
=====                  =====        ========     =======
DNA                    --           --           --
    DTA                579504       171226892    22.59%
    DTC                111844       23363926     3.08%
    DTH                47082        14477368     1.91%
    DTM                370873       85792038     11.32%
    DTT                11517        4012089      0.53%
    Helitron           123027       28213518     3.72%
LTR                    --           --           --
    Copia              2226         1076135      0.14%
    Gypsy              69855        33203730     4.38%
    unknown            318040       100348223    13.24%
MITE                   --           --           --
    DTA                35082        5821314      0.77%
    DTC                7302         1081575      0.14%
    DTH                1232         230451       0.03%
    DTM                23103        4626935      0.61%
    DTT                1364         184520       0.02%
                      ---------------------------------
    total interspersed 1702051      473658714    62.49%

---------------------------------------------------------
Total                  1702051      473658714    62.49%

Is there any recommendation for fixing this? Cheers, Luis Alfonso.

oushujun commented 4 years ago

Hi Luis,

So far I have not identified an approach to confidently annotate nonLTR elements automatically. You may try out different structural-based methods and manually curate the results, then provide them to EDTA with --curatedlib as known TEs.

The --sensitive 1 option uses RepeatModeler to identify the remaining repetitive sequences and give them some classification. It is not at all dedicated to nonLTR identification and should have a very high false-negative rate.

Hope this helps.

Best, Shujun
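
One way to prepare such a library, sketched under assumptions: the input file names are placeholders, both helper names are hypothetical, and only the `--curatedlib` flag comes from the reply above. The sketch merges separately curated FASTA files into one library and prints an EDTA command that uses it.

```shell
#!/bin/bash
# Hypothetical helper: concatenate curated FASTA libraries into one file
# and print how many sequences the merged library contains.
# Usage: merge_curated_libs OUT.fa IN1.fa [IN2.fa ...]
merge_curated_libs() {
    local out="$1"
    shift
    # Inputs are assumed to end with a newline; otherwise records would fuse.
    cat "$@" > "$out"
    # grep -c exits non-zero when the count is 0, so ignore its status.
    grep -c '^>' "$out" || true
}

# Hypothetical helper: print (not run) an EDTA command using the merged library.
edta_curated_cmd() {
    local genome="$1" lib="$2"
    printf 'EDTA.pl --genome %s --curatedlib %s --anno 1\n' "$genome" "$lib"
}

# Example with placeholder file names (not executed here):
#   merge_curated_libs curated_nonLTR.fa curated_LINE.fa curated_SINE.fa
#   edta_curated_cmd genome.fa curated_nonLTR.fa
```

The sequence count is printed as a quick sanity check that nothing was lost in the merge. The header naming that EDTA expects for curated sequences is not covered here and should be checked against the EDTA documentation.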

RRebo commented 4 years ago

Hello Shujun, I work with Luis, and I don't understand why no LINEs are picked up by RM. Is there a way we could change the code so that RM produces the LINE and SINE annotations as it normally would? Thanks, Rita

oushujun commented 4 years ago

Hi Rita,

I am very sorry for the delayed reply. RM serves as a TE scavenger here, trying to pick up unannotated repetitive sequences after the annotation using the EDTA-filter library. It's likely that SINEs and LINEs are in low abundance and nested in other TEs, so that their representation after the first-pass annotation is reduced below the detection limit of RM. It is also likely that the SINE and LINE detection module of RM is not as sensitive as you expect. Again, RM is not designed for dedicated nonLTR annotation.

As suggested previously, you may annotate nonLTRs using other methods, manually curate them, and provide the manually curated nonLTR library to EDTA.

Best, Shujun

ptranvan commented 4 years ago

I have the same issue.

There are too many disparities with RepeatModeler/RepeatMasker, and I don't detect any non-LTR elements with EDTA.

oushujun commented 4 years ago

Looks like there are some bugs with RepeatModeler. I will look into it further.

Shujun

baozg commented 4 years ago

Hi, @oushujun @Tkastylevsky @OnlyHigh @RRebo @ptranvan

The issue is the RepeatMasker in conda: it needs a library to classify consensi.fa. I used a manually installed RepeatModeler (version 1.0.11) and RepeatMasker (version 4.0.9) to replace the conda ones.

consensi.fa is classified by RepeatClassifier, which generates consensi.fa.classified, and rename_RM_TE.pl renames the sequences according to consensi.fa.classified. If you use the conda version of RepeatModeler and check genome.fa.mod.RepeatModeler.raw.fa, it should be empty. The conda version of RepeatClassifier throws an error:

# Error from the conda RepeatClassifier
RepeatClassifier Version 2.0.1
======================================
Search Engine = rmblast
  - Looking for Simple and Low Complexity sequences..
  - Looking for similarity to known repeat proteins..
  - Looking for similarity to known repeat consensi..
Missing /data/software/conda_envs/EDTA/share/RepeatMasker/Libraries/RepeatMasker.lib.nsq!
Please rerun the configure program in the RepeatModeler directory
before running this script.
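
The two symptoms described above can be checked quickly. This is a sketch under assumptions: both helper names are hypothetical, and the library path in the usage comment assumes the layout shown in the error message, with `$CONDA_PREFIX` pointing at the active EDTA environment.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the file exists and is non-empty.
is_nonempty() {
    [ -s "$1" ]
}

# Hypothetical helper: count FASTA records in a file.
# Prints 0 for a missing or empty file.
count_fasta_records() {
    if [ -s "$1" ]; then
        # grep -c exits non-zero when nothing matches, so ignore its status.
        grep -c '^>' "$1" || true
    else
        echo 0
    fi
}

# Example checks (paths are assumptions, not executed here):
#   count_fasta_records genome.fa.mod.RepeatModeler.raw.fa
#   is_nonempty "$CONDA_PREFIX/share/RepeatMasker/Libraries/RepeatMasker.lib.nsq" \
#       || echo "RepeatMasker library database is missing"
```

A record count of 0 for the raw RepeatModeler file, together with a missing `RepeatMasker.lib.nsq`, would match the failure reported here.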

# After replacing RepeatModeler and RepeatMasker, the tbl shows the nonLTR elements
Repeat Classes
==============
Total Sequences: 15
Total Length: 465993302 bp
Class                  Count        bpMasked    %masked
=====                  =====        ========     =======
DNA                    449          63335        0.01%
    Academ-H           331          56230        0.01%
    CMC-EnSpm          2047         566318       0.12%
    CMC-Transib        1350         57141        0.01%
    DTA                67007        11386924     2.44%
    DTC                103511       16836805     3.61%
    DTH                23006        3521116      0.76%
    DTM                170985       27421409     5.88%
    DTT                39504        5397629      1.16%
    Helitron           206020       43936749     9.43%
    IS3EU              152          9504         0.00%
    MULE-MuDR          2833         989555       0.21%
    Maverick           459          136535       0.03%
    Sola-3             2            130          0.00%
    TcMar-Tc1          46           12764        0.00%
LINE                   --           --           --
    I-Jockey           22           10552        0.00%
    L1                 2241         904672       0.19%
    L1-Tx1             172          44425        0.01%
    L2                 289          59534        0.01%
    Penelope           45           21823        0.00%
    R1                 27           38794        0.01%
    RTE-BovB           57           29446        0.01%
    RTE-X              70           10261        0.00%
LTR                    --           --           --
    Copia              63804        29281157     6.28%
    Gypsy              75449        25996361     5.58%
    Unknown            5668         1417264      0.30%
    unknown            160676       45748080     9.82%
MITE                   --           --           --
    DTA                16500        2172299      0.47%
    DTC                3021         305422       0.07%
    DTH                12724        1893596      0.41%
    DTM                36734        5512094      1.18%
    DTT                2764         273700       0.06%
SINE                   1173         252833       0.05%
Unknown                68810        14544934     3.12%
                      ---------------------------------
    total interspersed 1067948      238909391    51.27%

---------------------------------------------------------
Total                  1067948      238909391    51.27%

ptranvan commented 4 years ago

Thanks for looking into it. I installed EDTA with this command:

conda install -c bioconda -c conda-forge edta

Do you know of an easy way to replace the conda RepeatModeler with another version?

I saw there is a manual command:

conda install -n EDTA -y cd-hit repeatmodeler muscle mdust blast openjdk perl perl-text-soundex multiprocess regex tensorflow=1.14.0 keras=2.2.4 scikit-learn=0.19.0 biopython pandas glob2 python=3.6 tesorter genericrepeatfinder genometools-genometools ltr_retriever ltr_finder numpy=1.16.4

But I don't see RepeatMasker in it. Is your command up to date? And if so, do I just need to run this command without repeatmodeler?

oushujun commented 4 years ago

@ptranvan I am working on this bug and will have it fixed in the next update.

Shujun

Juke34 commented 4 years ago

I didn't include any lib (except the Transposable element protein database, which is in the repo by default) in the RepeatMasker recipe (RepeatMasker is used by RepeatModeler to classify the detected repeats). In theory you can choose to pay for a Repbase licence or go for a free solution like Dfam. I explained it here: https://github.com/bioconda/bioconda-recipes/issues/9988#issuecomment-565410213

I could include the Dfam one by default in an updated version of the recipe.

oushujun commented 4 years ago

@Juke34 I have worked around the classification of RM2 results in EDTA using TEsorter. It will be reflected in the next update. Outside of EDTA I think including some sort of classification scheme would benefit the end-user.

oushujun commented 4 years ago

@Tkastylevsky Please update EDTA and rerun it with the --anno 1 --step anno parameters.

ptranvan commented 4 years ago

@oushujun can we update edta using the conda command ?

conda install -c bioconda -c conda-forge edta

oushujun commented 4 years ago

Not for now, because the new update has not been pushed to conda yet. You can git clone the current repository and activate the conda environment. By specifying the path to the cloned EDTA, or exporting it to your environment, the new EDTA is ready to go.
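
A sketch of that workflow, with assumptions labeled: the clone URL is inferred from the repository name of this thread, the conda environment is assumed to be called EDTA as in the job script above, and `edta_script_path` is a hypothetical helper that only verifies the clone contains EDTA.pl.

```shell
#!/bin/bash
# Hypothetical helper: print the path to EDTA.pl inside a cloned repository,
# or fail with a message if the script is not there.
edta_script_path() {
    local clone_dir="$1"
    if [ -f "$clone_dir/EDTA.pl" ]; then
        printf '%s\n' "$clone_dir/EDTA.pl"
    else
        echo "EDTA.pl not found in $clone_dir" >&2
        return 1
    fi
}

# Typical sequence (not executed here; URL and environment name are assumptions):
#   git clone https://github.com/oushujun/EDTA.git
#   conda activate EDTA
#   perl "$(edta_script_path ./EDTA)" --genome galgal6_chr1.fa --step anno --anno 1 --threads 32
```

Calling the cloned script by its full path keeps the conda-installed copy untouched, so the two versions can be compared on the same genome.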

Tkastylevsky commented 4 years ago

Repeat Classes
==============
Total Sequences: 1
Total Length: 197097634 bp
Class                  Count        bpMasked    %masked
=====                  =====        ========     =======
DNA                    --           --           --   
    DTA                2154         403212       0.20% 
    DTC                35980        8736131      4.43% 
    DTH                3188         1153531      0.59% 
    DTM                15882        4704588      2.39% 
    DTT                4208         763646       0.39% 
    Helitron           798          302503       0.15% 
LINE                   --           --           --   
    unknown            1210         464085       0.24% 
LTR                    --           --           --   
    Gypsy              2374         1294633      0.66% 
    unknown            9554         3965695      2.01% 
MITE                   --           --           --   
    DTA                4            1314         0.00% 
    DTC                124          25886        0.01% 
    DTH                125          12684        0.01% 
    DTM                81           20091        0.01% 
    DTT                6            1783         0.00% 
TIR                    --           --           --   
    hAT                60           17222        0.01% 
Unknown                7671         2299478      1.17% 
                      ---------------------------------
    total interspersed 83419        24166482     12.26%

---------------------------------------------------------
Total                  83419        24166482     12.26%

Here is the result: a few LINEs were detected, but it is still very odd that so few are picked up by the analysis. In my other annotations, LINEs can cover as much as 8% of this chromosome by themselves.
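
To put such figures side by side, per-class totals can be pulled out of a summary table like the ones posted in this thread. This is a minimal sketch: `class_bp_masked` is a hypothetical helper, and it assumes the table has been saved to a plain text file with the layout shown above (class names unindented, superfamily rows indented, four columns per row).

```shell
#!/bin/bash
# Hypothetical helper: sum the bpMasked column for one class in a
# "Repeat Classes" summary table like the ones posted in this thread.
# Usage: class_bp_masked CLASS SUMMARY_FILE
class_bp_masked() {
    awk -v cls="$1" '
        # Separator lines (dashes or equals signs) end the current class.
        /^[ =-]+$/ { in_cls = 0; next }
        # Unindented line: a class header, with or without its own counts.
        /^[A-Za-z]/ {
            in_cls = ($1 == cls)
            if (in_cls && NF == 4 && $3 ~ /^[0-9]+$/) sum += $3
            next
        }
        # Indented line: a superfamily row belonging to the current class.
        in_cls && /^ +[A-Za-z]/ && NF == 4 && $3 ~ /^[0-9]+$/ { sum += $3 }
        END { print sum + 0 }
    ' "$2"
}

# Example (placeholder file name, not executed here):
#   class_bp_masked LINE edta_summary.txt
```

Applied to the table just above, the LINE class sums to 464085 bp, which is the 0.24% figure. Class names are matched case-sensitively, so the top-level `Unknown` class and the `unknown` rows under LTR stay separate.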

oushujun commented 4 years ago

@Tkastylevsky Without manual curation, it's difficult to say whether the 0.24% is due to false negatives or the 8% is due to false positives. More likely it's both.

oushujun commented 4 years ago

The conda EDTA has also been updated to v1.8.3. I will mark this issue as solved. Feel free to reopen if necessary.