Closed Tkastylevsky closed 4 years ago
Dear Timothee,
Thanks for the information. You need to do your own curation in this case. From the two annotations, the major difference is the amount of DTC, DTM and L2 elements. You can check a couple of DTC and DTM candidates as well as some of the L2 elements.
You can also rerun EDTA with --step anno --anno 1 --evaluate 1 and get the annotation consistency report. You may also want to compare the RepeatModeler library with the same evaluation strategy. I made the code available for you: /util/evaluate.pl. Please update to v1.8.1 and give it a try.
Best, Shujun
Hey Shujun,
I ran EDTA with --sensitive 1 --anno 1 --evaluate 1 --species others --step all over a few fungal genomes and an insect species, and I got no nonLTR elements.
These are the summaries of some of the fungal genomes:
Drechmeria sp
Repeat Classes
==============
Total Sequences: 18
Total Length: 32193095 bp
Class Count bpMasked %masked
===== ===== ======== =======
DNA -- -- --
DTA 6 9527 0.03%
DTC 76 71867 0.22%
DTM 985 583110 1.81%
DTT 167 100394 0.31%
Helitron 22 53063 0.16%
LTR -- -- --
Copia 874 587266 1.82%
Gypsy 291 212725 0.66%
unknown 1549 1206621 3.75%
MITE -- -- --
DTC 2 401 0.00%
DTM 104 40727 0.13%
DTT 15 2003 0.01%
---------------------------------
total interspersed 4091 2867704 8.91%
---------------------------------------------------------
Total 4091 2867704 8.91%
Hirsutella sp
Repeat Classes
==============
Total Sequences: 666
Total Length: 49866516 bp
Class Count bpMasked %masked
===== ===== ======== =======
DNA -- -- --
DTA 634 359005 0.72%
DTC 580 214154 0.43%
DTH 780 425971 0.85%
DTM 3249 1740285 3.49%
DTT 94 42968 0.09%
Helitron 862 327494 0.66%
LTR -- -- --
Copia 1219 677988 1.36%
Gypsy 2630 1166054 2.34%
unknown 871 320474 0.64%
MITE -- -- --
DTA 5 564 0.00%
DTC 83 18513 0.04%
DTH 36 10855 0.02%
DTM 280 41327 0.08%
---------------------------------
total interspersed 11323 5345652 10.72%
---------------------------------------------------------
Total 11323 5345652 10.72%
Purpureocillium sp
==============
Total Sequences: 51
Total Length: 36205685 bp
Class Count bpMasked %masked
===== ===== ======== =======
DNA -- -- --
DTA 6 9359 0.03%
DTC 33 104602 0.29%
DTH 1 3356 0.01%
DTM 43 114082 0.32%
DTT 1 1907 0.01%
Helitron 3 35300 0.10%
LTR -- -- --
Copia 2 5059 0.01%
MITE -- -- --
DTA 1 442 0.00%
DTC 6 2505 0.01%
DTM 19 6602 0.02%
---------------------------------
total interspersed 115 283214 0.78%
---------------------------------------------------------
Total 115 283214 0.78%
And this is for the insect genome (Sitophilus oryzae)
Repeat Classes
==============
Total Sequences: 2015
Total Length: 757919329 bp
Class Count bpMasked %masked
===== ===== ======== =======
DNA -- -- --
DTA 579504 171226892 22.59%
DTC 111844 23363926 3.08%
DTH 47082 14477368 1.91%
DTM 370873 85792038 11.32%
DTT 11517 4012089 0.53%
Helitron 123027 28213518 3.72%
LTR -- -- --
Copia 2226 1076135 0.14%
Gypsy 69855 33203730 4.38%
unknown 318040 100348223 13.24%
MITE -- -- --
DTA 35082 5821314 0.77%
DTC 7302 1081575 0.14%
DTH 1232 230451 0.03%
DTM 23103 4626935 0.61%
DTT 1364 184520 0.02%
---------------------------------
total interspersed 1702051 473658714 62.49%
---------------------------------------------------------
Total 1702051 473658714 62.49%
Is there any recommendation for fixing this? Cheers, Luis Alfonso.
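A quick way to check many summaries for the problem described above is to scan each table for the LINE and SINE headings. The following is only a minimal sketch: it assumes each row has the four whitespace-separated columns shown in the tables quoted in this post, and the function names are made up for illustration.

```python
# Sketch: list which expected nonLTR headings are absent from a summary table.
# Assumption: class rows look like "<name> <count> <bpMasked> <%masked>",
# as in the tables quoted in this thread.

def row_names(tbl_text):
    """Collect the first field of every class/superfamily row in the table."""
    names = []
    for line in tbl_text.splitlines():
        fields = line.split()
        # Class rows have exactly four fields; skip the grand-total row.
        if len(fields) != 4 or fields[0] == "Total":
            continue
        # Keep heading rows ("--") and data rows (percentage in the last column).
        if fields[3] == "--" or fields[3].endswith("%"):
            names.append(fields[0])
    return names


def missing_classes(tbl_text, expected=("LINE", "SINE")):
    """Return the expected class headings that never appear in the table."""
    present = set(row_names(tbl_text))
    return [name for name in expected if name not in present]


# Excerpt of the Drechmeria summary from this post.
DRECHMERIA = """\
Class Count bpMasked %masked
===== ===== ======== =======
DNA -- -- --
DTA 6 9527 0.03%
LTR -- -- --
Copia 874 587266 1.82%
MITE -- -- --
DTM 104 40727 0.13%
total interspersed 4091 2867704 8.91%
Total 4091 2867704 8.91%
"""

if __name__ == "__main__":
    print(missing_classes(DRECHMERIA))
```

For the excerpt above, both headings are reported as missing, which matches what the full tables show.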
Hi Luis,
So far I have not identified an approach to confidently annotate nonLTR elements automatically. You may try out different structure-based methods and manually curate the results, then provide them to EDTA with --curatedlib as known TEs.
The --sensitive 1 option uses RepeatModeler to identify remaining repetitive sequences with some classification; it is not at all dedicated to nonLTR identification and should have a very high false-negative rate.
Hope this helps.
Best, Shujun
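As an illustration of the suggestion above, a run that passes a curated nonLTR library could be assembled like this. This is a sketch under assumptions: the file names are hypothetical, the script is assumed to be callable as EDTA.pl, and the --genome option is not mentioned in this thread. The remaining flags are the ones quoted earlier.

```python
import shlex


def edta_command(genome, curated_lib, species="others"):
    """Build an EDTA argument list that supplies a curated library.

    Assumptions: EDTA is invoked as EDTA.pl and takes the genome via
    --genome; genome and curated_lib are hypothetical file names.
    """
    return [
        "EDTA.pl",
        "--genome", genome,
        "--species", species,
        "--step", "all",
        "--curatedlib", curated_lib,  # manually curated nonLTR sequences
        "--anno", "1",
        "--evaluate", "1",
    ]


if __name__ == "__main__":
    cmd = edta_command("genome.fa", "curated_nonLTR.fa")
    # Print a copy-pasteable shell line instead of running anything.
    print(shlex.join(cmd))
```

Printing the command instead of executing it makes it easy to review the flags before starting a long run.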
Hello Shujun, I work with Luis, and I don't understand why no LINEs are picked up by RM? Is there a way we could change the code so RM gets the LINE+SINE annotations as it would do normally? Thanks Rita
Hi Rita,
I am very sorry for the delayed reply. RM is serving as a TE scavenger here, trying to pick up unannotated repetitive sequences after the annotation using the EDTA-filtered library. It's likely that SINEs and LINEs are in low abundance and nested in other TEs, so that their representation after the first-pass annotation is reduced below the detection limit of RM. It is also likely that the SINE and LINE detection module of RM is not as sensitive as you are expecting. Again, RM is not designed for dedicated nonLTR annotation.
As suggested previously, you may annotate nonLTRs using other methods, manually curate them, and provide the manually curated nonLTR library to EDTA.
Best, Shujun
I got the same issues.
There are too many disparities with RepeatModeler/RepeatMasker, and I don't detect any nonLTR elements with EDTA.
Looks like there are some bugs with RepeatModeler. I will look into it further.
Shujun
Hi, @oushujun @Tkastylevsky @OnlyHigh @RRebo @ptranvan
The issue is the conda RepeatMasker: it needs a library to classify consensi.fa. I used manually installed RepeatModeler (version 1.0.11) and RepeatMasker (version 4.0.9) to replace the conda ones.
consensi.fa is classified by RepeatClassifier, which generates consensi.fa.classified, and rename_RM_TE.pl renames the sequences according to consensi.fa.classified. If you check genome.fa.mod.RepeatModeler.raw.fa, it should be empty if you used the conda version of RepeatModeler. The conda version of RepeatClassifier throws an error:
# Error of conda RepeatClassifier
RepeatClassifier Version 2.0.1
======================================
Search Engine = rmblast
- Looking for Simple and Low Complexity sequences..
- Looking for similarity to known repeat proteins..
- Looking for similarity to known repeat consensi..
Missing /data/software/conda_envs/EDTA/share/RepeatMasker/Libraries/RepeatMasker.lib.nsq!
Please rerun the configure program in the RepeatModeler directory
before running this script.
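To see whether an installation is affected by the failure shown in the log above, one could check for the two symptoms directly: the missing RepeatMasker.lib.nsq file and an empty RepeatModeler raw library. This is a minimal sketch: the function name is made up, and both paths must be adapted to the local install and run directory.

```python
from pathlib import Path


def diagnose(repeatmasker_lib_dir, raw_fasta):
    """Return a list of human-readable problems; an empty list means none found.

    repeatmasker_lib_dir: the RepeatMasker "Libraries" directory of the install.
    raw_fasta: the *.RepeatModeler.raw.fa file written in the EDTA run folder.
    """
    problems = []

    # The RepeatClassifier error above complains about exactly this file.
    index = Path(repeatmasker_lib_dir) / "RepeatMasker.lib.nsq"
    if not index.is_file():
        problems.append(f"missing library index: {index}")

    # An absent or zero-byte raw library means no classified families got through.
    raw = Path(raw_fasta)
    if not raw.is_file() or raw.stat().st_size == 0:
        problems.append(f"empty or missing RepeatModeler library: {raw}")

    return problems


if __name__ == "__main__":
    for problem in diagnose(
        "/data/software/conda_envs/EDTA/share/RepeatMasker/Libraries",
        "genome.fa.mod.RepeatModeler.raw.fa",
    ):
        print(problem)
```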
# After replacing RepeatModeler and RepeatMasker, the tbl shows the nonLTR elements
Repeat Classes
==============
Total Sequences: 15
Total Length: 465993302 bp
Class Count bpMasked %masked
===== ===== ======== =======
DNA 449 63335 0.01%
Academ-H 331 56230 0.01%
CMC-EnSpm 2047 566318 0.12%
CMC-Transib 1350 57141 0.01%
DTA 67007 11386924 2.44%
DTC 103511 16836805 3.61%
DTH 23006 3521116 0.76%
DTM 170985 27421409 5.88%
DTT 39504 5397629 1.16%
Helitron 206020 43936749 9.43%
IS3EU 152 9504 0.00%
MULE-MuDR 2833 989555 0.21%
Maverick 459 136535 0.03%
Sola-3 2 130 0.00%
TcMar-Tc1 46 12764 0.00%
LINE -- -- --
I-Jockey 22 10552 0.00%
L1 2241 904672 0.19%
L1-Tx1 172 44425 0.01%
L2 289 59534 0.01%
Penelope 45 21823 0.00%
R1 27 38794 0.01%
RTE-BovB 57 29446 0.01%
RTE-X 70 10261 0.00%
LTR -- -- --
Copia 63804 29281157 6.28%
Gypsy 75449 25996361 5.58%
Unknown 5668 1417264 0.30%
unknown 160676 45748080 9.82%
MITE -- -- --
DTA 16500 2172299 0.47%
DTC 3021 305422 0.07%
DTH 12724 1893596 0.41%
DTM 36734 5512094 1.18%
DTT 2764 273700 0.06%
SINE 1173 252833 0.05%
Unknown 68810 14544934 3.12%
---------------------------------
total interspersed 1067948 238909391 51.27%
---------------------------------------------------------
Total 1067948 238909391 51.27%
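Before rerunning EDTA, it can help to confirm that consensi.fa.classified really contains LINE and SINE families. The sketch below tallies the top-level class of each FASTA header. It assumes RepeatModeler-style headers of the form ">name#Class/Family"; the example records are invented for illustration.

```python
from collections import Counter


def class_counts(fasta_text):
    """Count sequences per top-level class in '>name#Class/Family' headers."""
    counts = Counter()
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue
        # The first whitespace-delimited token holds "name#Class/Family".
        tokens = line[1:].split()
        header = tokens[0] if tokens else ""
        _, sep, classification = header.partition("#")
        # Headers without '#' were never classified.
        top_level = classification.split("/")[0] if sep else "unclassified"
        counts[top_level] += 1
    return counts


# Invented example records, not taken from any real run.
EXAMPLE = """\
>rnd-1_family-0#LINE/L1 ( RepeatScout Family Size = 10 )
ACGTACGT
>rnd-1_family-1#LTR/Gypsy
ACGTACGT
>rnd-1_family-2#Unknown
ACGTACGT
>rnd-1_family-3
ACGTACGT
"""

if __name__ == "__main__":
    for name, n in sorted(class_counts(EXAMPLE).items()):
        print(name, n)
```

If the LINE and SINE counts are zero, or every header falls under "unclassified", the classification step most likely failed before EDTA ever saw the library.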
Thanks for looking at it. I installed EDTA with this command:
conda install -c bioconda -c conda-forge edta
Do you know of an easy way to replace the conda RepeatModeler with another version?
I saw there is a manual command:
conda install -n EDTA -y cd-hit repeatmodeler muscle mdust blast openjdk perl perl-text-soundex multiprocess regex tensorflow=1.14.0 keras=2.2.4 scikit-learn=0.19.0 biopython pandas glob2 python=3.6 tesorter genericrepeatfinder genometools-genometools ltr_retriever ltr_finder numpy=1.16.4
But I don't see RepeatMasker. Is your command up to date? And if so, I just need to run this command without repeatmodeler, right?
@ptranvan I am working on this bug and will have it fixed in the next update.
Shujun
I didn't include any lib (except the transposable element protein database, which is in the repo by default) in the RepeatMasker recipe (RepeatMasker is used by RepeatModeler to classify the detected repeats). In theory you can choose to pay for a Repbase licence or go for a free solution like Dfam. I explained it here: https://github.com/bioconda/bioconda-recipes/issues/9988#issuecomment-565410213
I could include the Dfam one by default in an updated version of the recipe.
@Juke34 I have worked around the classification of RM2 results in EDTA using TEsorter. It will be reflected in the next update. Outside of EDTA I think including some sort of classification scheme would benefit the end-user.
@Tkastylevsky Please update EDTA and rerun it with the --anno 1 --step anno parameters.
@oushujun can we update edta using the conda command ?
conda install -c bioconda -c conda-forge edta
Not for now, because the new update has not been pushed to conda yet. You can git clone the current repository and activate the conda environment. By specifying the path to the cloned EDTA or exporting it to $ENV, the new EDTA is ready to go.
Repeat Classes
==============
Total Sequences: 1
Total Length: 197097634 bp
Class Count bpMasked %masked
===== ===== ======== =======
DNA -- -- --
DTA 2154 403212 0.20%
DTC 35980 8736131 4.43%
DTH 3188 1153531 0.59%
DTM 15882 4704588 2.39%
DTT 4208 763646 0.39%
Helitron 798 302503 0.15%
LINE -- -- --
unknown 1210 464085 0.24%
LTR -- -- --
Gypsy 2374 1294633 0.66%
unknown 9554 3965695 2.01%
MITE -- -- --
DTA 4 1314 0.00%
DTC 124 25886 0.01%
DTH 125 12684 0.01%
DTM 81 20091 0.01%
DTT 6 1783 0.00%
TIR -- -- --
hAT 60 17222 0.01%
Unknown 7671 2299478 1.17%
---------------------------------
total interspersed 83419 24166482 12.26%
---------------------------------------------------------
Total 83419 24166482 12.26%
Here is the result: a few LINEs were detected, but it is still very odd that so few are picked up by the analysis. In my other annotations, LINEs can cover as much as 8% of this chromosome by themselves.
@Tkastylevsky Without manual curation it's difficult to say whether 0.24% is due to false-negative or 8% is due to false positive. More likely it's both.
The conda EDTA has also been updated to v1.8.3. I will mark this issue solved. Feel free to reopen if necessary.
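To put the two figures discussed above side by side, the percentages can be converted to base pairs using the chromosome length and the LINE row from the summary table. This is plain arithmetic on numbers quoted in this thread; the 8% value is the figure reported for the other annotations, not a measured result.

```python
# Values taken from the summary table and comments above.
CHR_LENGTH_BP = 197_097_634   # "Total Length" of the annotated chromosome
LINE_BP_EDTA = 464_085        # bpMasked in the LINE/unknown row
OTHER_LINE_FRACTION = 0.08    # LINE coverage reported for other annotations

# Percentage of the chromosome that EDTA labels as LINE.
edta_pct = 100 * LINE_BP_EDTA / CHR_LENGTH_BP

# Base pairs implied by 8% coverage, and the fold difference between the two.
other_bp = OTHER_LINE_FRACTION * CHR_LENGTH_BP
fold = other_bp / LINE_BP_EDTA

if __name__ == "__main__":
    print(f"EDTA LINE coverage: {edta_pct:.2f}% ({LINE_BP_EDTA} bp)")
    print(f"8% coverage would be about {other_bp:,.0f} bp")
    print(f"difference: about {fold:.0f}-fold")
```

The gap is roughly 34-fold, about 15.8 Mbp versus 0.46 Mbp, so curating even a small sample of candidates from each annotation should show which side is further off.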
Hello, it's me again! EDTA finished its run on my chicken chr1. However, I ran into a bit of an issue: no nonLTR elements were detected by the analysis, even when I used the --sensitive 1 setting (and I checked: I have a RepeatModeler folder with 6 rounds of library construction in my EDTA run folder). However, avian genomes are known to have been invaded by LINE CR1 elements in particular:
As I am testing several annotation methods, I have used RepeatModeler alone on this chromosome, and this is what I got after repeatmasking (I'm fusing the RepeatModeler database obtained with Dfam):
So, why do you think there is such a huge difference between the two annotations? Am I looking at the right files for the results of EDTA?
PS: the code I used to run EDTA after the EDTA_raw.pl code:
Timothee
EDIT : (I removed some of the results from EDTA for readability purposes since this thread is getting attention)