oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

quality of rice annotation #292

Closed dcopetti closed 2 years ago

dcopetti commented 2 years ago

Hello,

I would like to annotate a gapless rice assembly, so I ran EDTA with this command:

EDTA.pl --genome genome.fa --species Rice --cds CDS.fa --curatedlib ../RepBase22.03.fasta/oryrep.ref --sensitive 1 --anno 1 --evaluate 1 -t 30

I added RepBase's curated rice library as a guide for naming and completeness (e.g. for LINEs, see below). The TEanno.sum file looks like this:

Repeat Classes
==============
Total Sequences: 12
Total Length: 391561630 bp
Class                  Count        bpMasked    %masked
=====                  =====        ========     =======
LTR                    --           --           --
    Copia              729          2514604      0.64%
    Gypsy              2901         23647886     6.04%
    unknown            486          1635356      0.42%
TIR                    --           --           --
    CACTA              3350         1704954      0.44%
    Mutator            13103        5810349      1.48%
    PIF_Harbinger      2529         787860       0.20%
    Tc1_Mariner        20670        7411431      1.89%
    hAT                2709         1137500      0.29%
nonLTR                 --           --           --
    LINE_element       100          48766        0.01%
nonTIR                 --           --           --
    helitron           23427        11192032     2.86%
repeat_region          263645       151570963    38.71%
                      ---------------------------------
    total interspersed 333649       207461701    52.98%

---------------------------------------------------------
Total                  333649       207461701    52.98%

I think the total amount of repeats is fair/OK, but I did not expect so many bases to be left unclassified. Is this normal?
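To put the unclassified fraction in numbers, the repeat_region row can be compared against the totals. The short Python sketch below uses only values copied from the TEanno.sum table above; the variable names are illustrative and not part of EDTA.

```python
# Values copied from the TEanno.sum table in this post.
total_length = 391_561_630   # "Total Length" (bp)
total_masked = 207_461_701   # "total interspersed" bpMasked
unclassified = 151_570_963   # "repeat_region" bpMasked

# Bases that did receive a class (LTR, TIR, nonLTR, nonTIR rows).
classified = total_masked - unclassified

share_of_masked = 100 * unclassified / total_masked
share_of_genome = 100 * unclassified / total_length

print(f"classified bases:            {classified} bp")
print(f"unclassified share of masked: {share_of_masked:.2f}%")
print(f"unclassified share of genome: {share_of_genome:.2f}%")
```

In other words, roughly three quarters of the masked bases carry only the generic repeat_region label, which is the symptom the question is about.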

Then, knowing that EDTA is not good at finding LINEs de novo, I thought they could be found by homology from the curated library. Does EDTA use the curated library at all (e.g. with RepeatMasker)? I think that would be a step that can help recover known TEs, maybe just for some categories. When a region found by RepeatMasker is also found by the main EDTA pipeline, the former could be overwritten, for example. I know the tools to find LINEs de novo are tricky.

Lastly, it would be nice to also mask other important repeated sequences such as rDNA and tandem repeats. With complete genomes one can easily get a few representatives of these (with Tandem Repeats Finder?); it would be nice to be able to supply them as input and get that annotation in the same GFF as well. Any plans for that? Thanks,

Dario

oushujun commented 2 years ago

Hi Dario,

Sorry for the delay. Your idea of using existing libraries to supplement EDTA is correct. The large unclassified fraction you are seeing is due to the use of a library whose sequences are not named in the RepeatMasker ID format (even though it is from Repbase). You may use our curated rice library for this purpose: https://github.com/oushujun/EDTA/blob/master/database/rice6.9.5.liban
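A quick way to see whether a library has this problem is to scan its FASTA headers for the name#Class/Subclass form. The Python sketch below is only an approximation of that convention: the regular expression is an assumption, not RepeatMasker's own parser, and the second sample header (SOME_TE) is a hypothetical stand-in for a header without a classification.

```python
import re

# Assumed shape: ">name#Class" or ">name#Class/Subclass", optionally
# followed by whitespace and a free-text description.
HEADER_RE = re.compile(r"^>\S+#[^\s/#]+(/\S+)?(\s.*)?$")

def unclassified_headers(fasta_lines):
    """Return header lines that lack a '#Class/Subclass' suffix."""
    bad = []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">") and not HEADER_RE.match(line):
            bad.append(line)
    return bad

sample = [
    ">Os1304#Centro/tandem",          # header mentioned in this thread
    "ACGTACGT",
    ">SOME_TE\tGypsy\tOryza sativa",  # hypothetical header with no '#'
    "TTGACCAA",
]
bad = unclassified_headers(sample)
print(bad)
```

To check a real file, the same function can be given an open file handle, since it only iterates over lines.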

A bit more about this library: it contains SINEs and LINEs that have been curated over the years by Ning Jiang. It also contains the centromeric repeat named Os1304#Centro/tandem, but it does not have the rDNA sequence.

You can definitely identify the rDNA and other tandem repeats, add them to this library, and supply it to EDTA; that would only make the annotation better. To make your sequences recognizable to EDTA, you need to follow this naming convention: https://github.com/oushujun/EDTA/blob/master/util/TE_Sequence_Ontology.txt. For example, you may name an rDNA sequence >rice_rDNA_spacer#rDNA/spacer. The #rDNA/spacer part (and the like) is required by RepeatMasker to put sequences into the proper classifications.
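As an illustration of the naming convention, here is a minimal Python sketch that builds a header of the form >name#Class/Subclass and formats a FASTA record that could be appended to a library file. The helper names (repeatmasker_header, fasta_record) and the validation rules are assumptions made for this example; they are not EDTA or RepeatMasker functions.

```python
def repeatmasker_header(name, classification):
    """Build '>name#classification', e.g. '>rice_rDNA_spacer#rDNA/spacer'."""
    if not name or "#" in name or any(c.isspace() for c in name):
        raise ValueError(f"invalid sequence name: {name!r}")
    if not classification or "#" in classification or any(
        c.isspace() for c in classification
    ):
        raise ValueError(f"invalid classification: {classification!r}")
    return f">{name}#{classification}"

def fasta_record(name, classification, sequence, width=60):
    """Return one FASTA record with the sequence wrapped to `width` columns."""
    seq = "".join(sequence.split()).upper()  # drop whitespace and newlines
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([repeatmasker_header(name, classification), *lines]) + "\n"

# The name and class come from the example in this reply; the sequence is a
# short placeholder, not real rice rDNA.
record = fasta_record("rice_rDNA_spacer", "rDNA/spacer", "acgtacgtac", width=4)
print(record, end="")
```

The record string can then be appended to a copy of the curated library before passing it to --curatedlib.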

Let me know if you have more questions.

Best, Shujun

dcopetti commented 2 years ago

Thank you Shujun! I will use that curated rice library and add some rDNA then.

oushujun commented 2 years ago

Hi Dario,

I would like to make you aware that #rDNA/spacer is just one part of the full rDNA repeat. There are entries for other rDNA components in EDTA:

rDNA_intergenic_spacer_element  SO:0001860  rDNA_intergenic_spacer_element,rDNA/spacer,rDNA/IGS
2S_rRNA_gene                    SO:0002336  2S_rRNA_gene,rRNA_2S_gene,cytosolic_rRNA_2S_gene,rDNA/2S,2S_rRNA
5S_rRNA_gene                    SO:0002238  5S_rRNA_gene,cytosolic_rRNA_5S_gene,rDNA/5S,5S_rRNA
5_8S_rRNA_gene                  SO:0002240  5_8S_rRNA_gene,cytosolic_rRNA_5_8S_gene,rDNA/5.8S,5.8S_rRNA,rDNA/5_8S,5_8S_rRNA
23S_rRNA_gene                   SO:0002243  23S_rRNA_gene,rDNA/23S,23S_rRNA
25S_rRNA_gene                   SO:0002242  25S_rRNA_gene,rDNA/25S,25S_rRNA
28S_rRNA_gene                   SO:0002239  28S_rRNA_gene,rDNA/28S,28S_rRNA
18S_rRNA_gene                   SO:0002236  18S_rRNA_gene,cytosolic_rRNA_18S_gene,rDNA/18S,18S_rRNA
16S_rRNA_gene                   SO:0002237  16S_rRNA_gene,cytosolic_rRNA_16S_gene,rDNA/16S,16S_rRNA
rRNA_gene                       SO:0002360  rRNA_gene,rDNA/45S
rRNA                            SO:0000252  rRNA
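Each row above reads as a Sequence Ontology name, an SO ID, and a comma-separated list of accepted aliases. The Python sketch below turns such rows into a lookup from alias to SO term, which can help when choosing the #Class/Subclass suffix for a new sequence. It assumes the three columns are separated by whitespace and that lines starting with # are comments; the function name is made up for this example, and only a few of the rows are embedded.

```python
# A few rows copied from the list in this comment.
ONTOLOGY_ROWS = """\
rDNA_intergenic_spacer_element SO:0001860 rDNA_intergenic_spacer_element,rDNA/spacer,rDNA/IGS
5S_rRNA_gene SO:0002238 5S_rRNA_gene,cytosolic_rRNA_5S_gene,rDNA/5S,5S_rRNA
5_8S_rRNA_gene SO:0002240 5_8S_rRNA_gene,cytosolic_rRNA_5_8S_gene,rDNA/5.8S,5.8S_rRNA,rDNA/5_8S,5_8S_rRNA
18S_rRNA_gene SO:0002236 18S_rRNA_gene,cytosolic_rRNA_18S_gene,rDNA/18S,18S_rRNA
rRNA_gene SO:0002360 rRNA_gene,rDNA/45S
"""

def build_alias_lookup(text):
    """Map every alias to its (SO name, SO ID) pair."""
    lookup = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, comments, and rows without exactly three columns.
        if line.startswith("#") or len(fields) != 3:
            continue
        so_name, so_id, aliases = fields
        for alias in aliases.split(","):
            lookup[alias] = (so_name, so_id)
    return lookup

lookup = build_alias_lookup(ONTOLOGY_ROWS)
print(lookup["rDNA/spacer"])
print(lookup["rDNA/45S"])
```

A header suffix such as rDNA/18S can be checked against this lookup before the sequence is added to the library.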

If you can curate some of the rice rDNA sequences, that would be great! And I would love to include them in the rice repeat library. Please let me know.

Best, Shujun

dcopetti commented 2 years ago

Hi Shujun, Sounds good: since we will work on several rice genome assemblies, we would benefit from a comprehensive characterization of all repeats. Soon we will have a researcher working on this, and we will put together a set of representative centromeric and ribosomal sequences to add to the rice library. We will get back to you when we have updates. Cheers, Dario

oushujun commented 2 years ago

That's exciting! Let me know if you need any help. Good luck.

Cheers, Shujun