Closed dcopetti closed 2 years ago
Hi Dario,
Sorry for the delay. Your idea of using existing libraries to supplement EDTA is correct. The heavily unclassified content you are seeing is due to the use of a library that is not properly named in the RepeatMasker ID format (even though it is from Repbase). You may use our curated rice library for this purpose: https://github.com/oushujun/EDTA/blob/master/database/rice6.9.5.liban
A bit more about this library. It contains SINEs and LINEs that have been curated over the years by Ning Jiang. It also contains the centromeric repeat named Os1304#Centro/tandem, but it does not have the rDNA sequence.
You can definitely identify the rDNA and other tandem repeats, add them to this library, and supply it to EDTA. That would only make the annotation better. To make your sequences recognizable to EDTA, you need to follow this naming convention (https://github.com/oushujun/EDTA/blob/master/util/TE_Sequence_Ontology.txt). For example, you may name an rDNA sequence like: >rice_rDNA_spacer#rDNA/spacer. The part after the sequence name (#rDNA/spacer and the like) is required by RepeatMasker to put sequences into the proper classifications.
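As a rough illustration of the naming convention, here is a minimal Python sketch that rewrites FASTA headers into the name#Class/Subclass form. The function names (make_header, relabel_fasta) and the validation rules are hypothetical helpers written for this example and are not part of EDTA; the only things taken from this thread are the header layout and the rDNA/spacer label.

```python
import re


def make_header(name: str, classification: str) -> str:
    """Build a FASTA header of the form >name#Class/Subclass.

    The checks below are assumptions made for this sketch: the name must not
    contain whitespace or '#', and the classification is one or two
    '/'-separated tokens (e.g. rDNA/spacer or Centro/tandem).
    """
    if not re.fullmatch(r"[^\s#]+", name):
        raise ValueError(f"invalid sequence name: {name!r}")
    if not re.fullmatch(r"[^\s#/]+(/[^\s#/]+)?", classification):
        raise ValueError(f"invalid classification: {classification!r}")
    return f">{name}#{classification}"


def relabel_fasta(fasta_text: str, classification: str) -> str:
    """Relabel every record in a FASTA string with the given classification.

    Any description after the first whitespace and any existing '#...'
    suffix on the header is dropped; sequence lines are kept as they are.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if not tokens:
                raise ValueError("FASTA header without a sequence name")
            name = tokens[0].split("#")[0]
            out.append(make_header(name, classification))
        elif line.strip():
            out.append(line.strip())
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    raw = ">rice_rDNA_spacer curated by hand\nACGTACGT\n"
    print(relabel_fasta(raw, "rDNA/spacer"), end="")
```

The relabeled records could then be appended to a copy of the curated library before passing that file to EDTA with --curatedlib, as in the command used in this thread.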
Let me know if you have more questions.
Best, Shujun
Thank you Shujun! I will use that curated rice library and add some rDNA then.
Hi Dario,
I would like to bring to your attention that #rDNA/spacer is just one part of the full rDNA repeat. There are entries for other rDNA components in EDTA:
rDNA_intergenic_spacer_element SO:0001860 rDNA_intergenic_spacer_element,rDNA/spacer,rDNA/IGS
2S_rRNA_gene SO:0002336 2S_rRNA_gene,rRNA_2S_gene,cytosolic_rRNA_2S_gene,rDNA/2S,2S_rRNA
5S_rRNA_gene SO:0002238 5S_rRNA_gene,cytosolic_rRNA_5S_gene,rDNA/5S,5S_rRNA
5_8S_rRNA_gene SO:0002240 5_8S_rRNA_gene,cytosolic_rRNA_5_8S_gene,rDNA/5.8S,5.8S_rRNA,rDNA/5_8S,5_8S_rRNA
23S_rRNA_gene SO:0002243 23S_rRNA_gene,rDNA/23S,23S_rRNA
25S_rRNA_gene SO:0002242 25S_rRNA_gene,rDNA/25S,25S_rRNA
28S_rRNA_gene SO:0002239 28S_rRNA_gene,rDNA/28S,28S_rRNA
18S_rRNA_gene SO:0002236 18S_rRNA_gene,cytosolic_rRNA_18S_gene,rDNA/18S,18S_rRNA
16S_rRNA_gene SO:0002237 16S_rRNA_gene,cytosolic_rRNA_16S_gene,rDNA/16S,16S_rRNA
rRNA_gene SO:0002360 rRNA_gene,rDNA/45S
rRNA SO:0000252 rRNA
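To check whether a header's classification matches one of the aliases listed above, something like the following Python sketch could be used. It assumes each entry has three whitespace-separated columns (term, SO ID, comma-separated aliases) as shown in this thread, that comment lines start with "#", and that matching is exact and case-sensitive; these are assumptions for the example, and EDTA's own matching may differ. The names parse_ontology and lookup_header are hypothetical.

```python
# A few entries copied from the list above; the real file has many more.
ONTOLOGY_EXCERPT = """\
rDNA_intergenic_spacer_element SO:0001860 rDNA_intergenic_spacer_element,rDNA/spacer,rDNA/IGS
5S_rRNA_gene SO:0002238 5S_rRNA_gene,cytosolic_rRNA_5S_gene,rDNA/5S,5S_rRNA
rRNA_gene SO:0002360 rRNA_gene,rDNA/45S
"""


def parse_ontology(text: str) -> dict:
    """Map every alias to its (term, SO ID) pair."""
    alias_to_term = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        fields = line.split()
        if len(fields) < 3:
            continue  # not a term / SO ID / aliases row
        term, so_id, aliases = fields[0], fields[1], fields[2]
        for alias in aliases.split(","):
            alias_to_term[alias] = (term, so_id)
    return alias_to_term


def lookup_header(header: str, alias_to_term: dict):
    """Return (term, SO ID) for a >name#classification header, or None."""
    _, sep, classification = header.lstrip(">").partition("#")
    tokens = classification.split()
    if not sep or not tokens:
        return None  # header carries no classification
    return alias_to_term.get(tokens[0])


if __name__ == "__main__":
    table = parse_ontology(ONTOLOGY_EXCERPT)
    for hdr in (">rice_rDNA_spacer#rDNA/spacer", ">rice_45S#rDNA/45S", ">odd#rDNA/unknown"):
        print(hdr, "->", lookup_header(hdr, table))
```

Running a check like this over a custom library before an EDTA run may help catch headers whose labels are not among the listed aliases.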
If you can curate some of the rice rDNA sequences, that would be great! And I would love to include them in the rice repeat library. Please let me know.
Best, Shujun
Hi Shujun, Sounds good: since we will work on several rice genome assemblies, we would benefit from a comprehensive characterization of all repeats. Soon we will have a researcher working on this, and we will put together a set of representative centromeric and ribosomal sequences to add to the rice library. We will get back to you when we have updates. Cheers, Dario
That's exciting! Let me know if you need any help. Good luck.
Cheers, Shujun
Hello,
I would like to annotate a gapless rice assembly, so I ran EDTA with this command:
EDTA.pl --genome genome.fa --species Rice --cds CDS.fa --curatedlib ../RepBase22.03.fasta/oryrep.ref --sensitive 1 --anno 1 --evaluate 1 -t 30
I thought of adding RepBase's curated rice library as a guide for naming and completeness (e.g. for LINEs, see below). The TEanno.sum file looks like this:
I think the total amount of repeats is fair/OK, but I did not expect so many bases to be left unclassified. Is this normal?
Then, knowing that EDTA is not good at finding LINEs de novo, I thought that they could be found by homology from the curated library. Does EDTA use the curated library at all (e.g. with RepeatMasker)? I think that would be a step that can help recover known TEs, maybe just for some categories. Then, when a region found with RepeatMasker is also found by the main EDTA pipeline, the former could be overwritten, for example. I know the tools to find LINEs de novo are tricky.
Lastly, it would be nice to also have other important repeated sequences masked, like rDNA and tandem repeats. With complete genomes one can easily get a few representatives of these (with Tandem Repeats Finder?), and it would be nice to be able to supply them as input and get that annotation in the same gff as well. Any plans for that? Thanks,
Dario