oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

TEanno.sum generation #169

Closed by wyim-pgl 3 years ago

wyim-pgl commented 3 years ago

Hi Shujun, is there any way to generate TEanno.sum from TEanno.gff3? Thanks.

oushujun commented 3 years ago

Yes, you can. The easiest way is to rerun EDTA with --anno 1 --step anno.

If you want to challenge yourself by making a .sum file from an arbitrary gff3 file, this is a good start:

Convert gff3 to bed

perl ../util/gff2bed.pl $genome.EDTA.TEanno.gff3 > $genome.EDTA.TEanno.bed;

Convert bed to RepeatMasker .out

perl -nle 'my ($chr, $s, $e, $anno, $dir, $supfam)=(split)[0,1,2,3,8,12]; print "10000 0.001 0.001 0.001 $chr $s $e NA $dir $anno $supfam"' $genome.EDTA.TEanno.bed > $genome.EDTA.TEanno.out;

Obtain genome size and seq count

perl .../EDTA/util/count_base.pl $genome > $genome.stats

Regenerate .sum

perl ../util/buildSummary.pl -maxDiv 40 -stats $genome.stats $genome.EDTA.TEanno.out > $genome.EDTA.TEanno.sum 2>/dev/null;

In short: make an enriched bed file from the gff3 file, convert that bed file to a mock RepeatMasker .out file, then run the summary script (buildSummary.pl, modified from RepeatMasker) to produce the .sum file.

Best, Shujun