TE_XXX in gff3 from panEDTA

Hello,

I am using EDTA and panEDTA to annotate the genomes of 40 related species. I annotated each genome individually with EDTA v2.2.0 and generated a panEDTA library. Then, for each genome, I ran:

RepeatMasker -e ncbi -pa 40 -q -div 40 -lib ${panEDTA.TElib} -cutoff 225 -gff ${genome}.mod.panEDTA > /dev/null
perl -i -nle 's/\s+DNA\s+/\tDNA\/unknown\t/; print $_' ${genome}.mod.panEDTA.out
EDTA.pl --genome ${genome} -t 40 --step final --anno 1 --curatedlib ${panEDTA.TElib} --cds ${cds} --rmout ${genome}.mod.panEDTA.out

These commands are copied from panEDTA.sh so that I can run the genomes in parallel.

As I understand it, each sequence in the panEDTA TE library should represent one TE family, and I am trying to extract the genomic sequences belonging to each family. While doing so, I found some unexpected Name values in the attributes field of TEanno.gff3:

(1) Some panTE_XXX names appear in the gff3 but not in panEDTA.TElib. Instead, the library contains panTE_XXX_INT and panTE_XXX_LTR.
(2) Some TE_XXX names appear in the gff3 but not in panEDTA.TElib.
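A comparison of this kind can be sketched in Python as follows. This is only a minimal illustration, not part of EDTA: it assumes that library FASTA headers look like `>name#classification` and that each gff3 record carries a `Name=` attribute in column 9. The function names are hypothetical.

```python
import re


def library_names(fasta_lines):
    """Collect sequence names from FASTA headers, dropping any '#class' suffix.

    Assumption: headers look like '>panTE_XXX_INT#LTR/Gypsy'.
    """
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0].split("#")[0])
    return names


def gff_names(gff_lines):
    """Collect the Name= attribute values from column 9 of a GFF3 file."""
    names = set()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        match = re.search(r"(?:^|;)Name=([^;]+)", cols[8])
        if match:
            names.add(match.group(1))
    return names


def unmatched_names(gff, lib):
    """Map each GFF3 name missing from the library to its _INT/_LTR entries.

    An empty list means the name has no counterpart of either kind.
    """
    result = {}
    for name in sorted(gff - lib):
        parts = [name + suffix for suffix in ("_INT", "_LTR")]
        result[name] = [p for p in parts if p in lib]
    return result
```

With real files, `library_names(open("panEDTA.TElib"))` and `gff_names(open("genome.TEanno.gff3"))` would supply the two sets (both file names are placeholders). Names that map to a non-empty list correspond to case (1), and names that map to an empty list correspond to case (2).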

Lastly, how would you count the copy number of each TE family? I computed the ratio of the length of each annotated region in the gff3 to the length of the corresponding sequence in panEDTA.TElib, and it varies widely. Here are the quantiles of that ratio:

> quantile(df$lengthABOVETE.fam.len,na.rm =TRUE,probs=seq(0,1,0.1))
          0%          10%          20%          30%          40%          50% 
 0.005845817  0.080485612  0.116917626  0.162465915  0.221638655  0.288018433 
         60%          70%          80%          90%         100% 
 0.376657825  0.494324624  0.678725237  0.937500000 73.812785388
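For concreteness, a ratio like the one summarised above could be computed along the following lines. This is a rough Python sketch rather than the exact analysis: it assumes the same `>name#classification` header layout and `Name=` attribute as before, the function names are hypothetical, and the `min_ratio`/`max_ratio` window is an arbitrary illustrative filter, not a recommended threshold.

```python
from collections import Counter


def family_lengths(fasta_lines):
    """Return {family name: sequence length} for a FASTA library."""
    lengths = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            # Assumption: the name is the part of the header before '#'.
            name = fields[0].split("#")[0] if fields else None
            if name is not None:
                lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def length_ratios(gff_lines, lengths):
    """Yield (family, region length / library length) for each GFF3 record."""
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        name = attrs.get("Name")
        if name in lengths and lengths[name] > 0:
            # GFF3 coordinates are 1-based and inclusive.
            span = int(cols[4]) - int(cols[3]) + 1
            yield name, span / lengths[name]


def count_copies(ratios, min_ratio=0.8, max_ratio=1.25):
    """Count records per family whose ratio falls inside the given window."""
    return Counter(name for name, r in ratios if min_ratio <= r <= max_ratio)
```

Records whose Name is absent from the library are skipped silently here, so the output of `count_copies` only covers families that can be matched by name.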

I am not sure whether these extremely short or long regions are really transposons, or whether it is a good idea to include them in analyses of the evolution of individual TE families (e.g. copy number dynamics). Do you have any suggestions?

Sincerely,

Cong

oushujun / EDTA

TE_XXX in gff3 from panEDTA #462