oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

Lots of unknowns #159

Closed fungal-spore closed 3 years ago

fungal-spore commented 3 years ago

Hi, EDTA found lots of unknown TEs. I'm just wondering what would have flagged these as TEs in the first place if they are unknown. Repeat content? I want to be able to say something like "TEs categorized as unknown were classed as TEs based on....."

I attempted to upload an image of the parse results here; hopefully it came through OK. Thanks!

image

oushujun commented 3 years ago

Hi,

EDTA first uses structural features to identify intact TEs. For example, if a sequence has terminal repeats and satisfies a number of related features, it is classified as an LTR retrotransposon. EDTA then tries to classify TEs into superfamilies, e.g., Gypsy and Copia, based on coding features; those that cannot be assigned are named LTR/unknown.

If you use the --sensitive 1 option, RepeatModeler2 is recruited to identify repetitive sequences that were not reported by the structural module of EDTA. Because they lack homology and coding features, most of these are named unknown/unknown. We have lower confidence in these categories, so you may want to filter them with additional measures, e.g., copy number, overlap with genes, etc.
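As a rough illustration of the copy-number filter suggested above, here is a minimal Python sketch, not part of EDTA itself. It assumes the annotation is a GFF3 file whose ninth column carries `Name=` (family) and `Classification=` attributes; check your own output for the exact attribute names. The function name `low_copy_unknown_families`, the threshold of 3 copies, and the example records are all hypothetical choices made for illustration.

```python
from collections import Counter


def parse_attributes(field):
    """Turn a GFF3 attribute string such as 'k1=v1;k2=v2' into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def low_copy_unknown_families(gff_lines, min_copies=3):
    """Count annotated copies per family for 'unknown' classifications.

    Returns {family: copies} for families with fewer than min_copies
    annotations. min_copies=3 is an arbitrary illustrative threshold.
    """
    counts = Counter()
    for line in gff_lines:
        # Skip blank lines and GFF3 comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        attrs = parse_attributes(cols[8])
        # Assumption: the classification label ends in "unknown",
        # e.g. "LTR/unknown" or "unknown/unknown".
        if attrs.get("Classification", "").lower().endswith("unknown"):
            counts[attrs.get("Name", "NA")] += 1
    return {fam: n for fam, n in counts.items() if n < min_copies}


# Hypothetical records for demonstration only.
example = [
    "##gff-version 3",
    "chr1\tEDTA\trepeat_region\t100\t500\t.\t+\t.\tID=te1;Name=TE_A;Classification=unknown/unknown",
    "chr1\tEDTA\trepeat_region\t900\t1300\t.\t-\t.\tID=te2;Name=TE_A;Classification=unknown/unknown",
    "chr2\tEDTA\trepeat_region\t50\t4000\t.\t+\t.\tID=te3;Name=TE_B;Classification=LTR/unknown",
    "chr2\tEDTA\trepeat_region\t7000\t12000\t.\t+\t.\tID=te4;Name=TE_C;Classification=LTR/Gypsy",
]
print(low_copy_unknown_families(example, min_copies=2))
```

For a real file, pass an open file handle instead of the list, e.g. `low_copy_unknown_families(open(path))`. The families it returns are candidates to drop or review by hand. Overlap with genes would need a separate interval comparison and is not covered here.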

Best, Shujun