oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

Is there an output which contains more detailed data on nested insertions? #260

Closed by swomics 2 years ago

swomics commented 2 years ago

Hi, I've had some really interesting results from EDTA, thank you for your work on this program. I have a question about the nested TE analysis performed by EDTA: I have looked at the files ".stat.all.sum", ".stat.nested.sum" and ".stat.redun.sum", but these outputs appear to contain quite high-level data. Is it possible to see the underlying annotations, for example the locations of specific nested insertions, alongside the inferred identity of the more recent and the ancestral insertions?

oushujun commented 2 years ago

Hi @swomics,

Glad that EDTA is helping with your study. The files ".stat.all.sum", ".stat.nested.sum" and ".stat.redun.sum" describe the level of annotation inconsistency in the data. The inconsistency can have technical causes (EDTA fails to classify some TEs accurately) or biological ones (some TEs are nested inside other TEs). Basically, the ".stat.redun.sum" file describes the technical inconsistency, the ".stat.nested.sum" file describes the biological inconsistency, and the ".stat.all.sum" file summarizes both.

To identify nested insertions, you can run bedtools intersect on the GFF3 files, or simply filter for TE annotations whose coordinates fall within those of other TEs. Distinguishing the more recent from the ancestral insertion in a nested pair may need more sophisticated approaches, and I am not sure whether any are available. For nested LTR retrotransposons, this is quite easy to infer by comparing the LTR identity of the internal and external elements.
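The coordinate filter described above could be sketched roughly as follows. This is only an illustration, not part of EDTA: the function names `parse_gff3` and `find_nested` are made up for this example, the sample records are invented, and it assumes a standard 9-column tab-separated GFF3 in which structural sub-features (for example terminal repeats) carry a `Parent=` attribute and can be skipped. It reports an annotation as nested when its interval lies entirely inside another annotation on the same sequence; annotations with identical coordinates are ignored, since those look more like redundancy than nesting.

```python
from collections import defaultdict


def parse_gff3(lines):
    """Yield (seqid, start, end, type, attributes) for top-level feature lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # not a complete GFF3 record
        # Assumption: sub-features carry Parent= and are not TEs in their own right
        if "Parent=" in cols[8]:
            continue
        yield cols[0], int(cols[3]), int(cols[4]), cols[2], cols[8]


def find_nested(lines):
    """Return (seqid, inner, outer) for every feature fully contained in another."""
    by_seq = defaultdict(list)
    for seqid, start, end, ftype, attrs in parse_gff3(lines):
        by_seq[seqid].append((start, end, ftype, attrs))

    pairs = []
    for seqid, feats in by_seq.items():
        # Sort by start, longest first, so a container always precedes its contents
        feats.sort(key=lambda f: (f[0], -f[1]))
        active = []  # earlier features that still reach the current start
        for feat in feats:
            start, end = feat[0], feat[1]
            active = [a for a in active if a[1] >= start]
            for outer in active:
                same_span = (outer[0], outer[1]) == (start, end)
                if outer[1] >= end and not same_span:
                    pairs.append((seqid, feat, outer))
            active.append(feat)
    return pairs


if __name__ == "__main__":
    # Invented records for illustration only
    example = [
        "##gff-version 3",
        "chr1\tEDTA\tGypsy_LTR_retrotransposon\t100\t1000\t.\t+\t.\tID=outer1",
        "chr1\tEDTA\thelitron\t300\t500\t.\t-\t.\tID=inner1",
        "chr1\tEDTA\tCACTA_TIR_transposon\t900\t1200\t.\t+\t.\tID=overlap1",
    ]
    for seqid, inner, outer in find_nested(example):
        print(seqid, inner[3], "is nested in", outer[3])
```

To run it on a real annotation, pass an open file handle (for example `find_nested(open(path))`, with `path` pointing at your GFF3 file). Partial overlaps such as `overlap1` are deliberately not reported. If you prefer bedtools, intersecting the file with itself using `-f 1.0 -wa -wb` should give a similar result, though self-hits would then need to be filtered out.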

Best, Shujun

swomics commented 2 years ago

Thanks for the quick reply Shujun! I will have a go at the two suggested approaches.