oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

many TEs in genes #368

Closed jzohren closed 8 months ago

jzohren commented 1 year ago

Hi Shujun,

Thanks for providing us with this easy-to-use TE annotation pipeline! I ran it on a 330 Mb plant genome using the following command:

perl EDTA.pl --genome $FASTA --cds $CDS --anno 1 --threads 32 --sensitive 1

It generally went well; however, many TEs were annotated within genes, despite my providing the CDS FASTA file. Most of these TEs seem to be Helitron or TIR elements, and some even span exon-intron boundaries. Of the 31,025 genes in our annotation, 7,668 (24.7%) overlapped with at least one TE, which I find surprisingly high. While the true number is difficult to know for a de novo annotation like this, I suspect that many of these are in fact false positives. Below is a screenshot from IGV showing one of many examples of this. If you, or anyone else reading this, have any input, I'd much appreciate it.

Thanks! Jasmin

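[image: IGV screenshot] https://user-images.githubusercontent.com/5310751/251498929-d0097cbe-0fc8-4c59-b183-23c6f0034828.png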
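For readers who want to reproduce a gene-level figure like the 24.7% above, here is a minimal Python sketch. It assumes both the gene annotation and the TE annotation are GFF3 files with 1-based inclusive coordinates; the function names `parse_gff3` and `genes_overlapping_tes` are made up for illustration and are not part of EDTA. The pairwise scan is deliberately naive, so for a full genome an interval index or a dedicated tool would be a better fit.

```python
from collections import defaultdict


def parse_gff3(lines, wanted_types=None):
    """Yield (seqid, type, start, end) from GFF3 lines, skipping comments."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GFF3 feature line
        if wanted_types is not None and cols[2] not in wanted_types:
            continue
        yield cols[0], cols[2], int(cols[3]), int(cols[4])


def genes_overlapping_tes(gene_lines, te_lines):
    """Return (genes_with_at_least_one_te, total_genes)."""
    tes = defaultdict(list)
    for seqid, _type, start, end in parse_gff3(te_lines):
        tes[seqid].append((start, end))

    total = hit = 0
    for seqid, _type, g_start, g_end in parse_gff3(gene_lines, {"gene"}):
        total += 1
        # Two closed intervals overlap when each starts before the other ends.
        if any(s <= g_end and e >= g_start for s, e in tes.get(seqid, ())):
            hit += 1
    return hit, total
```

Called with two open file handles, this returns the pair of counts; dividing the first by the second gives the fraction of genes touched by a TE (7,668 / 31,025 in this thread).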

oushujun commented 1 year ago

Hi Jasmin,

The example you show is odd. If many cases look like this, you may need to investigate further.

Before doing so, please categorize these overlaps by where they were found: UTRs, CDS, or introns. It's common for TEs to insert into UTRs and introns.

You can also check the EDTA.final folder to see how many CDS were removed before being used by EDTA. If many were removed, the gene annotation may not be as clean as you thought.

Best, Shujun
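The categorization suggested above could be sketched in Python roughly as follows. This is an illustration only, not an EDTA feature: `introns_from_exons` and `tally_te_overlaps` are hypothetical helper names, coordinates are assumed to be 1-based and inclusive as in GFF3, and introns are derived from the gaps between exons because many gene annotations do not list them explicitly.

```python
from collections import Counter, defaultdict


def introns_from_exons(exons):
    """Given (start, end) exons of ONE transcript, return the gaps between them."""
    ordered = sorted(exons)
    return [
        (a_end + 1, b_start - 1)
        for (_, a_end), (b_start, _) in zip(ordered, ordered[1:])
        if b_start - a_end > 1  # skip adjacent or overlapping exons
    ]


def tally_te_overlaps(features, tes):
    """Count TEs overlapping at least one feature of each type.

    features: iterable of (seqid, feature_type, start, end)
    tes:      iterable of (seqid, start, end)
    A TE that touches several feature types is counted once per type.
    """
    by_seq = defaultdict(list)
    for seqid, ftype, start, end in features:
        by_seq[seqid].append((ftype, start, end))

    counts = Counter()
    for seqid, t_start, t_end in tes:
        types_hit = {
            ftype
            for ftype, s, e in by_seq.get(seqid, ())
            if s <= t_end and e >= t_start
        }
        counts.update(types_hit)
    return counts
```

Because one TE can span, say, both a CDS and the neighbouring intron, the per-category counts from a tally like this need not add up to the number of TEs that overlap genes.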

jzohren commented 1 year ago

Hi Shujun,

Thanks so much for your quick response and your suggestions.

A total of 286623 TEs were found, of which the following overlapped annotated features:

14048 in genes
8312 in exons
9048 in introns
3993 in CDS
6089 in UTRs

I'm not sure what to look for in the EDTA.final folder, but this is what I found:

$ grep -c ">" XXX.genes.cds.fasta.code.TE
751
$ grep -c ">" XXX.genes.cds.fasta.code.noTE
29681
$ wc -l XXX.fasta.mod.EDTA.intact.removed.gff3
24 
$ grep -c ">" XXX.fasta.mod.EDTA.intact.fa.rmCDS
6388

There are 35281 sequences in the cds.fasta file.

Thanks again, Jasmin
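To put the numbers in this comment into proportion, here is a small Python sketch of the arithmetic. The counts are copied from the comment; the reading that `.code.TE` holds CDS flagged as TE-related and `.code.noTE` holds the CDS that were kept is an assumption based on the file names, and the thread does not say why those two files together hold fewer sequences than the input CDS file.

```python
def pct(part, whole):
    """Percentage of `part` in `whole`."""
    return 100.0 * part / whole


# Counts reported in the comment above.
total_tes = 286_623
tes_in_genes = 14_048
tes_in_cds = 3_993
cds_flagged_te = 751    # sequences in *.code.TE (assumed: flagged as TE-related)
cds_kept = 29_681       # sequences in *.code.noTE (assumed: retained)
cds_input = 35_281      # sequences in the input CDS FASTA

summary = {
    "TEs in genes (%)": pct(tes_in_genes, total_tes),
    "TEs in CDS (%)": pct(tes_in_cds, total_tes),
    "screened CDS flagged as TE (%)": pct(cds_flagged_te, cds_flagged_te + cds_kept),
    "input CDS in neither file": cds_input - (cds_flagged_te + cds_kept),
}

for label, value in summary.items():
    print(f"{label}: {value:.1f}" if isinstance(value, float) else f"{label}: {value}")
```

On these figures only a few percent of all annotated TEs touch a gene at all, and a smaller share still touches CDS.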

jzohren commented 1 year ago

Accidentally closed the issue with my last comment, reopening it. Sorry about that!

oushujun commented 1 year ago

Looks like the provided CDS is doing its job and looks OK. Among the overlapping TEs, 3993 overlap with CDS, which is far fewer than the original number (it could be even fewer if you count the genes carrying these CDS). Next, you may want to check which types of TEs overlap with CDS, since these could be biologically relevant. To check whether the overlapping part is a real TE, you can BLAST that CDS against the whole genome and see whether there are any alignments other than the CDS itself. Cross-checks like these will help you learn more about what's going on.

Shujun
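One way to run the cross-check described above is to BLAST the CDS against the genome with tabular output (BLAST+ `-outfmt 6`) and then look for hits that fall outside each CDS's own gene. The Python sketch below assumes that default 12-column format; the function name `off_locus_hits`, the `own_locus` mapping, and the identity and length cut-offs are placeholders chosen for illustration, not values recommended in this thread.

```python
from collections import defaultdict


def off_locus_hits(blast_lines, own_locus, min_identity=80.0, min_length=80):
    """Collect genome hits that do not overlap the query's own gene.

    blast_lines: lines of BLAST tabular output (outfmt 6, default columns:
                 qseqid sseqid pident length mismatch gapopen
                 qstart qend sstart send evalue bitscore)
    own_locus:   dict mapping query ID -> (seqid, start, end) of its gene span
    Thresholds are arbitrary placeholders; tune them for your data.
    """
    hits = defaultdict(list)
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        qseqid, sseqid = f[0], f[1]
        pident, length = float(f[2]), int(f[3])
        # Minus-strand hits report sstart > send, so normalise the order.
        s_start, s_end = sorted((int(f[8]), int(f[9])))
        if pident < min_identity or length < min_length:
            continue
        locus = own_locus.get(qseqid)
        if locus and locus[0] == sseqid and s_start <= locus[2] and s_end >= locus[1]:
            continue  # alignment back to the CDS's own gene
        hits[qseqid].append((sseqid, s_start, s_end))
    return dict(hits)
```

A CDS with many strong hits elsewhere in the genome is a candidate for containing a genuinely repetitive (possibly TE-derived) segment, whereas a CDS that only aligns to its own locus suggests the TE call there deserves a closer look. Using the whole gene span as the own locus keeps the separate exon alignments of a spliced CDS from being counted as extra copies.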

oushujun commented 1 year ago

Any updates?