oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

Structural TEs appear not to be renamed by panEDTA #397

Open joannarifkin opened 8 months ago

joannarifkin commented 8 months ago

Hi Shujun,

Another question. I'm running panEDTA on a bunch of species. It says it's successfully reannotating the structurally annotated TEs, e.g.:

Sat Oct 14 15:10:05 EDT 2023 EDTA final stage finished! You may check out:
The final EDTA TE library: Cviolacea_585_v2.0.fa.mod.EDTA.TElib.fa
Family names of intact TEs have been updated by genome_list_local.txt.panEDTA.TElib.fa: Cviolacea_585_v2.0.fa.mod.EDTA.intact.gff3
Comparing to the provided library, EDTA found these novel TEs: Cviolacea_585_v2.0.fa.mod.EDTA.TElib.novel.fa
The provided library has been incorporated into the final library: Cviolacea_585_v2.0.fa.mod.EDTA.TElib.fa

In the output, both Cviolacea_585_v2.0.fa.mod.EDTA.TElib.fa and genome_list_local.txt.panEDTA.TElib.fa include numerous sequences headed "panTE," but in Cviolacea_585_v2.0.fa.mod.EDTA.intact.gff3 no TEs are annotated with the heading "panTE." Similarly, if I filter Cviolacea_585_v2.0.fa.mod.EDTA.TEanno.gff3 for method=structural, no TEs are annotated as "panTE."
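A check like the one described above can be scripted. This is only a minimal sketch: it assumes the GFF3 attribute column carries `Name=` and `Method=` keys (matched case-insensitively), and the function names `parse_attributes` and `count_structural_names` are hypothetical helpers, not part of EDTA.

```python
from collections import Counter


def parse_attributes(field):
    """Split a GFF3 attribute column into a dict with lower-cased keys."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip().lower()] = value.strip()
    return attrs


def count_structural_names(gff_lines, prefix="panTE"):
    """Count structural features whose Name starts with `prefix` vs. all others.

    Assumption: structural annotations are tagged Method=structural and the
    family name is stored under Name in column 9.
    """
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF3 comments/directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GFF3 feature line
        attrs = parse_attributes(cols[8])
        if attrs.get("method", "").lower() != "structural":
            continue
        name = attrs.get("name", "")
        counts["with_prefix" if name.startswith(prefix) else "other"] += 1
    return counts
```

To use it on a real file, pass an open file handle, e.g. `count_structural_names(open("Cviolacea_585_v2.0.fa.mod.EDTA.intact.gff3"))`. A result with zero `with_prefix` entries would match the behavior reported here.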

The error log repeats this message many times for each genome:

Unspecified/NA not found in the TE_SO database, it will not be used to rename sequences in the final annotation.

I assume this is where the problem is coming from?

This seems to have happened to all the genomes I included, and appears to be just a problem with updating the names. What information would help you solve this?

Thanks!

Joanna

joannarifkin commented 7 months ago

Hi again,

Any sense of what's going on here?

Thanks!

Joanna

joannarifkin commented 7 months ago

Hello again,

Just wondering whether you had any thoughts about what might be wrong here. I tried running just the reannotation command (perl $path/EDTA.pl --genome $genome -t $threads --step final --anno 1 --curatedlib $genome_list.panEDTA.TElib.fa --cds $cds_ind --rmout $genome.mod.panEDTA.out done < $genome_list) separately, but it didn't solve the problem.

Thanks,

Joanna

oushujun commented 7 months ago

Hi Joanna,

Sorry about the long delay. I haven't had the chance to investigate this issue yet, but it's on my to-do list.

Thank you! Shujun

joannarifkin commented 7 months ago

Hi Shujun,

Thanks! I know you're swamped and there's a long backlog. Sorry to pester and I look forward to hearing what the solution is when you have a chance to get to the bottom of it.

All the best,

Joanna


oushujun commented 7 months ago

One thing I noticed is that there is no "panTE" naming in the pan-TE library. The entries are all named in the regular format, i.e., TE_000xxxxx. TEs in $genome_list.panEDTA.TElib.fa are used to rename structural TEs, but not all of them: single-copy TEs are named by their coordinates (Chrx:xxx..xxx), and multi-copy TEs that do not have enough full-length copies in the genome keep their TE_000xxxxx names. The latter is a problem for the pan-TE library because both use the same name format. I tallied several dozen genomes, and the fraction of the genome in the second category is around 0.5% - 2.8%. I think this is an acceptable level, but I will need to change these names to something distinguishable from the pan-TE library names.
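To see how the names in a given annotation fall into these categories, something like the following could be used. It is a rough sketch under stated assumptions: the regular expressions are guesses at the three name formats mentioned in this thread (coordinate-style `Chrx:xxx..xxx`, `TE_000xxxxx`, and a `panTE_` prefix), and `classify_name` and `tally` are hypothetical helper names, not EDTA functions.

```python
import re
from collections import Counter

# Assumed name formats, checked in order; none of these come from EDTA's code.
PATTERNS = [
    ("pan_library", re.compile(r"^panTE_\d+")),          # e.g. panTE_00001958_LTR
    ("family_id", re.compile(r"^TE_\d+")),               # e.g. TE_00000123
    ("coordinate", re.compile(r"^[^:\s]+:\d+\.\.\d+")),  # e.g. Chr1:100..2000
]


def classify_name(name):
    """Return the label of the first pattern that matches, else 'other'."""
    for label, pattern in PATTERNS:
        if pattern.match(name):
            return label
    return "other"


def tally(names):
    """Count how many names fall into each category."""
    return Counter(classify_name(n) for n in names)
```

Running `tally` over the `Name` values of the structural features would show how many TEs are coordinate-named, how many kept a `TE_` family ID, and how many carry a `panTE_` name.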

Shujun

joannarifkin commented 7 months ago

Thanks for keeping me updated!

I'm not entirely sure I follow. In my most recent run they do seem to be renamed panTE in genome_list_local.txt.panEDTA.TElib.fa (sample from "grep '>'" below) and are in the species-specific gffs from RepeatMasker, but the new names don't make it all the way into the final [genome].fa.mod.EDTA.[intact|TEanno].gff3. But it sounds like the problem is that in each individual genome, they're being renamed in a way where panEDTA is getting them confused between genomes?

Let me know if you need anything from me!

All the best,

Joanna

panTE_00001958_LTR#LTR/Copia
panTE_00001959#DNA/DTM
panTE_00001960_LTR#LTR/Copia
panTE_00001961_LTR#LTR/Copia
panTE_00001962_INT#LTR/Gypsy
panTE_00001963_INT#LTR/Copia
panTE_00001964_LTR#LTR/Gypsy
panTE_00001965_LTR#LTR/Copia
panTE_00001966_LTR#LTR/unknown
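Headers like the sample above can be summarized with a short script. This is only an illustrative sketch: it assumes each header has the form `name#classification`, and `parse_header` and `count_by_classification` are hypothetical helpers rather than anything shipped with EDTA.

```python
from collections import Counter


def parse_header(header):
    """Split a library header of the assumed form 'name#classification'."""
    token = header.strip().lstrip(">").split()[0]
    name, _, classification = token.partition("#")
    return name, classification or "unclassified"


def count_by_classification(headers):
    """Count library entries per classification string."""
    return Counter(parse_header(h)[1] for h in headers if h.strip())


sample = [
    "panTE_00001958_LTR#LTR/Copia", "panTE_00001959#DNA/DTM",
    "panTE_00001960_LTR#LTR/Copia", "panTE_00001961_LTR#LTR/Copia",
    "panTE_00001962_INT#LTR/Gypsy", "panTE_00001963_INT#LTR/Copia",
    "panTE_00001964_LTR#LTR/Gypsy", "panTE_00001965_LTR#LTR/Copia",
    "panTE_00001966_LTR#LTR/unknown",
]
counts = count_by_classification(sample)
```

The same function accepts the output of `grep '>'` on the library file, read line by line.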
