Structural TEs appear not to be renamed by panEDTA

joannarifkin commented 1 year ago

Hi Shujun,

Another question. I'm running panEDTA on a bunch of species. It says it's successfully reannotating the structurally annotated TEs, e.g.:

Sat Oct 14 15:10:05 EDT 2023 EDTA final stage finished! You may check out:
The final EDTA TE library: Cviolacea_585_v2.0.fa.mod.EDTA.TElib.fa
Family names of intact TEs have been updated by genome_list_local.txt.panEDTA.TElib.fa: Cviolacea_585_v2.0.fa.mod.EDTA.intact.gff3
Comparing to the provided library, EDTA found these novel TEs: Cviolacea_585_v2.0.fa.mod.EDTA.TElib.novel.fa
The provided library has been incorporated into the final library: Cviolacea_585_v2.0.fa.mod.EDTA.TElib.fa

In the output, both Cviolacea_585_v2.0.fa.mod.EDTA.TElib.fa and genome_list_local.txt.panEDTA.TElib.fa include numerous sequences headed "panTE," but in Cviolacea_585_v2.0.fa.mod.EDTA.intact.gff3 no TEs are annotated with the heading "panTE." Similarly, if I filter Cviolacea_585_v2.0.fa.mod.EDTA.TEanno.gff3 for method=structural, no TEs are annotated as "panTE."
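A quick way to quantify this check is to tally name prefixes among the structural rows of a GFF3 file. The sketch below is only an illustration: it assumes the annotation stores the family name under a `Name` attribute and the annotation method under a `Method` attribute in column 9 (keys are compared case-insensitively), and the function names are hypothetical, not part of EDTA.

```python
from collections import Counter


def parse_attributes(column9):
    """Turn 'key=value;key=value' into a dict with lowercased keys."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip().lower()] = value.strip()
    return attrs


def count_name_prefixes(gff_lines, method="structural"):
    """Count 'panTE' vs 'other' names among rows annotated with `method`.

    Assumes (not confirmed by the thread) that column 9 carries
    Name=... and Method=... attributes.
    """
    counts = Counter()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # not a complete GFF3 feature row
        attrs = parse_attributes(columns[8])
        if attrs.get("method", "").lower() != method:
            continue
        name = attrs.get("name", "")
        counts["panTE" if name.startswith("panTE") else "other"] += 1
    return counts


# Example usage (file name taken from the log above):
# with open("Cviolacea_585_v2.0.fa.mod.EDTA.TEanno.gff3") as handle:
#     print(count_name_prefixes(handle))
```

A result with zero `panTE` entries among structural rows would match the behavior described above.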

The error log repeats this message many times for each genome:

Unspecified/NA not found in the TE_SO database, it will not be used to rename sequences in the final annotation.

I assume this is where the problem is coming from?

This seems to have happened to all the genomes I included, and appears to be just a problem with updating the names. What information would help you solve this?

Thanks!

Joanna

joannarifkin commented 1 year ago

Hi again,

Any sense of what's going on here?

Thanks!

Joanna

joannarifkin commented 1 year ago

Hello again,

Just wondering whether you had any thoughts about what might be wrong here. I tried running just the reannotation command (perl $path/EDTA.pl --genome $genome -t $threads --step final --anno 1 --curatedlib $genome_list.panEDTA.TElib.fa --cds $cds_ind --rmout $genome.mod.panEDTA.out done < $genome_list) separately, but it didn't solve the problem.

Thanks,

Joanna

oushujun commented 1 year ago

Hi Joanna,

Sorry about the long delay. I haven't had the chance to investigate this issue yet, but it's on my to-do list.

Thank you! Shujun

joannarifkin commented 1 year ago

Hi Shujun,

Thanks! I know you're swamped and there's a long backlog. Sorry to pester and I look forward to hearing what the solution is when you have a chance to get to the bottom of it.

All the best,

Joanna

oushujun commented 1 year ago

One thing I noticed is that there is no "panTE" naming in the pan-TE library. They were all named regularly, i.e., TE_000xxxxx. TEs in the $genome_list.panEDTA.TElib.fa were used to rename structural TEs, but not all of them: single-copy TEs are named Chrx:xxx..xxx (their coordinates), and multi-copy TEs that do not have enough full-length copies in the genome remain named TE_000xxxxx. The latter present a problem for the pan-TE library because they are named in the same format. I checked several dozen genomes, and the fraction of the genome in the second category is around 0.5% - 2.8%. I think this is an acceptable level, but I will need to change their names to something else so they are distinguishable from the pan-TE library entries.
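The three naming outcomes described here can be told apart by pattern. The following is a minimal sketch, assuming the pan-TE library entries carry a `panTE_` prefix (as in some runs), that coordinate names look like `Chr1:100..200`, and that per-genome families look like `TE_` followed by digits. The patterns and the function name are illustrative, not taken from the EDTA code.

```python
import re

# Assumed patterns, based only on the formats mentioned in this thread.
COORDINATE_NAME = re.compile(r"^\S+:\d+\.\.\d+")  # e.g. Chr1:100..200
FAMILY_NAME = re.compile(r"^TE_\d+")              # e.g. TE_00000123_INT


def classify_name(name):
    """Assign a TE name to one of the categories discussed above."""
    if name.startswith("panTE_"):
        return "pan-TE library"
    if COORDINATE_NAME.match(name):
        return "single-copy (coordinate)"
    if FAMILY_NAME.match(name):
        return "per-genome family"
    return "other"
```

If the pan-TE library itself uses plain `TE_` names, the first and third categories cannot be separated this way, which is the ambiguity described above.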

Shujun

joannarifkin commented 1 year ago

Thanks for keeping me updated!

I'm not entirely sure I follow. In my most recent run they do seem to be renamed panTE in genome_list_local.txt.panEDTA.TElib.fa (sample from "grep '>'" below) and are in the species-specific gffs from RepeatMasker, but the new names don't make it all the way into the final [genome].fa.mod.EDTA.[intact|TEanno].gff3. But it sounds like the problem is that in each individual genome, they're being renamed in a way where panEDTA is getting them confused between genomes?

Let me know if you need anything from me!

All the best,

Joanna

panTE_00001958_LTR#LTR/Copia
panTE_00001959#DNA/DTM
panTE_00001960_LTR#LTR/Copia
panTE_00001961_LTR#LTR/Copia
panTE_00001962_INT#LTR/Gypsy
panTE_00001963_INT#LTR/Copia
panTE_00001964_LTR#LTR/Gypsy
panTE_00001965_LTR#LTR/Copia
panTE_00001966_LTR#LTR/unknown

joannarifkin commented 2 months ago

Hi Shujun,

Just circling back to this after working on some other projects. Would it work to just rename everything based on the panEDTA library, like in this question? https://github.com/oushujun/EDTA/issues/205

If so, should I just run that command with the genomes.txt.panEDTA.TElib.fa file for each genome individually?

Thanks!

Joanna

oushujun commented 1 month ago

Yes, using a panTE library for individual EDTA runs is effectively the pan-genome approach. You may want to rename the sequences in your panTE library to distinguish them from novel TEs identified in each EDTA run. These novel TEs are likely to be the remaining repetitive sequences and may not be of high quality, so consider removing them.
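One possible way to do the renaming is to add a prefix to every FASTA header in the library before passing it to the individual runs. This is a sketch under assumptions: the `pan` prefix and the function name are illustrative choices, and any other file that refers to the old names would need the same change.

```python
def add_prefix_to_headers(fasta_lines, prefix="pan"):
    """Yield FASTA lines, prepending `prefix` to each header that lacks it.

    Sequence lines pass through unchanged. Headers that already start
    with the prefix are left alone, so the function is safe to re-run.
    """
    for line in fasta_lines:
        if line.startswith(">") and not line.startswith(">" + prefix):
            yield ">" + prefix + line[1:]
        else:
            yield line


# Example usage (file names are placeholders):
# with open("library.fa") as src, open("library.renamed.fa", "w") as dst:
#     dst.writelines(add_prefix_to_headers(src))
```

With the default prefix, a header such as `>TE_00000001_LTR#LTR/Copia` becomes `>panTE_00000001_LTR#LTR/Copia`, which keeps library entries distinct from any `TE_` names assigned in a later run.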

Shujun

joannarifkin commented 1 month ago

Thanks! I can just remove them from the gff for downstream analyses. Thanks again for the help!
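For the GFF filtering, a minimal sketch could keep only the rows whose name matches an entry in the pan-TE library. It assumes the family name sits in a `Name=` attribute in column 9 and matches the FASTA header text before the `#`; both are assumptions about the file layout, and the function names are hypothetical. Rows with coordinate-style names would also be dropped by this exact-match rule, so it may need loosening.

```python
def library_names(fasta_lines):
    """Collect sequence names from FASTA headers, dropping the '#class' suffix."""
    return {
        line[1:].split("#", 1)[0].strip()
        for line in fasta_lines
        if line.startswith(">")
    }


def keep_library_rows(gff_lines, names):
    """Yield comment lines plus feature rows whose Name= value is in `names`."""
    for line in gff_lines:
        if line.startswith("#"):
            yield line  # keep GFF3 header and comment lines
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # skip malformed or blank rows
        name = ""
        for field in columns[8].split(";"):
            if field.startswith("Name="):
                name = field[len("Name="):]
        if name in names:
            yield line
```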

oushujun / EDTA

Structural TEs appear not to be renamed by panEDTA #397