Open joannarifkin opened 8 months ago
Hi again,
Any sense of what's going on here?
Thanks!
Joanna
Hello again,
Just wondering whether you had any thoughts about what might be wrong here. I tried running just the reannotation command (perl $path/EDTA.pl --genome $genome -t $threads --step final --anno 1 --curatedlib $genome_list.panEDTA.TElib.fa --cds $cds_ind --rmout $genome.mod.panEDTA.out done < $genome_list) separately, but it didn't solve the problem.
Thanks,
Joanna
Hi Joanna,
Sorry about the long delay. I haven't had the chance to investigate this issue yet, but it's on my to-do list.
Thank you! Shujun
Hi Shujun,
Thanks! I know you're swamped and there's a long backlog. Sorry to pester and I look forward to hearing what the solution is when you have a chance to get to the bottom of it.
All the best,
Joanna
One thing I noticed is that there is no "panTE" naming in the pan-TE library; the entries were all named in the regular format, i.e., TE_000xxxxx. TEs in $genome_list.panEDTA.TElib.fa were used to rename structural TEs, but not all of them: single-copy TEs are named by their coordinates (Chrx:xxx..xxx), and multi-copy TEs that do not have enough full-length copies in the genome remain named TE_000xxxxx. The latter is a problem for the pan-TE library because both use the same name format. I checked several dozen genomes, and the fraction of the genome in the second category is around 0.5% to 2.8%. I think this is an acceptable level, but I will need to rename these TEs so that they are distinguishable from the pan-TE library entries.
Shujun
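For anyone wanting to see how their own annotations split across the three naming schemes described above, here is a minimal Python sketch. The regular expressions are assumptions inferred only from the name shapes mentioned in this thread (panTE_…, TE_000xxxxx, Chrx:xxx..xxx); the function and category names are hypothetical and are not part of EDTA.

```python
import re
from collections import Counter

# Patterns are assumptions based on the name shapes quoted in this thread.
PAN_RE = re.compile(r"^panTE_\d+")               # e.g. panTE_00001958_LTR
LOCAL_RE = re.compile(r"^TE_\d+")                # e.g. TE_00000012_INT
COORD_RE = re.compile(r"^[^:\s]+:\d+\.\.\d+")    # e.g. Chr1:100..2000


def classify_te_name(name):
    """Return a label for the naming scheme a TE name appears to follow."""
    if PAN_RE.match(name):
        return "pan_library"
    if LOCAL_RE.match(name):
        return "local_family"
    if COORD_RE.match(name):
        return "coordinate"
    return "other"


def tally_names(names):
    """Count how many names fall into each naming scheme."""
    return Counter(classify_te_name(name) for name in names)
```

Calling `tally_names` on the names pulled from a GFF3 file gives a quick picture of how many entries still carry the TE_000xxxxx format that collides with the library names.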
Thanks for keeping me updated!
I'm not entirely sure I follow. In my most recent run they do seem to be renamed panTE in genome_list_local.txt.panEDTA.TElib.fa (sample from "grep '>'" below) and are in the species-specific gffs from RepeatMasker, but the new names don't make it all the way into the final [genome].fa.mod.EDTA.[intact|TEanno].gff3. But it sounds like the problem is that in each individual genome, they're being renamed in a way where panEDTA is getting them confused between genomes?
Let me know if you need anything from me!
All the best,
Joanna
panTE_00001958_LTR#LTR/Copia
panTE_00001959#DNA/DTM
panTE_00001960_LTR#LTR/Copia
panTE_00001961_LTR#LTR/Copia
panTE_00001962_INT#LTR/Gypsy
panTE_00001963_INT#LTR/Copia
panTE_00001964_LTR#LTR/Gypsy
panTE_00001965_LTR#LTR/Copia
panTE_00001966_LTR#LTR/unknown
Hi Shujun,
Another question. I'm running panEDTA on a bunch of species. It says it's successfully reannotating the structurally annotated TEs, e.g.:
In the output, both Cviolacea_585_v2.0.fa.mod.EDTA.TElib.fa and genome_list_local.txt.panEDTA.TElib.fa include numerous sequences headed "panTE," but in Cviolacea_585_v2.0.fa.mod.EDTA.intact.gff3 no TEs are annotated with the heading "panTE." Similarly, if I filter Cviolacea_585_v2.0.fa.mod.EDTA.TEanno.gff3 for method=structural, no TEs are annotated as "panTE."
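The check described above can be scripted. The following is a rough Python sketch that counts structural entries in a GFF3 file by whether their name starts with "panTE". It assumes the ninth column holds `key=value` pairs separated by semicolons and that the relevant keys are `Name` and `method` (matched case-insensitively); those key names are assumptions about the file layout, and the function names are hypothetical.

```python
from collections import Counter


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict with lowercased keys."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip().lower()] = value.strip()
    return attrs


def count_structural_names(lines):
    """Count structural entries named panTE* versus everything else."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and GFF3 comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        attrs = parse_attributes(cols[8])
        # Assumption: structural entries are tagged method=structural.
        if attrs.get("method") != "structural":
            continue
        name = attrs.get("name", "")
        counts["panTE" if name.startswith("panTE") else "other"] += 1
    return counts


# Usage (file name taken from this thread):
# with open("Cviolacea_585_v2.0.fa.mod.EDTA.TEanno.gff3") as handle:
#     print(count_structural_names(handle))
```

A "panTE" count of zero alongside a non-zero "other" count would match the symptom reported here.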
The error log repeats this message many times for each genome:
I assume this is where the problem is coming from?
This seems to have happened to all the genomes I included, and appears to be just a problem with updating the names. What information would help you solve this?
Thanks!
Joanna