Open joannarifkin opened 1 year ago
Hi again,
Any sense of what's going on here?
Thanks!
Joanna
Hello again,
Just wondering whether you had any thoughts about what might be wrong here. I tried running just the reannotation command (perl $path/EDTA.pl --genome $genome -t $threads --step final --anno 1 --curatedlib $genome_list.panEDTA.TElib.fa --cds $cds_ind --rmout $genome.mod.panEDTA.out done < $genome_list) separately, but it didn't solve the problem.
Thanks,
Joanna
Hi Joanna,
Sorry about the long delay. I haven't had the chance to investigate this issue yet, but it's on my to-do list.
Thank you! Shujun
Hi Shujun,
Thanks! I know you're swamped and there's a long backlog. Sorry to pester and I look forward to hearing what the solution is when you have a chance to get to the bottom of it.
All the best,
Joanna
-- Joanna Rifkin PhD they/them Computational biologist
One thing I noticed is that there is no "panTE" naming in the pan-TE library. They were all named regularly, i.e., TE_000xxxxx. TEs in $genome_list.panEDTA.TElib.fa were used to rename structural TEs, but not all of them: single-copy TEs are named Chrx:xxx..xxx (their coordinates), and multi-copy TEs that do not have enough full-length copies in the genome remain named TE_000xxxxx. The latter is a problem for the pan-TE library because both use the same name format. I checked several dozen genomes, and the fraction of the genome in the second category is around 0.5% to 2.8%. I think this is an acceptable level, but I will need to change their names to something distinguishable from the pan-TE library entries.
Shujun
Thanks for keeping me updated!
I'm not entirely sure I follow. In my most recent run they do seem to be renamed panTE in genome_list_local.txt.panEDTA.TElib.fa (sample from "grep '>'" below) and are in the species-specific gffs from RepeatMasker, but the new names don't make it all the way into the final [genome].fa.mod.EDTA.[intact|TEanno].gff3. But it sounds like the problem is that in each individual genome, they're being renamed in a way where panEDTA is getting them confused between genomes?
Let me know if you need anything from me!
All the best,
Joanna
panTE_00001958_LTR#LTR/Copia
panTE_00001959#DNA/DTM
panTE_00001960_LTR#LTR/Copia
panTE_00001961_LTR#LTR/Copia
panTE_00001962_INT#LTR/Gypsy
panTE_00001963_INT#LTR/Copia
panTE_00001964_LTR#LTR/Gypsy
panTE_00001965_LTR#LTR/Copia
panTE_00001966_LTR#LTR/unknown
Hi Shujun,
Just circling back to this after working on some other projects. Would it work to just rename everything based on the panEDTA library, like in this question? https://github.com/oushujun/EDTA/issues/205
If so, should I just run that command with the genomes.txt.panEDTA.TElib.fa file for each genome individually?
Thanks!
Joanna
Yes, using a panTE library for individual EDTA runs is effectively the pan-genome approach. You may want to rename the sequences in your panTE library to distinguish them from novel TEs identified in each EDTA run. These novel TEs are likely to be the remaining repetitive sequences and may not be of high quality, so consider removing them.
Shujun
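For readers who want to try the renaming suggested above, here is a minimal Python sketch. It is not part of EDTA. The function name `rename_headers`, the `TE_` to `panTE_` prefix choice, and the toy records are assumptions made for illustration only; the real library file would be read line by line instead of the in-memory list used here.

```python
def rename_headers(fasta_lines, old_prefix="TE_", new_prefix="panTE_"):
    """Yield FASTA lines, renaming headers that start with old_prefix.

    Hypothetical helper: only header lines of the form '>TE_...' are
    changed; sequence lines and other headers pass through untouched.
    """
    for line in fasta_lines:
        if line.startswith(">" + old_prefix):
            # Keep everything after the old prefix (ID, '#class/superfamily').
            yield ">" + new_prefix + line[1 + len(old_prefix):]
        else:
            yield line


# Toy records (made up for illustration, not real EDTA output).
example = [
    ">TE_00001958_LTR#LTR/Copia\n",
    "ACGTACGT\n",
    ">panTE_00001959#DNA/DTM\n",
    "TTGACA\n",
]
renamed = list(rename_headers(example))
print("".join(renamed), end="")
```

Headers that already carry the new prefix are left alone, so running the sketch twice on the same file should not stack prefixes.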
Thanks! I can just remove them from the gff for downstream analyses. Thanks again for the help!
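Removing the novel TEs from a GFF3 for downstream analyses could look roughly like the Python sketch below. It assumes that column 9 carries a `Name=` attribute and that novel, per-genome TEs are the ones whose name starts with `TE_` (as opposed to `panTE_` or a coordinate-style name); both points should be checked against the actual files. The function name `drop_novel` and the toy lines are hypothetical.

```python
def drop_novel(gff_lines, novel_prefix="TE_"):
    """Yield GFF3 lines, skipping features whose Name starts with novel_prefix.

    Assumption: attributes in column 9 are 'key=value' pairs separated
    by ';' and include a Name key. Comment and malformed lines are kept.
    """
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) < 9:
            yield line
            continue
        attrs = {}
        for field in cols[8].split(";"):
            if "=" in field:
                key, value = field.split("=", 1)
                attrs[key.strip().lower()] = value.strip()
        if not attrs.get("name", "").startswith(novel_prefix):
            yield line


# Toy GFF3 lines (made up for illustration, not real EDTA output).
toy_gff = [
    "##gff-version 3\n",
    "Chr1\tEDTA\tCopia_LTR_retrotransposon\t100\t900\t.\t+\t.\tID=a;Name=panTE_00001958_LTR;Method=homology\n",
    "Chr1\tEDTA\tGypsy_LTR_retrotransposon\t1500\t2500\t.\t-\t.\tID=b;Name=TE_00000042_INT;Method=homology\n",
    "Chr2\tEDTA\thelitron\t10\t400\t.\t+\t.\tID=c;Name=Chr2:10..400;Method=structural\n",
]
kept = list(drop_novel(toy_gff))
print("".join(kept), end="")
```

Matching on the `TE_` prefix rather than keeping only `panTE_` names means coordinate-named single-copy TEs stay in the output; whether that is desirable depends on the downstream analysis.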
Hi Shujun,
Another question. I'm running panEDTA on a bunch of species. It says it's successfully reannotating the structurally annotated TEs, e.g.:
In the output, both Cviolacea_585_v2.0.fa.mod.EDTA.TElib.fa and genome_list_local.txt.panEDTA.TElib.fa include numerous sequences headed "panTE," but in Cviolacea_585_v2.0.fa.mod.EDTA.intact.gff3 no TEs are annotated with the heading "panTE." Similarly, if I filter Cviolacea_585_v2.0.fa.mod.EDTA.TEanno.gff3 for method=structural, no TEs are annotated as "panTE."
The error log repeats this message many times for each genome:
I assume this is where the problem is coming from?
This seems to have happened to all the genomes I included, and appears to be just a problem with updating the names. What information would help you solve this?
Thanks!
Joanna
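The check described above (filtering the annotation for structural entries and looking for "panTE" names) can be sketched in Python as follows. This is an illustration only: it assumes column 9 has `Name=` and `Method=` attributes (matched case-insensitively, since the thread writes "method=structural"), and the function name `name_prefix_counts` and the toy lines are hypothetical.

```python
from collections import Counter


def name_prefix_counts(gff_lines, method="structural"):
    """Count Name prefixes among GFF3 features with the given method."""
    counts = Counter()
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) < 9:
            continue  # skip comments and malformed lines
        attrs = {}
        for field in cols[8].split(";"):
            if "=" in field:
                key, value = field.split("=", 1)
                attrs[key.strip().lower()] = value.strip()
        if attrs.get("method", "").lower() != method:
            continue
        name = attrs.get("name", "")
        if name.startswith("panTE_"):
            counts["panTE"] += 1
        elif name.startswith("TE_"):
            counts["TE"] += 1
        else:
            counts["other"] += 1  # e.g. coordinate-style names
    return counts


# Toy GFF3 lines (made up for illustration, not real EDTA output).
toy_gff = [
    "##gff-version 3\n",
    "Chr1\tEDTA\tCopia_LTR_retrotransposon\t100\t900\t.\t+\t.\tID=a;Name=TE_00000007_LTR;Method=structural\n",
    "Chr1\tEDTA\thelitron\t1500\t2500\t.\t-\t.\tID=b;Name=Chr1:1500..2500;Method=structural\n",
    "Chr2\tEDTA\tGypsy_LTR_retrotransposon\t10\t400\t.\t+\t.\tID=c;Name=panTE_00001962_INT;Method=homology\n",
]
counts = name_prefix_counts(toy_gff)
print(dict(counts))
```

A result with zero `panTE` entries among structural features would match the symptom reported in this issue; it does not by itself say where in the pipeline the renaming was lost.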