oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

panEDTA path specification #375

Closed joannarifkin closed 3 months ago

joannarifkin commented 11 months ago

Hi Shujun,

Thanks as always for EDTA!

I think this is a silly question - I'm having trouble figuring out how to specify the paths for the genomes I'm putting into panEDTA so the existing annotations will be found rather than rerun. I've tried both putting the original genome paths in the genomes list file and copying the TE annotations to the directory where I'm running panEDTA but I don't think I did it right.

Thanks!

Joanna

oushujun commented 11 months ago

Hi Joanna,

I recently updated panEDTA; please try out this new version and let me know how it goes. Simply unzip the attachment and replace the original panEDTA.sh script in the EDTA folder. The help info of the script is also updated.

panEDTA.sh.zip

Thanks! Shujun

joannarifkin commented 10 months ago

Hi Shujun,

Thanks! I have the new version running, and I'm getting the following error:

ERROR: Raw LTR results not found in Crubella_474_v1_names_shortened.1.cds.fa.mod.EDTA.raw/Crubella_474_v1_names_shortened.1.cds.fa.mod.LTR.raw.fa
If you believe the program is working properly, this may be caused by the lack of intact LTRs in your genome. Consider to use the --force 1 parameter to overwrite this check
ERROR: Initial EDTA failed for Crubella_474_v1_names_shortened.1.cds.fa

It seems to fail on annotating one of the CDS files, which it wasn't doing before. Is this expected? I've attached my genome paths file - this cds also created a bunch of line length warnings in the updated but not new version so I'm wondering if there's something up with spacers or EOL encoding in my input.

Thanks!

Joanna

genome_list_paths_updated.txt

oushujun commented 10 months ago

Hi Joanna,

EDTA is not designed to annotate TEs in CDS files. You need to provide the whole genome to the program, and you may use the CDS file to facilitate the removal of genes in the TE annotation.

Thanks, Shujun

joannarifkin commented 10 months ago

Hi Shujun,

Thanks!

I know it's not supposed to be trying to annotate the TEs in the CDS - I don't understand why it is trying to do so. When I ran it before, it correctly interpreted the CDS paths in the second column as facilitating gene removal, but now the error messages suggest the CDS files are being read in as genomes to annotate instead.

Cheers,

Joanna

-- Joanna Rifkin PhD they/them Postdoctoral fellow at the University of Michigan

joannarifkin commented 10 months ago

Hi again!

I tried rerunning it with the genomes + CDS list in a different order and got this error:

Wed Aug 9 15:24:54 EDT 2023 ERROR: Fail to convert seq IDs to <= 13 characters! Please provide a genome with shorter seq IDs.
ERROR: Initial EDTA failed for Cviolacea_585_v2.1.cds_primaryTranscriptOnly.fa

So it definitely appears to be trying to run EDTA on the CDS files rather than recognizing them as CDS files with sequence to exclude.

Cheers,

Joanna

oushujun commented 10 months ago

Hi Joanna,

This appears to be a bug; sorry for the trouble. A colleague is currently testing it. For the moment, if you substitute CDS files from closely related species for the genomes that lack their own CDS, so that the genome list file has two complete columns, that should bypass the problem.

Thanks, Shujun

joannarifkin commented 10 months ago

Hi Shujun,

Gotcha. So instead of:

genome
genome cds
genome cds

I should have

genome [related cds]
genome cds
genome cds

Right?

Would it be better to just fill in the Arabidopsis CDS for all the species without their own CDS, or try to find something closer? (They're all in the Brassicaceae.)

Thanks!

Joanna

oushujun commented 10 months ago

Yes, that's correct. Try something closer if it has good gene annotation quality; otherwise, Arabidopsis works perfectly fine.

Shujun
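The "two complete columns" workaround discussed above can be checked mechanically before launching a run. The following is a minimal sketch, not part of EDTA: the function name `check_two_columns` is made up for illustration, and it assumes the list file is whitespace-separated with the genome path first and the CDS path second, as in the example in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of EDTA): report every non-empty line of a
# genome list that does not have exactly two whitespace-separated columns.
check_two_columns() {
    local list_file="$1"
    # Print the offending lines and exit non-zero if any were found.
    awk 'BEGIN { bad = 0 }
         NF > 0 && NF != 2 {
             printf "line %d has %d column(s): %s\n", NR, NF, $0
             bad = 1
         }
         END { exit bad }' "$list_file"
}
```

For example, `check_two_columns genome_list_paths_updated.txt` prints nothing and returns success when every line has both a genome and a CDS entry. It only counts columns; it does not check that the paths exist.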

joannarifkin commented 10 months ago

Excellent, thanks!

joannarifkin commented 10 months ago

Hi again!

I tried just using the Arabidopsis CDS and seem to get the same issue:

Error: Error while loading sequence
perl make_bed_with_intact.pl EDTA.intact.fa > EDTA.intact.bed

Wed Aug 9 17:47:28 EDT 2023 Warning: The Helitron result file has 0 bp!

Wed Aug 9 17:47:28 EDT 2023 Execution of EDTA_raw.pl is finished!

ERROR: Raw LTR results not found in Araport11_cds_20220914.mod.EDTA.raw/Araport11_cds_20220914.mod.LTR.raw.fa
If you believe the program is working properly, this may be caused by the lack of intact LTRs in your genome. Consider to use the --force 1 parameter to overwrite this check
ERROR: Initial EDTA failed for Araport11_cds_20220914

I can focus on some other projects until your colleague has tracked down the bug. Let me know if any of my full logs or commands will be helpful for the troubleshooting process!

Cheers,

Joanna

oushujun commented 10 months ago

Hi Joanna,

The easiest "fix" is to run the code with bash, not sh, zsh or other shell variants: bash panEDTA.sh ...

Let me know if you still have trouble running it. I will also try to update the code and make it more adaptive to shell variants.

Thanks! Shujun

joannarifkin commented 10 months ago

Hi Shujun,

Thanks, I'll give that a try.

Cheers,

Joanna

joannarifkin commented 10 months ago

Hi Shujun,

I've tried submitting the job with "bash" rather than "sh" but it doesn't seem to change the main problem, where it's trying to annotate the CDS file. Here's some output from the log file:

Fri Aug 18 11:47:52 EDT 2023
Pan-genome Extensive de-novo TE Annotator (panEDTA)
Output directory: /nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/panEDTA/comparative_brassicaceae_run_8-7-2023
Genome files: genome_list_paths_updated.txt
Coding sequences: ../Araport11_cds_20220914
Curated library:
Copy number cutoff: 3
CPUs: 16
Fri Aug 18 11:47:52 EDT 2023
De novo annotate genome Araport11_cds_20220914 with EDTA
########################################################
##### Extensive de-novo TE Annotator (EDTA) v2.1.3  ####
##### Shujun Ou (@.***)                             ####
########################################################
Fri Aug 18 11:47:55 EDT 2023 Dependency checking: All passed!
Fri Aug 18 11:48:01 EDT 2023 The longest sequence ID in the genome contains 378 characters, which is longer than the limit (13)
Trying to reformat seq IDs...
Attempt 1...
Fri Aug 18 11:48:02 EDT 2023 Seq ID conversion successful!
A CDS file Araport11_cds_20220914 is provided via --cds. Please make sure this is the DNA sequence of coding regions only.

This is the genome list:

/nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Euclidium_genome/GCA_900116095.1_Euclidium_syriacum.MPIPZ.v1_genomic.fna ../Araport11_cds_20220914
/nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/C_violacea/Cviolacea_585_v2.0.fa /nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/C_violacea/Cviolacea_585_v2.1.cds_primaryTranscriptOnly.fa
/nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Crubella/Crubella_474_v1.fa /nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Crubella/Crubella_474_v1_names_shortened.1.cds.fa
/nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Brapa/BrapaFPsc_277_v1.fa /nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Brapa/BrapaFPsc_277_v1.3.cds_primaryTranscriptOnly.fa
/nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Alyrata/Alyrata_384_v1.fa /nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Alyrata/Alyrata_384_v2.1.cds_primaryTranscriptOnly.fa
/nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/M_pygmaea/M_pygmaea_names_fixed.genome.fasta ../Araport11_cds_20220914
/nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Aalpina/Arabis_alpina.MPIPZ.version_5.1.chr.all.fasta ../Araport11_cds_20220914
/nfs/turbo/rsbaucom/lab/Hesperis_Dovetail/Hi-Rise_Assembly_September_2022/EDTA_TE_annotation/Hesperis_assembly.fasta /nfs/turbo/rsbaucom/lab/Hesperis_Dovetail/Hi-Rise_Assembly_September_2022/BRAKER3_gene_annotation/RNA_protein/braker/braker.codingseq
/nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Dsophioides/Dsophioides_482_v1_short_names.fa /nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Dsophioides/Dsophioides_482_v1.1.cds_primaryTranscriptOnly.fa
/nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Dstrictus/Dstrictus_582_v2.0.fa /nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Dstrictus/Dstrictus_582_v2.1.cds_primaryTranscriptOnly.fa
/nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Mperfoliatum/Mperfoliatum_583_v2.0.fa /nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Mperfoliatum/Mperfoliatum_583_v2.1.cds_primaryTranscriptOnly.fa

As suggested, each genome has the genome first and either its own CDS or the Arabidopsis CDS on the same line.

Here's the command I ran:

source /home/jlrifkin/setup_conda.sh
conda activate EDTA
bash /nfs/turbo/rsbaucom/lab/SOFTWARE/EDTA/panEDTA.sh -g genome_list_paths_updated.txt -c ../Araport11_cds_20220914 -t 16 -f 3

I tried removing the -c option, but that just throws an error (Failed to parse command line / line 105: [: !=: unary operator expected Option cds requires an argument ERROR: Initial EDTA failed for Araport11_cds_20220914). It really seems like it's trying to annotate the CDS rather than using it as CDS.

Thanks!

Joanna

oushujun commented 10 months ago

Hi Joanna,

Sorry for the delay. This version passes tests on my end with bash; please try it out and let me know how it goes, and whether you have any suggestions. Thank you!

panEDTA.sh.txt https://github.com/oushujun/EDTA/files/12392139/panEDTA.sh.txt

Shujun

joannarifkin commented 10 months ago

Hi Shujun,

Good news! It's not trying to annotate TEs in CDS any more.

It doesn't seem to be able to locate a CDS whose path is outside the directory it's being run in, so I put local symlinks to every CDS there. Not a problem really, but perhaps something to know about: if I used the actual, functioning path, it said the CDS didn't exist, but if I made a symlink in the directory I'm running it in, it was fine.

I'll keep you updated!

Thanks for helping troubleshoot!

All the best,

Joanna
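The symlink workaround Joanna describes can be scripted rather than done by hand. The following is a rough sketch under stated assumptions: `link_cds_locally` is a made-up name, not an EDTA tool, and it assumes the genome list has the CDS path in its second column. It links each CDS into a working directory under its base name and skips any CDS that already lives there.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of EDTA): for every CDS path in column 2 of a
# genome list, create a symlink with the same base name in a working directory.
link_cds_locally() {
    local list_file="$1" work_dir="${2:-.}"
    local genome cds src_dir dest_dir name
    dest_dir="$(cd "$work_dir" && pwd -P)"
    while read -r genome cds _ || [ -n "$genome" ]; do
        # Skip lines that have no CDS column.
        if [ -z "$cds" ]; then continue; fi
        if [ ! -e "$cds" ]; then
            echo "missing CDS: $cds" >&2
            continue
        fi
        # Resolve the CDS directory to an absolute physical path.
        src_dir="$(cd "$(dirname "$cds")" && pwd -P)"
        name="$(basename "$cds")"
        # Never replace a file with a link to itself.
        if [ "$src_dir" = "$dest_dir" ]; then continue; fi
        ln -sfn "$src_dir/$name" "$dest_dir/$name"
    done < "$list_file"
}
```

Running `link_cds_locally genome_list_paths_updated.txt .` from the run directory would create the links. Presumably the second column of the list then has to name the local links rather than the original paths; the thread does not spell that out, so treat it as an assumption.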

joannarifkin commented 9 months ago

Hi again,

I think this is still possibly an issue: when I try to run panEDTA with symlinks to completed EDTA runs in other directories, it makes this non-functioning symlink for every genome:

Crubella_474_v1.fa.mod.EDTA.TElib.novel.fa -> /nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Crubella/Crubella_474_v1.fa.mod.EDTA.TElib.novel.fa

It appears to complete, but it doesn't actually reannotate the genomes with the panEDTA library; instead it uses the same TEs as before, and it prints the following error many times in the error log:

grep: BrapaFPsc_277_v1.fa.mod.EDTA.TElib.novel.fa: No such file or directory

If I run the same set of genomes locally from scratch, I don't have this problem. I haven't tried copying the previous runs into the same directory where I'm running panEDTA.

I'm just rerunning it with all the genomes in the same place, but was hoping to avoid doing that to save time (and because one of the genomes is large and highly repetitive).

Let me know if I can help debug this with any additional data.

Thanks!

Joanna
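The symptom reported here, a symlink that points nowhere followed by grep "No such file or directory" errors, can be checked for directly in the run directory. Below is a minimal sketch using only standard shell tests; the function name `list_broken_links` is made up for illustration and is not part of EDTA. It shows which links are dangling, not why panEDTA created them.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of EDTA): list symlinks in a directory whose
# targets cannot be reached.
list_broken_links() {
    local dir="${1:-.}" f
    for f in "$dir"/*; do
        # -L is true for a symlink; -e follows the link and fails if the
        # target is missing, so the combination identifies a dangling link.
        if [ -L "$f" ] && [ ! -e "$f" ]; then
            printf '%s -> %s\n' "$f" "$(readlink "$f")"
        fi
    done
}
```

Calling `list_broken_links .` in the panEDTA output directory prints one line per dangling link, in the same `name -> target` form as the listing quoted above, and prints nothing if all links resolve.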

oushujun commented 4 months ago

Hi Joanna,

I finally got this updated. Can you please update panEDTA (or the entire EDTA repo) and try the symlinks again? I tested locally and now it works with either the sh or bash way of running it.

Shujun

joannarifkin commented 4 months ago

Hi Shujun,

Thanks! I'm just running panEDTA and doing everything de novo in sequence, and that seems to be working fine. I figured since everything needed to be redone with EDTA2 anyway I could just do it the slow way. But I'll try the updated version next time and I'm excited about the efficiency!

Cheers,

Joanna

-- Joanna Rifkin PhD they/them Computational biologist

oushujun commented 3 months ago

If the issue is resolved, I will close this thread. Please reopen or open a new thread if you have different issues.

Thank you for your patience! Shujun

joannarifkin commented 3 months ago

Great, thank you!
