oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

EDTA fails after REXdb and before the classifying pipeline #242

Closed MafaldaSFerreira closed 2 years ago

MafaldaSFerreira commented 2 years ago

Hi,

EDTA is failing right when the classifying pipeline starts. It gives me a few errors and fails. There are so many errors that I can't tell exactly what the problem is, but it seems it can't find a few files.

...
2021-12-01 03:23:07,177 -INFO- Start classifying pipeline
Traceback (most recent call last):
  File "/home/mafaldaf/.conda/envs/EDTA/bin/TEsorter", line 10, in <module>
    sys.exit(main())
  File "/home/mafaldaf/.conda/envs/EDTA/lib/python3.6/site-packages/TEsorter/app.py", line 1014, in main
    pipeline(Args())
  File "/home/mafaldaf/.conda/envs/EDTA/lib/python3.6/site-packages/TEsorter/app.py", line 155, in pipeline
    seq_num = len([1 for rc in SeqIO.parse(args.sequence, 'fasta')])
  File "/home/mafaldaf/.conda/envs/EDTA/lib/python3.6/site-packages/TEsorter/app.py", line 155, in <listcomp>
    seq_num = len([1 for rc in SeqIO.parse(args.sequence, 'fasta')])
  File "/home/mafaldaf/.conda/envs/EDTA/lib/python3.6/site-packages/Bio/SeqIO/Interfaces.py", line 73, in __next__
    return next(self.records)
  File "/home/mafaldaf/.conda/envs/EDTA/lib/python3.6/site-packages/Bio/SeqIO/FastaIO.py", line 198, in iterate
    for title, sequence in SimpleFastaParser(handle):
  File "/home/mafaldaf/.conda/envs/EDTA/lib/python3.6/site-packages/Bio/SeqIO/FastaIO.py", line 47, in SimpleFastaParser
    for line in handle:
  File "/home/mafaldaf/.conda/envs/EDTA/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa8 in position 3: invalid start byte
cat: Clupea_harengus.Ch_v2.0.2.cds.all.fa.gz.code.rexdb.cls.tsv: No such file or directory
awk: fatal: cannot open file `CS4.hap1.fa.mod.EDTA.raw.fa.out' for reading (No such file or directory)
awk: fatal: cannot open file `CS4.hap1.fa.mod.EDTA.raw.fa.out' for reading (No such file or directory)
                Warning: No TE-related CDS found (Clupea_harengus.Ch_v2.0.2.cds.all.fa.gz.code.TE empty). Will not use the self-cleaning step.

    Input file "CS4.hap1.fa.mod.EDTA.raw.fa.masked" not found!

        Usage: perl cleanup_tandem.pl -f sample.fa [options] > sample.cln.fa 
    Options:
        -misschar   [n|l]   Define the letter representing unknown sequences; default: n. l: recognize lower case letters
        -Nscreen    [0|1]   Enable (1) or disable (0) the -nc parameter; default: 1
        -nc     [int]   Ambuguous sequence len cutoff; discard the entire sequence if > this number; default: 0
        -nr     [0-1]   Ambuguous sequence percentage cutoff; discard the entire sequence if > this number; default: 1
        -minlen     [int]   Minimum sequence length filter after clean up; default: 100 (bp)
        -maxlen     [int]   Maximum sequence length filter after clean up; default: 25000 (bp)
        -cleanN     [0|1]   Retain (0) or remove (1) the -misschar taget in output sequence; default: 0
        -cleanT     [0|1]   Remove entire seq. if any terminal seq (20bp) has 15bp of N (1); disabled by default (0).
        -minrm      [int]   The minimum length of -misschar to be removed if -cleanN 1; default: 1.
        -trf        [0|1]   Enable (1) or disable (0) tandem repeat finder (trf); default: 1
        -trf_path   path    Path to the trf program

[edta_23341061_1.err.txt](https://github.com/oushujun/EDTA/files/7634286/edta_23341061_1.err.txt)

    Input file "CS4.hap1.fa.mod.EDTA.intact.fa.raw.masked" not found!

        Usage: perl cleanup_tandem.pl -f sample.fa [options] > sample.cln.fa 
    Options:
        -misschar   [n|l]   Define the letter representing unknown sequences; default: n. l: recognize lower case letters
        -Nscreen    [0|1]   Enable (1) or disable (0) the -nc parameter; default: 1
        -nc     [int]   Ambuguous sequence len cutoff; discard the entire sequence if > this number; default: 0
        -nr     [0-1]   Ambuguous sequence percentage cutoff; discard the entire sequence if > this number; default: 1
        -minlen     [int]   Minimum sequence length filter after clean up; default: 100 (bp)
        -maxlen     [int]   Maximum sequence length filter after clean up; default: 25000 (bp)
        -cleanN     [0|1]   Retain (0) or remove (1) the -misschar taget in output sequence; default: 0
        -cleanT     [0|1]   Remove entire seq. if any terminal seq (20bp) has 15bp of N (1); disabled by default (0).
        -minrm      [int]   The minimum length of -misschar to be removed if -cleanN 1; default: 1.
        -trf        [0|1]   Enable (1) or disable (0) tandem repeat finder (trf); default: 1
        -trf_path   path    Path to the trf program

Input not found!
ERROR: Final TE library not found in CS4.hap1.fa.mod.EDTA.TElib.fa at /proj/snic2020-2-19/private/herring/users/mafalda/software/EDTA/EDTA.pl line 569.

There are also a few more errors before this step, but they didn't cause EDTA to stop. In case those are important, I am attaching the entire .err file from the Slurm run: edta_23341061_1.err.txt https://github.com/oushujun/EDTA/files/7634298/edta_23341061_1.err.txt

This is how I call EDTA:

conda activate EDTA
module load bioinfo-tools RepeatMasker/4.1.0 RepeatModeler/2.0.1

GENOME=$(ls *.hap1.fa | sed -n ${SLURM_ARRAY_TASK_ID}p)

perl /proj/snic2020-2-19/private/herring/users/mafalda/software/EDTA/EDTA.pl --genome ${GENOME} --cds Clupea_harengus.Ch_v2.0.2.cds.all.fa.gz --overwrite 1 --sensitive 1 --anno 1 --evaluate 1 --threads 16 

I have tried to simply restart the run using --overwrite 0, but the run fails in the same way.

Thank you very much for your help, Mafalda
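The traceback above ends in a UnicodeDecodeError raised while Biopython reads the sequence file as UTF-8 text, so one thing worth ruling out is that an input file is not plain text. This is only an assumption, not something confirmed in this thread: the --cds file name ends in .gz, and a compressed or otherwise binary file read as text would fail in this way. A minimal sketch of such a check follows; the function name classify_input and its return labels are made up for illustration and are not part of EDTA or TEsorter.

```python
import codecs

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def classify_input(path, chunk_size=1 << 20):
    """Return 'gzip', 'not-utf8', or 'text' for the file at path.

    Hypothetical helper for illustration only; it is not part of EDTA.
    """
    with open(path, "rb") as handle:
        head = handle.read(2)
        if head == GZIP_MAGIC:
            return "gzip"
        # Decode incrementally so that multi-byte characters split across
        # chunk boundaries are not reported as errors.
        decoder = codecs.getincrementaldecoder("utf-8")()
        chunk = head
        try:
            while chunk:
                decoder.decode(chunk)
                chunk = handle.read(chunk_size)
            decoder.decode(b"", final=True)
        except UnicodeDecodeError:
            return "not-utf8"
    return "text"


if __name__ == "__main__":
    import sys

    # Usage: python check_input.py genome.fa cds.fa.gz ...
    for name in sys.argv[1:]:
        print(name, classify_input(name), sep="\t")
```

If a file is reported as gzip, decompressing it before passing it to the pipeline would be the obvious thing to try; whether that resolves this particular failure is not established here.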

oushujun commented 2 years ago

Hi Mafalda,

It looks like the program is not installed correctly. You may want to test it with the small test file; please see the README for more information.

Best, Shujun

MafaldaSFerreira commented 2 years ago

Thank you for the feedback Shujun.

I reinstalled EDTA using conda (following the instructions under "Other ways to install with conda...") without any issue and re-ran EDTA with the test data and my genome assembly. The test data works fine, although with a few warnings (I am attaching the log). The genome run fails again (log also attached) in the same spot with the same error, and I am still not sure where the issue is.

Test: EDTA_test_13122021.log https://github.com/oushujun/EDTA/files/7704092/EDTA_test_13122021.log

Genome: genome_edta_23602984_1.err.txt https://github.com/oushujun/EDTA/files/7704166/genome_edta_23602984_1.err.txt

This is how I am calling the genome run for reference:

conda activate EDTA
module load bioinfo-tools RepeatMasker/4.1.0 RepeatModeler/2.0.1

GENOME=$(ls *.hap1.fa | sed -n ${SLURM_ARRAY_TASK_ID}p)

EDTA.pl --genome ${GENOME} --cds Clupea_harengus.Ch_v2.0.2.cds.all.fa.gz --overwrite 1 --sensitive 1 --anno 1 --evaluate 1 --threads 20

Any ideas what the issue could be? It is strange: in the test run the "classifying pipeline" seems to run only once (I can see "Start classifying pipeline" only one time), but for the genome it is started twice and fails the second time.

Thank you in advance for any help, Mafalda
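The comparison described above (how many times "Start classifying pipeline" appears in each log) can be made reproducible with a few lines of Python. This is only a sketch: the helper name find_marker_lines is invented here, and the log file names in the usage note are the attachments mentioned in this comment.

```python
def find_marker_lines(log_path, marker="Start classifying pipeline"):
    """Return the 1-based line numbers in log_path that contain marker.

    Hypothetical helper for illustration; undecodable bytes in the log
    are replaced rather than raising an error.
    """
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        return [
            number
            for number, line in enumerate(handle, start=1)
            if marker in line
        ]
```

For example, find_marker_lines("EDTA_test_13122021.log") and find_marker_lines("genome_edta_23602984_1.err.txt") would show where each classifying run begins, which makes it easier to see what was logged just before the second, failing run.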

oushujun commented 2 years ago

You may not need to module load those tools. EDTA has RepeatMasker and RepeatModeler included in the conda package, so you can try running without the module load line.

Shujun
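To see whether the module load line changes which executables are picked up, one could check where each tool resolves on PATH and whether that location lies inside the active conda environment. The sketch below assumes that conda activate has set the CONDA_PREFIX environment variable; the function name locate_tools is made up for illustration.

```python
import os
import shutil


def locate_tools(tools, env_prefix=None):
    """Map each tool name to (resolved path or None, inside_env flag).

    Hypothetical helper for illustration. env_prefix defaults to the
    CONDA_PREFIX variable, assumed to be set by `conda activate`.
    """
    prefix = env_prefix or os.environ.get("CONDA_PREFIX", "")
    report = {}
    for tool in tools:
        found = shutil.which(tool)  # first match on PATH, or None
        inside = bool(found and prefix) and os.path.realpath(found).startswith(
            os.path.realpath(prefix) + os.sep
        )
        report[tool] = (found, inside)
    return report


if __name__ == "__main__":
    for name, (path, inside) in locate_tools(
        ["RepeatMasker", "RepeatModeler", "TEsorter"]
    ).items():
        print(name, path, "conda env" if inside else "outside conda env", sep="\t")
```

Running this once with and once without the module load line would show whether the cluster modules shadow the copies installed in the EDTA environment.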
