oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

SLURM-specific behavior: repeat content dramatically reduced but no errors #392

Open laramiemckenna opened 9 months ago

laramiemckenna commented 9 months ago

Our cluster moved from an LSF to a SLURM workload manager this year. I really enjoy using EDTA for our genome projects and ran it on a couple of assemblies while still working under LSF. Our total repeat content, as expected, was always in the same range as previous estimates from short-read data.

After the switch, I re-ran EDTA on those same assemblies after making some small changes (fixing some minor SVs here and there), with the same install, same version, same script, and same resources as before. All of a sudden, the total repeat content dropped substantially, and I can't find any indication of why in the log files or the .err/.out files. This pattern has held for multiple genomes across multiple accounts regardless of version or install type. Our IT Department looked into it and said it likely had something to do with how EDTA uses resources in SLURM, but could not find the root of the problem either. It was especially confusing that EDTA wasn't using all of the resources allocated to it.

I don't know how to address this or begin troubleshooting it. Do you have any idea what might be causing this behavior?

I've placed some examples from one of the genomes I'm working on below in case it helps. Again, there's nothing in the .log, .err, or .out files; according to those, the job completed successfully.

For this genome, the expected total repeat content is ~30-33% based on previous runs of EDTA and estimates with GenomeScope.

This was the first attempt, using the same script and install as before the LSF/SLURM switch (values in this range of 6-7% also occurred if I used the same resources, switched --ntasks to -c, and used Singularity instead):

#!/bin/sh
#SBATCH -e edta_test_%j.err
#SBATCH -o edta_test_%j.out
#SBATCH --job-name=edta_test
#SBATCH --time-min=120:00:00
#SBATCH --ntasks=25
#SBATCH --mem=80G
#SBATCH --partition=plant
#SBATCH --nodes=1

perl ~/mambaforge/envs/edta/bin/EDTA.pl --genome hap2_curated.FINAL.fasta --species others --anno 1 -t 25

and this is the SLURM output:

Job         1312854 (COMPLETED)
Name        edta_test
Submit      sbatch edta.sh
Nodes       plant - plant02
Input       /dev/null
Output      [path to]/edta_test_1312854.out
Error       [path to]/edta_test_1312854.err
Resources   CPU = 25 Memory = 81920
Start       2023-08-01 13:37:40
End         2023-08-01 18:03:34
Elapsed     265.9 minutes
Limit       28800 minutes
Exit Code       SUCCESS (0)

Usage:
min         CPU = 89437.26 sec (1 day, 0:50:37.26, 22.42 %)
min         Mem = 13133.449 MB (16.03 %)
max         CPU = 89437.26 sec (1 day, 0:50:37.26, 22.42 %)
max         Mem = 13133.449 MB (16.03 %)
average         CPU = 89437.26 sec (1 day, 0:50:37.26, 22.42 %)
average         Mem = 13133.449 MB (16.03 %)
total           CPU = 89437.26 sec (1 day, 0:50:37.26, 22.42 %)
total           Mem = 13133.449 MB (16.03 %)
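
A side note on reading the usage block above: the CPU percentage appears to be total CPU seconds divided by (allocated CPUs x elapsed wall-clock seconds). The sketch below reproduces the reported 22.42% under that assumption and also expresses it as an average number of busy cores. `cpu_efficiency` and `avg_busy_cores` are hypothetical helper names, not part of SLURM or EDTA.

```shell
# Percent of the allocated CPU time that was actually used.
# $1: CPU seconds used, $2: CPUs allocated, $3: elapsed wall-clock minutes
cpu_efficiency() {
    LC_ALL=C awk -v cpu="$1" -v n="$2" -v min="$3" \
        'BEGIN { printf "%.2f\n", 100 * cpu / (n * min * 60) }'
}

# Average number of cores kept busy over the whole run.
# $1: CPU seconds used, $2: elapsed wall-clock minutes
avg_busy_cores() {
    LC_ALL=C awk -v cpu="$1" -v min="$2" \
        'BEGIN { printf "%.2f\n", cpu / (min * 60) }'
}

# Values taken from job 1312854 above:
cpu_efficiency 89437.26 25 265.9   # prints 22.42
avg_busy_cores 89437.26 265.9      # prints 5.61
```

That works out to roughly 5.6 of the 25 requested cores being busy on average over the run.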

and here's the EDTA output

Repeat Classes
==============
Total Sequences: 9
Total Length: 298741932 bp
Class                  Count        bpMasked     %masked
=====                  =====        ========     =======
LTR                    --           --           --
    Copia              5290         7521628      2.52%
    Gypsy              2279         2886915      0.97%
    unknown            2123         1271613      0.43%
TIR                    --           --           --
    CACTA              4085         2304875      0.77%
    Mutator            6566         2950914      0.99%
    PIF_Harbinger      1010         424154       0.14%
    Tc1_Mariner        144          83628        0.03%
    hAT                3496         2342891      0.78%
nonTIR                 --           --           --
    helitron           1595         830065       0.28%
                      ---------------------------------
    total interspersed 26588        20616683     6.90%

---------------------------------------------------------
Total                  26588        20616683     6.90%

Even though the job wasn't using all of the memory provided to it, I wondered whether resource allocation was the problem. After gradually increasing the allocation (especially cpus-per-task), I was able to reproduce the total repeat content and ratios I expected with the run below, but scaling the resources similarly for other, larger genomes did not work:

#!/bin/sh
#SBATCH -e edta_singularity_%j.err
#SBATCH -o edta_singularity_%j.out
#SBATCH --job-name=edta_singularity
#SBATCH --time-min=120:00:00
#SBATCH -c 100
#SBATCH --mem=300G
#SBATCH --partition=plant
#SBATCH --nodes=1

module load cluster/singularity/3.11.0

export PYTHONNOUSERSITE=1

singularity exec [path to]/EDTA.sif EDTA.pl --genome hap2_curated.FINAL.fasta --anno 1
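
One detail in the script above that may be worth ruling out: the `EDTA.pl` call has no `--threads`/`-t` option, so EDTA runs with its built-in default thread count (4, as far as I recall; `EDTA.pl -h` shows the actual value) no matter how many CPUs `-c` requests. Below is a minimal sketch for keeping the two in sync. It assumes SLURM has exported `SLURM_CPUS_PER_TASK`, which it does only when `-c`/`--cpus-per-task` is given, and `edta_threads` is a hypothetical helper name. This is not a confirmed cause of the low repeat content, only a way to make the allocation and the thread count agree.

```shell
# Pick a thread count from the SLURM allocation, with a fallback.
# $1 (optional): fallback used outside SLURM or when -c was not requested.
# The final fallback of 4 is an assumption about EDTA's default.
edta_threads() {
    echo "${SLURM_CPUS_PER_TASK:-${1:-4}}"
}

# Usage inside the batch script (shown as a comment, not run here):
#   singularity exec [path to]/EDTA.sif EDTA.pl \
#       --genome hap2_curated.FINAL.fasta --anno 1 --threads "$(edta_threads)"
```

With `#SBATCH -c 100` the helper prints 100, so the `--threads` value always follows the requested allocation.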

Here's the SLURM job output (again, not actually using much of the resources allocated):

Job         1322805 (COMPLETED)
Name        edta_singularity
Submit      sbatch edta.sh
Nodes       plant - plant01
Input       /dev/null
Output      [path to]/edta_singularity_1322805.out
Error       [path to]/edta_singularity_1322805.err
Resources   CPU = 100 Memory = 307200
Start       2023-08-04 11:47:48
End         2023-08-04 19:17:14
Elapsed     449.43 minutes
Limit       28800 minutes
Exit Code       SUCCESS (0)

Usage:
min         CPU = 60512.09 sec (16:48:32.09, 2.24 %)
min         Mem = 12943.504 MB (4.21 %)
max         CPU = 60512.09 sec (16:48:32.09, 2.24 %)
max         Mem = 12943.504 MB (4.21 %)
average         CPU = 60512.09 sec (16:48:32.09, 2.24 %)
average         Mem = 12943.504 MB (4.21 %)
total           CPU = 60512.09 sec (16:48:32.09, 2.24 %)
total           Mem = 12943.504 MB (4.21 %)

And finally, the EDTA .sum file output:

Repeat Classes
==============
Total Sequences: 9
Total Length: 298741932 bp
Class                  Count        bpMasked     %masked
=====                  =====        ========     =======
LTR                    --           --           --
    Copia              35182        32118649     10.75%
    Gypsy              19775        17604895     5.89%
    unknown            13685        5906158      1.98%
TIR                    --           --           --
    CACTA              27395        11192622     3.75%
    Mutator            42888        14389429     4.82%
    PIF_Harbinger      6605         2014056      0.67%
    Tc1_Mariner        763          218663       0.07%
    hAT                22646        10385771     3.48%
nonTIR                 --           --           --
    helitron           11927        3650322      1.22%
                      ---------------------------------
    total interspersed 180866       97480565     32.63%

---------------------------------------------------------
Total                  180866       97480565     32.63%
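
Because the job exits with SUCCESS (0) in both cases, a small guard on the .sum file can flag an implausible result automatically. The sketch below pulls the final `Total` percentage out of a summary laid out like the two above and compares it with an expected minimum. The function names are hypothetical, and the file name in the usage comment is an assumption based on EDTA's `<genome>.mod.EDTA.TEanno.sum` naming.

```shell
# Print the whole-genome %masked from the final "Total" row of a .sum file.
# Assumes the table layout shown above; the "Total Sequences" and
# "Total Length" lines are skipped because they do not end in "%".
sum_total_pct() {
    LC_ALL=C awk '$1 == "Total" && $NF ~ /%$/ { sub(/%$/, "", $NF); print $NF; exit }' "$1"
}

# Return non-zero when the total is below an expected minimum (in percent).
# $1: path to the .sum file, $2: minimum acceptable %masked
check_te_total() {
    pct=$(sum_total_pct "$1")
    [ -n "$pct" ] || return 2   # no Total row found
    LC_ALL=C awk -v p="$pct" -v m="$2" 'BEGIN { exit !(p + 0 >= m + 0) }'
}

# Usage (file name is an assumption, not run here):
#   check_te_total hap2_curated.FINAL.fasta.mod.EDTA.TEanno.sum 25 \
#       || echo "TE content lower than expected" >&2
```

Run against the first summary (6.90%) with a minimum of 25, the check fails; against the second (32.63%), it passes.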

oushujun commented 9 months ago

I notice you switched from conda to Singularity while increasing the memory allocation. The two may use different versions of EDTA, RepeatMasker, and rmblast. You may want to compare the versions of these packages between the two installations.

Shujun
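
A sketch of one way to run that comparison, assuming the tools are on `PATH` in the environment being checked. `RepeatMasker -v` and `rmblastn -version` are the version flags I know for those tools; `report_version` is a hypothetical helper, and only the first line of each tool's output is kept.

```shell
# Print "<tool>: <first line of its version output>", or "<tool>: not found".
# $1: executable name, remaining arguments: the tool's version flag(s)
report_version() {
    tool=$1
    shift
    if command -v "$tool" >/dev/null 2>&1; then
        printf '%s: %s\n' "$tool" "$("$tool" "$@" 2>&1 | head -n 1)"
    else
        printf '%s: not found\n' "$tool"
    fi
}

# Usage in the conda environment (not run here):
#   report_version RepeatMasker -v
#   report_version rmblastn -version
# For the container, the same tool can be called through singularity:
#   singularity exec [path to]/EDTA.sif RepeatMasker -v
```

Running the same calls in both installations and comparing the lines side by side shows whether the two setups differ.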

laramiemckenna commented 9 months ago

@oushujun -- I'm sorry I didn't clarify this in the original post, but I've tried both versions available via singularity and the current version available via conda. The only thing that has worked (tentatively) is increasing cpus-per-task and sometimes memory allocation, but EDTA is not actually using all of the resources allocated (4.21% in the run above) and I can't replicate this success on larger genomes.

oushujun commented 9 months ago

Some processes in EDTA are single-threaded and can be slow on some genomes, if that is your question. As long as it finishes without errors, it should be fine. You do need to use the latest version of EDTA, though, which is not available in Singularity.

Shujun

laramiemckenna commented 9 months ago

@oushujun -- I have also used the most recent version. EDTA finishes without error in every case, even when it's estimating a total repeat content of 6% for a known 30-33% genome or 15% for a known 80% genome.

laramiemckenna commented 9 months ago

@oushujun -- I just wanted to follow up on this. Do you have any guesses as to what might be causing the behavior I described above?

oushujun commented 9 months ago

@laramiemckenna I am not sure. I have not seen this behavior before. Even for the small Arabidopsis genome, the EDTA annotation is reasonable and captures the major numbers. If you don't see any errors, I don't know what may go wrong. Did you test on the rice or Arabidopsis genome?

laramiemckenna commented 8 months ago

@oushujun I was not sure what the expected output was for the test data, but I did run EDTA on Arabidopsis TAIR10.1 using the same parameters as the successful run above (the second example in the original issue). Below is what I got, compared to an expected ~21%. I'm especially confused because these exact parameters, version, and image produced a reasonable result in that earlier run but not in this one.

Script

#!/bin/sh
#SBATCH -e edta_singularity_%j.err
#SBATCH -o edta_singularity_%j.out
#SBATCH --job-name=edta_singularity
#SBATCH --time-min=120:00:00
#SBATCH -c 100
#SBATCH --mem=300G
#SBATCH --partition=plant
#SBATCH --nodes=1

module load cluster/singularity/3.11.0

export PYTHONNOUSERSITE=1

singularity exec [path to]/EDTA.sif EDTA.pl --genome GCA_000001735.2_TAIR10.1_genomic.fna --anno 1

SLURM job output:

Job         1514574 (COMPLETED)
Name        edta_singularity
Submit      sbatch edta.sh
Nodes       plant - plant02
PWD         [path to]/arabi_edta_test
Input       /dev/null
Output      [path to]/arabi_edta_test/edta_singularity_1514574.out
Error       [path to]/arabi_edta_test/edta_singularity_1514574.err
Resources   CPU = 100 Memory = 307200
Start       2023-10-03 09:31:32
End         2023-10-03 11:58:44
Elapsed     147.2 minutes
Limit       28800 minutes
Exit Code       SUCCESS (0)

Usage:
min         CPU = 19205.11 sec (5:20:05.11, 2.17 %)
min         Mem = 5237.469 MB (1.7 %)
max         CPU = 19205.11 sec (5:20:05.11, 2.17 %)
max         Mem = 5237.469 MB (1.7 %)
average         CPU = 19205.11 sec (5:20:05.11, 2.17 %)
average         Mem = 5237.469 MB (1.7 %)
total           CPU = 19205.11 sec (5:20:05.11, 2.17 %)
total           Mem = 5237.469 MB (1.7 %)

Summary Output:

Repeat Classes
==============
Total Sequences: 7
Total Length: 119482896 bp
Class                  Count        bpMasked     %masked
=====                  =====        ========     =======
LTR                    --           --           --
    Copia              786          925858       0.77%
    Gypsy              1634         2410885      2.02%
    unknown            405          352244       0.29%
TIR                    --           --           --
    CACTA              589          405463       0.34%
    Mutator            1364         792585       0.66%
    PIF_Harbinger      284          150803       0.13%
    Tc1_Mariner        23           27241        0.02%
    hAT                237          105587       0.09%
nonTIR                 --           --           --
    helitron           3066         1818477      1.52%
                      ---------------------------------
    total interspersed 8388         6989143      5.85%

---------------------------------------------------------
Total                  8388         6989143      5.85%

laramiemckenna commented 8 months ago

Below is the output of the test run if that helps!

Script (using same parameters)

#!/bin/sh
#SBATCH -e edta_singularity_%j.err
#SBATCH -o edta_singularity_%j.out
#SBATCH --job-name=edta_singularity
#SBATCH --time-min=120:00:00
#SBATCH -c 100
#SBATCH --mem=300G
#SBATCH --partition=plant
#SBATCH --nodes=1

module load cluster/singularity/3.11.0

export PYTHONNOUSERSITE=1

singularity exec [path to]/EDTA.sif EDTA.pl --genome genome.fa --cds genome.cds.fa --curatedlib ../database/rice6.9.5.liban --exclude genome.exclude.bed --overwrite 1 --sensitive 1 --anno 1 --evaluate 1 --threads 10

SLURM Job Output

Job         1515039 (COMPLETED)
Name        edta_singularity
Nodes       plant - plant02
Command     [path to]/test/test_edta.sh
PWD         [path to]/test
Input       /dev/null
Output      [path to]/edta_singularity_1515039.out
Error       [path to]/edta_singularity_1515039.err
CPU         nodes = 1 cpus = 100 tasks = 1
TRES        cpu=100,mem=300G,node=1,billing=100
Start       2023-10-03 13:33:09
End         2023-10-03 13:36:25
Elapsed     7.77 minutes
Limit       28800 minutes

Summary Output:

Repeat Classes
==============
Total Sequences: 1
Total Length: 1000000 bp
Class                  Count        bpMasked     %masked
=====                  =====        ========     =======
LTR                    --           --           --
    Copia              13           18315        1.83%
    Gypsy              46           107087       10.71%
    TRIM               1            129          0.01%
    unknown            1            248          0.02%
TIR                    --           --           --
    CACTA              24           20363        2.04%
    Mutator            110          47775        4.78%
    PIF_Harbinger      110          27512        2.75%
    Tc1_Mariner        124          48718        4.87%
    hAT                34           13891        1.39%
    unknown            15           2972         0.30%
nonLTR                 --           --           --
    LINE_element       28           10614        1.06%
    SINE_element       11           2329         0.23%
nonTIR                 --           --           --
    helitron           81           57826        5.78%
                      ---------------------------------
    total interspersed 598          357779       35.78%

---------------------------------------------------------
Total                  598          357779       35.78%

Error File:

/opt/conda/lib/python3.6/site-packages/Bio/Seq.py:2983: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
  BiopythonWarning,
2023-10-03 13:35:58,608 -INFO- HMM scanning against `/opt/conda/lib/python3.6/site-packages/TEsorter/database/REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm`
2023-10-03 13:35:58,642 -INFO- Creating server instance (pp-1.6.4.4)
2023-10-03 13:35:58,642 -INFO- Running on Python 3.6.13 linux
2023-10-03 13:35:59,080 -INFO- pp local server started with 10 workers
2023-10-03 13:35:59,097 -INFO- Task 0 started
2023-10-03 13:35:59,098 -INFO- Task 1 started
2023-10-03 13:35:59,098 -INFO- Task 2 started
2023-10-03 13:35:59,098 -INFO- Task 3 started
2023-10-03 13:35:59,098 -INFO- Task 4 started
2023-10-03 13:35:59,099 -INFO- Task 5 started
2023-10-03 13:35:59,099 -INFO- Task 6 started
2023-10-03 13:35:59,099 -INFO- Task 7 started
2023-10-03 13:35:59,099 -INFO- Task 8 started
2023-10-03 13:35:59,100 -INFO- Task 9 started
2023-10-03 13:35:59,730 -INFO- generating gene anntations
2023-10-03 13:35:59,748 -INFO- 2 sequences classified by HMM
2023-10-03 13:35:59,748 -INFO- see protein domain sequences in `genome.cds.fa.code.rexdb.dom.faa` and annotation gff3 file in `genome.cds.fa.code.rexdb.dom.gff3`
2023-10-03 13:35:59,748 -INFO- classifying the unclassified sequences by searching against the classified ones
2023-10-03 13:35:59,761 -INFO- using the 80-80-80 rule
2023-10-03 13:35:59,761 -INFO- run CMD: `makeblastdb -in ./tmp/pass1_classified.fa -dbtype nucl`
2023-10-03 13:35:59,827 -INFO- run CMD: `blastn -query ./tmp/pass1_unclassified.fa -db ./tmp/pass1_classified.fa -out ./tmp/pass1_unclassified.fa.blastout -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen qcovs qcovhsp sstrand' -num_threads 10`
2023-10-03 13:35:59,940 -INFO- 1 sequences classified in pass 2
2023-10-03 13:35:59,940 -INFO- total 3 sequences classified.
2023-10-03 13:35:59,940 -INFO- see classified sequences in `genome.cds.fa.code.rexdb.cls.tsv`
2023-10-03 13:35:59,940 -INFO- writing library for RepeatMasker in `genome.cds.fa.code.rexdb.cls.lib`
2023-10-03 13:35:59,949 -INFO- writing classified protein domains in `genome.cds.fa.code.rexdb.cls.pep`
2023-10-03 13:35:59,951 -INFO- Summary of classifications:
Order           Superfamily  # of Sequences# of Clade Sequences    # of Clades# of full Domains
LTR             Gypsy                         1              1              1              0
Maverick        unknown                       2              0              0              0
2023-10-03 13:35:59,952 -INFO- Pipeline done.
2023-10-03 13:35:59,952 -INFO- cleaning the temporary directory ./tmp
Tue Oct  3 13:36:11 CDT 2023    Homology-based annotation of TEs using genome.fa.mod.EDTA.TElib.fa from scratch.

Out File:

Tue Oct  3 13:34:43 CDT 2023    EDTA advance filtering finished.

Tue Oct  3 13:34:43 CDT 2023    Perform EDTA final steps to generate a non-redundant comprehensive TE library:

                                Use RepeatModeler to identify any remaining TEs that are missed by structure-based methods.

Tue Oct  3 13:35:58 CDT 2023    Clean up TE-related sequences in the CDS file with TEsorter:

                                Remove CDS-related sequences in the EDTA library:

Tue Oct  3 13:36:05 CDT 2023    Combine the high-quality TE library rice6.9.5.liban with the EDTA library:

Tue Oct  3 13:36:11 CDT 2023    EDTA final stage finished! You may check out:
                                The final EDTA TE library: genome.fa.mod.EDTA.TElib.fa
                                Family names of intact TEs have been updated by rice6.9.5.liban: genome.fa.mod.EDTA.intact.gff3
                                Comparing to the provided library, EDTA found these novel TEs: genome.fa.mod.EDTA.TElib.novel.fa
                                The provided library has been incorporated into the final library: genome.fa.mod.EDTA.TElib.fa

Tue Oct  3 13:36:11 CDT 2023    Perform post-EDTA analysis for whole-genome annotation:

Tue Oct  3 13:36:17 CDT 2023    TE annotation using the EDTA library has finished! Check out:
                                Whole-genome TE annotation (total TE: 35.78%): genome.fa.mod.EDTA.TEanno.gff3
                                Whole-genome TE annotation summary: genome.fa.mod.EDTA.TEanno.sum
                                Low-threshold TE masking for MAKER gene annotation (masked: 16.47%): genome.fa.mod.MAKER.masked

Tue Oct  3 13:36:17 CDT 2023    Evaluate the level of inconsistency for whole-genome TE annotation (slow step):

Tue Oct  3 13:36:25 CDT 2023    Evaluation of TE annotation finished! Check out these files:

                                Overall: genome.fa.mod.EDTA.TE.fa.stat.all.sum
                                Nested: genome.fa.mod.EDTA.TE.fa.stat.nested.sum
                                Non-nested: genome.fa.mod.EDTA.TE.fa.stat.redun.sum

oushujun commented 8 months ago

@laramiemckenna Sorry, I also don't understand why you get such a low TE percentage in Arabidopsis. The only abnormal thing I see is the use of the Singularity version, which is old and outdated. You may want to try the conda version instead and use the latest GitHub code.

rjohnson-ha commented 8 months ago

@oushujun The EDTA.yml file for the conda installation still specifies EDTA 2.0.1, but the rest of the repo appears to be much newer (2.1.3). Is there a newer version of this yaml file available, or are there details on how to combine your conda installation instructions with the newer code in the repo?

laramiemckenna commented 8 months ago

@oushujun -- do you mean that I should use the 2.1.0 version and use EDTA.pl through the current repository, which is 2.1.3?

oushujun commented 8 months ago

Yes!
