ncbi / fcs

Foreign Contamination Screening caller scripts and documentation
Other
101 stars 13 forks source link

Command terminated by signal 9 #69

Closed eeaunin closed 5 months ago

eeaunin commented 7 months ago

Hello. Some of my FCS-GX runs crash with the error message Command terminated by signal 9. This is a part of the run log that contains the error message:

-----------------------------------------------------------------------------

tax-id    : 339868
fasta     : /sample-volume/assembly.fasta
size      : 18899.96 MiB
split-fa  : True
####### Starting process ['zcat', '-f', '/app/db/gxdb/ea10_gx_mapper_53638/all.blast_div.tsv.gz']
####### Starting process ['grep', '-E', '^339868\t']
####### Cleaning up process ['zcat', '-f', '/app/db/gxdb/ea10_gx_mapper_53638/all.blast_div.tsv.gz']
####### Cleaning up process ['grep', '-E', '^339868\t']
BLAST-div : salamanders
gx-div    : anml:amphibians
w/same-tax: True
bin-dir   : /app/bin
gx-db     : /app/db/gxdb/ea10_gx_mapper_53638/all.gxi
gx-ver    : Nov 27 2023 11:05:36; git:v0.5.0+branch--HEAD
output    : /output-volume//assembly.339868.taxonomy.rpt

-----------------------------------------------------------------------------

####### args: Namespace(fasta='/sample-volume/assembly.fasta', tax_id=339868, species=None, split_fasta=True, div='anml:amphibians', gx_db='/app/db/gxdb/ea10_gx_mapper_53638/all.gxi', mask_transposons=None, bin_dir='/app/bin', allow_same_species=True, ignore_same_kingdom=False, out_basename='/output-volume//assembly.339868', out_dir='/output-volume/', action_report=True, save_hits=False, generate_logfile=False, debug=True, phone_home_label=None, gc_acc=None, gc_genomes_root_dir=None, production_build_name=None, gzip_c='gzip -c', out_taxonomy_rpt='/output-volume//assembly.339868.taxonomy.rpt') 

####### Starting process ['cat', '/sample-volume/assembly.fasta']
####### Starting process ['gzip', '-cdf']
ERROR: ld.so: object '/software/mistral/mistral_2023.1.1_x86_64/dryrun/lib64/libmistral.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
####### Starting process ['/app/bin/gx', 'split-fasta']
####### Starting process ['pv', '-Wbratpe', '--interval=0.5', '--size=19818045242']
ERROR: ld.so: object '/software/mistral/mistral_2023.1.1_x86_64/dryrun/lib64/libmistral.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
####### Starting process ['cat', '/sample-volume/assembly.fasta']
####### Starting process ['gzip', '-cdf']
ERROR: ld.so: object '/software/mistral/mistral_2023.1.1_x86_64/dryrun/lib64/libmistral.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
####### Starting process ['/busybox/time', '-v', 'nice', '-n19', '/app/bin/gx', 'align', '--gx-db=/app/db/gxdb/ea10_gx_mapper_53638/all.gxi', '--repeats-basis-fa=/dev/fd/5']
####### Starting process ['/app/bin/gx', 'taxify', '--gx-db=/app/db/gxdb/ea10_gx_mapper_53638/all.gxi', '--output=/output-volume//assembly.339868.taxonomy.rpt.tmp', '--asserted-div=anml:amphibians', '--db-exclude-locs=/app/bin/db_exclude.locs.tsv']
Using GX_PREFETCH=0
Collecting masking statistics...
Collected masking stats:  19.5708 Gbp; 276.593s; 70.7566 Mbp/s. Baseline: 7.28905

Command terminated by signal 9
        Command being timed: "nice -n19 /app/bin/gx align --gx-db=/app/db/gxdb/ea10_gx_mapper_53638/all.gxi --repeats-basis-fa=/dev/fd/5"
        User time (seconds): 22839.29
        System time (seconds): 4986.44
        Percent of CPU this job got: 279%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2h 45m 42s
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1394422432
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 243375231
        Minor (reclaiming a frame) page faults: 141607974
        Voluntary context switches: 247166939
        Involuntary context switches: 1014181
        Swaps: 0
        File system inputs: 2029139224
        File system outputs: 16
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
####### Cleaning up process ['/busybox/time', '-v', 'nice', '-n19', '/app/bin/gx', 'align', '--gx-db=/app/db/gxdb/ea10_gx_mapper_53638/all.gxi', '--repeats-basis-fa=/dev/fd/5']
Error: Process failed with retcode 9: ['/busybox/time', '-v', 'nice', '-n19', '/app/bin/gx', 'align', '--gx-db=/app/db/gxdb/ea10_gx_mapper_53638/all.gxi', '--repeats-basis-fa=/dev/fd/5'])
####### Cleaning up process ['cat', '/sample-volume/assembly.fasta']
Error: Process failed with retcode -13: ['cat', '/sample-volume/assembly.fasta'])
####### Cleaning up process ['gzip', '-cdf']
Error: Process failed with retcode -13: ['gzip', '-cdf'])
####### Cleaning up process ['/app/bin/gx', 'split-fasta']
Error: Process failed with retcode -13: ['/app/bin/gx', 'split-fasta'])
####### Cleaning up process ['pv', '-Wbratpe', '--interval=0.5', '--size=19818045242']
####### Cleaning up process ['/app/bin/gx', 'taxify', '--gx-db=/app/db/gxdb/ea10_gx_mapper_53638/all.gxi', '--output=/output-volume//assembly.339868.taxonomy.rpt.tmp', '--asserted-div=anml:amphibians', '--db-exclude-locs=/app/bin/db_exclude.locs.tsv']
####### Cleaning up process ['cat', '/sample-volume/assembly.fasta']
####### Cleaning up process ['gzip', '-cdf']

-----------------------------------------------------------------------------

These are the software versions used for this run: OS: Ubuntu 18.04.6 LTS Singularity: v3.10.0 FCS image: 0.5.0 Python: 3.10.11

Platform: LSF

Do you know what has happened here and how to prevent the crash? I can give you more details about the crashed run if needed.

etvedte commented 7 months ago

Hello,

Command terminated by signal 9 means the command was terminated by an external SIGKILL signal, sent by the user or (more likely) the execution environment. Assuming you did not kill the job, you may want to contact your LSF admin to troubleshoot.

Does this occur on all your FCS-GX jobs? Did you get the test dataset/database to run successfully?

eeaunin commented 7 months ago

Hello. Thanks for the reply. I didn't kill the job myself, so something in the execution environment must have triggered SIGKILL. I will ask the LSF administrators about this. This error doesn't occur on all FCS-GX jobs. Most FCS-GX jobs that I submit run successfully but there are some that crash, and it looks like there are multiple unrelated causes for the crashes

eeaunin commented 7 months ago

I have asked the LSF administrators about these crashes. They haven't found any reason why these runs get terminated. In LSF logs, it doesn't look like the jobs go over MEMLIMIT or run time limit. So the issue is still unresolved

etvedte commented 7 months ago

I'm not familiar with LSF job scheduling, but presumably a job would be killed that exceeds a preconfigured quota

In LSF logs, it doesn't look like the jobs go over MEMLIMIT or run time limit.

Can you confirm this and attach any relevant logs? Can you look into the output of bjobs -l <job_id> which I think should have the max-mem quota for the job?

Also does this happen with certain input FASTAs every time? i.e. if you re-run the input for a failed job does it always fail?

eeaunin commented 7 months ago

Below is the LSF log that corresponds to the crashed job above. The FCS-GX run was triggered using Nextflow. I have replaced the real directory paths with /path/to/user/directory and /path/to/run/directory in the log before posting it here. According to the LSF log, Max Memory used was 2469 MB and the quota for Max Memory was 102400.00 MB. So the quota was about 40x higher than the maximum memory use reported by LSF. The LSF job also requested having at least 600 Gb free space available in /tmp before starting the run. You can also see that the run started and finished on the same day, and it was running in the week queue which has the run time limit of one week. So it definitely didn't run out of LSF run time. The LSF logs of all the other FCS-GX runs that crashed with the [Command terminated by signal 9] error look similar to this one. The crashes don't happen every time with the same input FASTA. When repeating a crashed run with exactly the same FASTA file and exactly the same settings (including the same amount of memory requested), the run may crash or may finish successfully: the outcome is unpredictable. It's been two weeks since I contacted the LSF administrators about these crashes but they haven't been able to figure out what causes the crashes either.

------------------------------------------------------------
Sender: LSF System <lsfadmin@tol-1-1-4>
Subject: Job 7954052: <nf-FCSGX> in cluster <tol> Exited

Job <nf-FCSGX> was submitted from host <tol-1-1-3> by user <ea10> in cluster <tol> at Wed Feb 14 03:58:15 2024
Job was executed on host(s) <24*tol-1-1-4>, in queue <week>, as user <ea10> in cluster <tol> at Sat Feb 17 16:31:58 2024
</path/to/user/directory/ea10> was used as the home directory.
</path/to/run/directory/fasta/ascc/20240214_aLisHel1.20231230.haplotigs.split_fasta_0000_asg_cobiont_check_run/work/30/966e6cdbc5202899514a815de592f8> was used as the working directory.
Started at Sat Feb 17 16:31:58 2024
Terminated at Sat Feb 17 22:07:35 2024
Results reported at Sat Feb 17 22:07:35 2024

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
#!/bin/bash
#BSUB -o /path/to/run/directory/fasta/ascc/20240214_aLisHel1.20231230.haplotigs.split_fasta_0000_asg_cobiont_check_run/work/30/966e6cdbc5202899514a815de592f8/.command.log
#BSUB -q week
#BSUB -n 24
#BSUB -R "span[hosts=1]"
#BSUB -M 102400
#BSUB -R "select[mem>=102400] rusage[mem=102400]"
#BSUB -J nf-FCSGX
#BSUB -R "select[tmp>600G] rusage[tmp=600G]"
NXF_CHDIR=/path/to/run/directory/fasta/ascc/20240214_aLisHel1.20231230.haplotigs.split_fasta_0000_asg_cobiont_check_run/work/30/966e6cdbc5202899514a815de592f8
# NEXTFLOW TASK: FCSGX
set -e
set -u
NXF_DEBUG=${NXF_DEBUG:=0}; [[ $NXF_DEBUG > 1 ]] && set -x
NXF_ENTRY=${1:-nxf_main}

nxf_tree() {
    local pid=$1

    declare -a ALL_CHILDREN
    while read P PP;do
        ALL_CHILDREN[$PP]+=" $P"
    done < <(ps -e -o pid= -o ppid=)

    pstat() {
        local x_pid=$1
        local STATUS=$(2> /dev/null < /proc/$1/status grep -E 'Vm|ctxt')

        if [ $? = 0 ]; then
        local  x_vsz=$(echo "$STATUS" | grep VmSize | awk '{print $2}' || echo -n '0')
        local  x_rss=$(echo "$STATUS" | grep VmRSS | awk '{print $2}' || echo -n '0')
        local x_peak=$(echo "$STATUS" | grep -E 'VmPeak|VmHWM' | sed 's/^.*:\s*//' | sed 's/[\sa-zA-Z]*$//' | tr '\n' ' ' || echo -n '0 0')
        local x_pmem=$(awk -v rss=$x_rss -v mem_tot=$mem_tot 'BEGIN {printf "%.0f", rss/mem_tot*100*10}' || echo -n '0')
        local vol_ctxt=$(echo "$STATUS" | grep '\bvoluntary_ctxt_switches' | awk '{print $2}' || echo -n '0')
        local inv_ctxt=$(echo "$STATUS" | grep '\bnonvoluntary_ctxt_switches' | awk '{print $2}' || echo -n '0')
        cpu_stat[x_pid]="$x_pid $x_pmem $x_vsz $x_rss $x_peak $vol_ctxt $inv_ctxt"
        fi
    }
    pwalk() {
        pstat $1
        for i in ${ALL_CHILDREN[$1]:=}; do pwalk $i; done
    }

    pwalk $1
}

nxf_stat() {
    cpu_stat=()
    nxf_tree $1

(... more ...)
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time :                                   16450.00 sec.
    Max Memory :                                 2469 MB
    Average Memory :                             1852.38 MB
    Total Requested Memory :                     102400.00 MB
    Delta Memory :                               99931.00 MB
    Max Swap :                                   -
    Max Processes :                              28
    Max Threads :                                89
    Run time :                                   20136 sec.
    Turnaround time :                            324560 sec.

The output (if any) is above this job summary.

(END)
etvedte commented 7 months ago

GX needs ~470GiB to run. Can you try setting bsub -M 480000 -R "rusage[mem=480000]"? Insufficient memory allocation could be the source of the failed job, and not from FCS-GX "crashing."

eeaunin commented 7 months ago

If GX always needs ~470 GiB to run, how did many runs that I've done finish successfully with much less memory allocated to them? I've done over a hundred runs that didn't crash with 70 Gb or 100 Gb memory assigned to them, and their output looks complete: the final report files have been produced and the report covers the entire assembly. And why do the LSF logs of the crashed FCS-GX runs not show running out of memory? Usually when an LSF job runs out of memory this gets reported in the LSF job's error stream log with the message TERM_MEMLIMIT: job killed after reaching LSF memory usage limit. This message doesn't appear in the logs of the crashed FCS-GX runs. Also, if the crashes are just a matter of memory allocation, why do runs with the exact same assembly FASTA file and the same number of gigabytes of memory assigned in the bsub command sometimes work fine and sometimes crash, when retrying a run multiple times? I will try assigning FCS-GX 470 Gb memory anyway to see if that helps