Closed eeaunin closed 5 months ago
Hello,
Command terminated by signal 9
means the command was terminated by an external SIGKILL signal, sent by the user or (more likely) the execution environment. Assuming you did not kill the job, you may want to contact your LSF admin to troubleshoot.
Does this occur on all your FCS-GX jobs? Did you get the test dataset/database to run successfully?
Hello. Thanks for the reply. I didn't kill the job myself, so something in the execution environment must have triggered SIGKILL. I will ask the LSF administrators about this. This error doesn't occur on all FCS-GX jobs. Most FCS-GX jobs that I submit run successfully but there are some that crash, and it looks like there are multiple unrelated causes for the crashes
I have asked the LSF administrators about these crashes. They haven't found any reason why these runs get terminated. In LSF logs, it doesn't look like the jobs go over MEMLIMIT or run time limit. So the issue is still unresolved
I'm not familiar with LSF job scheduling, but presumably a job would be killed that exceeds a preconfigured quota
In LSF logs, it doesn't look like the jobs go over MEMLIMIT or run time limit.
Can you confirm this and attach any relevant logs? Can you look into the output of bjobs -l <job_id>
which I think should have the max-mem quota for the job?
Also does this happen with certain input FASTAs every time? i.e. if you re-run the input for a failed job does it always fail?
Below is the LSF log that corresponds to the crashed job above. The FCS-GX run was triggered using Nextflow. I have replaced the real directory paths with /path/to/user/directory
and /path/to/run/directory
in the log before posting it here.
According to the LSF log, Max Memory
used was 2469 MB and the quota for Max Memory
was 102400.00 MB. So the quota was about 40x higher than the maximum memory use reported by LSF. The LSF job also requested having at least 600 Gb free space available in /tmp
before starting the run.
You can also see that the run started and finished on the same day, and it was running in the week
queue which has the run time limit of one week. So it definitely didn't run out of LSF run time.
The LSF logs of all the other FCS-GX runs that crashed with the [Command terminated by signal 9]
error look similar to this one.
The crashes don't happen every time with the same input FASTA. When repeating a crashed run with exactly the same FASTA file and exactly the same settings (including the same amount of memory requested), the run may crash or may finish successfully: the outcome is unpredictable.
It's been two weeks since I contacted the LSF administrators about these crashes but they haven't been able to figure out what causes the crashes either.
------------------------------------------------------------
Sender: LSF System <lsfadmin@tol-1-1-4>
Subject: Job 7954052: <nf-FCSGX> in cluster <tol> Exited
Job <nf-FCSGX> was submitted from host <tol-1-1-3> by user <ea10> in cluster <tol> at Wed Feb 14 03:58:15 2024
Job was executed on host(s) <24*tol-1-1-4>, in queue <week>, as user <ea10> in cluster <tol> at Sat Feb 17 16:31:58 2024
</path/to/user/directory/ea10> was used as the home directory.
</path/to/run/directory/fasta/ascc/20240214_aLisHel1.20231230.haplotigs.split_fasta_0000_asg_cobiont_check_run/work/30/966e6cdbc5202899514a815de592f8> was used as the working directory.
Started at Sat Feb 17 16:31:58 2024
Terminated at Sat Feb 17 22:07:35 2024
Results reported at Sat Feb 17 22:07:35 2024
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
#!/bin/bash
#BSUB -o /path/to/run/directory/fasta/ascc/20240214_aLisHel1.20231230.haplotigs.split_fasta_0000_asg_cobiont_check_run/work/30/966e6cdbc5202899514a815de592f8/.command.log
#BSUB -q week
#BSUB -n 24
#BSUB -R "span[hosts=1]"
#BSUB -M 102400
#BSUB -R "select[mem>=102400] rusage[mem=102400]"
#BSUB -J nf-FCSGX
#BSUB -R "select[tmp>600G] rusage[tmp=600G]"
NXF_CHDIR=/path/to/run/directory/fasta/ascc/20240214_aLisHel1.20231230.haplotigs.split_fasta_0000_asg_cobiont_check_run/work/30/966e6cdbc5202899514a815de592f8
# NEXTFLOW TASK: FCSGX
set -e
set -u
NXF_DEBUG=${NXF_DEBUG:=0}; [[ $NXF_DEBUG > 1 ]] && set -x
NXF_ENTRY=${1:-nxf_main}
nxf_tree() {
local pid=$1
declare -a ALL_CHILDREN
while read P PP;do
ALL_CHILDREN[$PP]+=" $P"
done < <(ps -e -o pid= -o ppid=)
pstat() {
local x_pid=$1
local STATUS=$(2> /dev/null < /proc/$1/status grep -E 'Vm|ctxt')
if [ $? = 0 ]; then
local x_vsz=$(echo "$STATUS" | grep VmSize | awk '{print $2}' || echo -n '0')
local x_rss=$(echo "$STATUS" | grep VmRSS | awk '{print $2}' || echo -n '0')
local x_peak=$(echo "$STATUS" | grep -E 'VmPeak|VmHWM' | sed 's/^.*:\s*//' | sed 's/[\sa-zA-Z]*$//' | tr '\n' ' ' || echo -n '0 0')
local x_pmem=$(awk -v rss=$x_rss -v mem_tot=$mem_tot 'BEGIN {printf "%.0f", rss/mem_tot*100*10}' || echo -n '0')
local vol_ctxt=$(echo "$STATUS" | grep '\bvoluntary_ctxt_switches' | awk '{print $2}' || echo -n '0')
local inv_ctxt=$(echo "$STATUS" | grep '\bnonvoluntary_ctxt_switches' | awk '{print $2}' || echo -n '0')
cpu_stat[x_pid]="$x_pid $x_pmem $x_vsz $x_rss $x_peak $vol_ctxt $inv_ctxt"
fi
}
pwalk() {
pstat $1
for i in ${ALL_CHILDREN[$1]:=}; do pwalk $i; done
}
pwalk $1
}
nxf_stat() {
cpu_stat=()
nxf_tree $1
(... more ...)
------------------------------------------------------------
Exited with exit code 1.
Resource usage summary:
CPU time : 16450.00 sec.
Max Memory : 2469 MB
Average Memory : 1852.38 MB
Total Requested Memory : 102400.00 MB
Delta Memory : 99931.00 MB
Max Swap : -
Max Processes : 28
Max Threads : 89
Run time : 20136 sec.
Turnaround time : 324560 sec.
The output (if any) is above this job summary.
(END)
GX needs ~470GiB to run. Can you try setting bsub -M 480000 -R "rusage[mem=480000]"
? Insufficient memory allocation could be the source of the failed job, and not from FCS-GX "crashing."
If GX always needs ~470 GiB to run, how did many runs that I've done finish successfully with much less memory allocated to them? I've done over a hundred runs that didn't crash with 70 Gb or 100 Gb memory assigned to them, and their output looks complete: the final report files have been produced and the report covers the entire assembly. And why do the LSF logs of the crashed FCS-GX runs not show running out of memory? Usually when an LSF job runs out of memory this gets reported in the LSF job's error stream log with the message TERM_MEMLIMIT: job killed after reaching LSF memory usage limit
. This message doesn't appear in the logs of the crashed FCS-GX runs.
Also, if the crashes are just a matter of memory allocation, why do runs with the exact same assembly FASTA file and the same number of gigabytes of memory assigned in the bsub
command sometimes work fine and sometimes crash, when retrying a run multiple times?
I will try assigning FCS-GX 470 Gb memory anyway to see if that helps
Hello. Some of my FCS-GX runs crash with the error message
Command terminated by signal 9
. This is a part of the run log that contains the error message:These are the software versions used for this run: OS: Ubuntu 18.04.6 LTS Singularity: v3.10.0 FCS image: 0.5.0 Python: 3.10.11
Platform: LSF
Do you know what has happened here and how to prevent the crash? I can give you more details about the crashed run if needed.