lbeltrame closed this issue 2 months ago
Do you get an empty exit code from SLURM?
This is because the process is killed, and since it is piped into samtools, samtools errors out because the file is incomplete.
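To illustrate the masking described above, here is a minimal shell sketch. `producer` and `consumer` are hypothetical stand-ins for fgbio and samtools sort, not the real tools: the producer writes partial output and is then killed with SIGKILL (the signal an out-of-memory handler sends), and the consumer fails with status 1 on the truncated input.

```shell
#!/bin/sh
# Hypothetical stand-in for fgbio: writes partial output, then dies from
# SIGKILL, which is what a cgroup out-of-memory handler sends.
producer() { sh -c 'printf "partial\n"; kill -9 $$'; }

# Hypothetical stand-in for samtools sort: drains its input, then fails
# the way a tool does when it detects a truncated stream.
consumer() { cat >/dev/null; return 1; }

# Run the producer on its own: the shell reports 128 + 9 = 137 for SIGKILL.
alone=0
producer >/dev/null || alone=$?

# Run the same producer inside a pipeline: the pipeline reports the
# consumer's status, so the kill is no longer visible in the exit code.
piped=0
producer | consumer || piped=$?

echo "alone=$alone piped=$piped"
```

On bash or dash this prints `alone=137 piped=1`. Enabling `set -o pipefail` in bash would not change the second value, because pipefail reports the rightmost non-zero status, which is the consumer's. That is consistent with the `Command exit status: 1` in the log.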
EDIT: Pasted the wrong log.
fgbio \
-Xmx8g \
--tmp-dir=. \
--compression=0 \
FilterConsensusReads \
--input 5571767_P4.mapped.bam \
--output /dev/stdout \
--ref hg38.fa \
--min-reads 1 \
--min-base-quality 20 \
--max-base-error-rate 0.1 \
\
| samtools sort \
--threads 4 \
-o 5571767_P4.cons.filtered.bam##idx##5571767_P4.cons.filtered.bam.bai \
--write-index \
;
cat <<-END_VERSIONS > versions.yml
"NFCORE_FASTQUORUM:FASTQUORUM:FILTERCONSENSUSREADS":
fgbio: $( echo $(fgbio --version 2>&1 | tr -d '[:cntrl:]' ) | sed -e 's/^.*Version: //;s/\[.*$//')
END_VERSIONS
Command exit status:
1
Command output:
(empty)
Command error:
[2024/06/04 10:38:20 | FgBioMain | Info] Executing FilterConsensusReads from fgbio version 2.0.2 as lbeltrame@node7 on JRE 17.0$
3-internal+0-adhoc..src with snappy, IntelInflater, and IntelDeflater
[2024/06/04 10:39:07 | FilterConsensusReads | Info] Filtering reads.
[2024/06/04 10:39:08 | SamWriter | Info] Seen many non-increasing record positions. Printing Read-names as well.
[W::bgzf_read_block] EOF marker is absent. The input may be truncated
samtools sort: truncated file. Aborting
It doesn't tell you why, but looking at .command.log:
[2024/06/04 12:33:57 | FgBioMain | Info] Executing FilterConsensusReads from fgbio version 2.0.2 as lbeltrame@node4 on JRE 17.0.3-internal+0-adhoc..src with snappy, IntelInflater, and IntelDeflater
[2024/06/04 12:34:55 | FilterConsensusReads | Info] Filtering reads.
[2024/06/04 12:34:56 | SamWriter | Info] Seen many non-increasing record positions. Printing Read-names as well.
[W::bgzf_read_block] EOF marker is absent. The input may be truncated
samtools sort: truncated file. Aborting
slurmstepd-node4: error: Detected 1 oom-kill event(s) in step 647900.batch cgroup. Some of your processes may have been killed by
the cgroup out-of-memory handler.
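Since the task's own error output doesn't mention the kill, one way to spot it is to search the `.command.log` files for the slurmstepd message. A minimal sketch, assuming the logs sit somewhere under a directory tree such as a Nextflow work directory; `find_oom_logs` is a hypothetical helper name, and the demo directories and files are created only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list every .command.log under "$1" that contains
# the slurmstepd out-of-memory message quoted above.
find_oom_logs() {
    find "$1" -type f -name .command.log -exec grep -l 'oom-kill event' {} +
}

# Demo with two fake task directories (illustrative only).
demo=$(mktemp -d)
mkdir -p "$demo/aa/killed" "$demo/bb/ok"
printf '%s\n' 'samtools sort: truncated file. Aborting' \
    'slurmstepd-node4: error: Detected 1 oom-kill event(s) in step 647900.batch cgroup.' \
    > "$demo/aa/killed/.command.log"
printf '%s\n' 'finished normally' > "$demo/bb/ok/.command.log"

# Only the log of the killed task should be reported.
hits=$(find_oom_logs "$demo")
echo "$hits"
rm -rf "$demo"
```

In real use the argument would be the pipeline's work directory rather than the temporary demo directory.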
Notice that there is a similar problem with FastQC. I had to make an override to raise the memory limit to 12G to make sure it finished.
I'd be glad to be pointed to where you think the config should live. Is the user config you're talking about in conf/base.config? Given that the FilterConsensusReads process config is module-level, why isn't it appropriate to have it in conf/modules.config?
Of course, users are always free to override any of the configs in this pipeline, which is by design. I think it's reasonable that folks would need to customize the resources based on their input data and compute environment. What am I missing?
Hi Nils, I think the issue is that the memory is fixed at 8 GB and doesn't increase when the module fails, because it doesn't have the user retry part.
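For reference, a retry-aware override could look roughly like the sketch below, which writes a user config file from the shell. Everything in it is an assumption for illustration: the file name, the selector (taken from the process name in versions.yml), the 16 GB base value and the retry count. It is not the change made in the pipeline.

```shell
#!/bin/sh
# Write a user-level Nextflow config. The file name is arbitrary.
cfg=custom_resources.config
cat > "$cfg" <<'EOF'
// Assumption: the selector matches the process name seen in versions.yml
// (NFCORE_FASTQUORUM:FASTQUORUM:FILTERCONSENSUSREADS).
process {
    withName: '.*:FILTERCONSENSUSREADS' {
        // Illustrative values: 16 GB on the first attempt, 32 GB on the second, ...
        memory        = { 16.GB * task.attempt }
        // Retry on any failure, since the OOM kill surfaced as exit status 1 here.
        errorStrategy = 'retry'
        maxRetries    = 2
    }
}
EOF
echo "wrote $cfg"
# Then pass it with -c, e.g.: nextflow run <pipeline> -c custom_resources.config
```

Whether the `-Xmx8g` passed to fgbio follows the `memory` directive depends on how the module builds its command, which isn't shown in this thread, so raising the directive alone may not raise the JVM heap.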
Fixed in #60
Description of the bug
There is a hardcoded limit of 8 GB in the FilterConsensusReads processes defined in modules.config. Aside from being on the low side for very deeply sequenced datasets (think ctDNA), it relies on retries to increase memory requirements should the task fail. However, with many schedulers such as SLURM, a task that runs out of memory (a likely possibility with 10,000X data) is killed, causing an error and the premature end of the pipeline, so it is never retried. I would think this should belong to the user's definitions and should not be hardcoded.
Command used and terminal output
No response
Relevant files
No response
System information