*Closed by dennishendriksen 1 year ago*
Although it's markdup that is failing, it sounds like it may not be markdup-specific. The error states that the function `libdeflate_alloc_compressor` failed, and looking at the libdeflate source this happens either because the compression level is < 0 or > 12 (unlikely) or because the `libdeflate_aligned_malloc` function it calls fails, which most likely boils down to a failed `malloc`.
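As an illustration of that failure path, here is a hypothetical sketch (not libdeflate's actual code) of an allocator that returns NULL either for an out-of-range compression level or when the underlying `malloc` fails:

```c
#include <stdlib.h>

/* Hypothetical sketch (not libdeflate's actual code): an allocator that
 * fails either because the compression level is out of range or because
 * the underlying malloc fails, mirroring the two causes described above. */
struct compressor {
    int level;
    void *workspace;
};

struct compressor *alloc_compressor(int level, size_t workspace_size) {
    if (level < 0 || level > 12)           /* invalid level: refuse */
        return NULL;
    struct compressor *c = malloc(sizeof *c);
    if (!c)
        return NULL;                       /* out of memory */
    c->workspace = malloc(workspace_size); /* this is what fails under a tight limit */
    if (!c->workspace) {
        free(c);
        return NULL;                       /* ENOMEM surfaces to the caller as NULL */
    }
    c->level = level;
    return c;
}
```

A caller that treats a NULL return as fatal will then report exactly this kind of "alloc failed" message once the process hits a memory limit.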
Do you know if you have any memory limits set in this environment, either within the shell (`ulimit -a` will report them) or as options of a batch submission job? It's possible the failure here is simply going over a system-imposed memory limit.
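The effect of such a limit can be reproduced directly: capping the address space with `setrlimit` (the resource that `ulimit -v` controls) makes a large `malloc` return NULL rather than the process being OOM-killed. A minimal sketch, assuming a 64-bit Linux system:

```c
#include <stdlib.h>
#include <sys/resource.h>

/* Sketch, assuming 64-bit Linux: cap the address-space limit (the same
 * resource `ulimit -v` and batch-job memory limits constrain) and watch
 * a large malloc fail with NULL instead of the process being killed.
 * Returns 1 if a 1 GiB request failed under a 256 MiB cap. */
int malloc_fails_under_cap(void) {
    struct rlimit old, capped;
    getrlimit(RLIMIT_AS, &old);

    capped = old;
    capped.rlim_cur = 256ul * 1024 * 1024;  /* 256 MiB soft cap */
    setrlimit(RLIMIT_AS, &capped);

    void *big = malloc((size_t)1 << 30);    /* 1 GiB request */
    int failed = (big == NULL);
    free(big);

    setrlimit(RLIMIT_AS, &old);             /* restore the previous soft limit */
    return failed;
}
```

This is the mechanism by which a batch scheduler's memory limit can surface deep inside a library as an allocation failure rather than as an obvious kill.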
If memory limits are out of your control, then using fewer threads will reduce memory, as will changing the number of records per CRAM slice with an option such as `-O cram,seqs_per_slice=3000`. Note there was also a bug found recently where high numbers of secondary alignments of long-read data with `SEQ *` and long aux. tags would lead to very large CRAM containers, as they don't trigger the default maximum base-count limits. This was fixed in develop: https://github.com/samtools/htslib/pull/1613
Hi @jkbonfield,
Thank you for the quick reply. I'm indeed running with a memory limit defined in a SLURM batch job. Rerunning with a limit of 32GB instead of 16GB results in a slightly different error:
```
[M::main] Version: 2.24-r1122
[M::main] CMD: /usr/local/bin/minimap2 -t 8 -a -x sr GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz.mmi R1.fastq.gz R2.fastq.gz
[M::main] Real time: 26938.966 sec; CPU: 212805.794 sec; Peak RSS: 8.148 GB
[bam_sort_core] merging from 8 files and 8 in-memory blocks...
[E::cram_flush_thread] Call to cram_encode_container failed
[E::cram_flush_thread] Call to cram_encode_container failed
[E::cram_flush_thread] Call to cram_encode_container failed
samtools markdup: error, writing output failed.
[E::cram_flush_thread] Call to cram_encode_container failed
[E::cram_flush_thread] Call to cram_encode_container failed
samtools markdup: error closing output file.
```
The data is ~2x4 GB of paired-end Illumina short-read `.fastq` files. Do you have any additional thoughts based on this slightly different error? In the meantime I will try rerunning with 64 GB.
The fact it dies somewhere else implies it's still a memory issue. Does SLURM not give you any diagnostics at the end as to why the job failed, or a memory usage report?
I'm also wondering quite why the memory usage is so high. It's possible you're hitting the bug we fixed in #1613, although given how long it's been that way without complaint it's pretty unlikely. It's also possible we have a memory leak somewhere, but again it'd need to be something very specific to your data as we haven't detected it elsewhere.
For debugging it may be worth outputting a BAM file from the pipeline, and then testing the memory usage of a BAM to CRAM conversion. That would also cut out nearly all of the time required, permitting faster turnaround of experiments. We need to determine if it's a specific region of the alignment which is triggering this high memory (in which case maybe it's something the parameters can be adjusted to resolve), or whether it's just gradually growing in memory over time indicating a memory leak (something that can be manually inspected with "top" or similar).
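One way to instrument that distinction, sketched below assuming Linux: `getrusage` exposes the process's own peak RSS (the same high-water mark SLURM reports as MaxRSS), so sampling it around suspect phases separates a steadily growing leak from a localized spike:

```c
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

/* Sketch, assuming Linux: getrusage() reports the process's peak RSS in
 * KiB, the figure SLURM shows as MaxRSS. Sampling it before and after
 * suspect phases distinguishes a gradual leak (the value climbs
 * continuously) from a one-off spike in a specific region of the data. */
long peak_rss_kib(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return (long)ru.ru_maxrss;  /* KiB on Linux */
}

/* Touch `mib` MiB of heap so it is counted against RSS, then free it.
 * Peak RSS stays at the high-water mark even after the free. */
void touch_mib(int mib) {
    size_t n = (size_t)mib << 20;
    char *p = malloc(n);
    if (p) {
        memset(p, 1, n);
        free(p);
    }
}
```

Externally, `top` or `/usr/bin/time -v` report the same figure without modifying the program.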
Thank you again for the quick reply. I regularly experience out-of-memory issues with SLURM (but this is the first time in the minimap/samtools area); usually the OS oom-killer kicks in, which is reflected in the logging and the process exit code. That doesn't happen in this case. We've processed many samples with samtools 1.17 and this is the only dataset causing an issue.
I'll try to provide you with as much debugging information as possible based on your comments and come back afterwards.
Hi @jkbonfield,
Apologies for the late reply. I'm able to reproduce the issue using the output of `samtools sort`, thus cutting out some steps.
Some additional findings (all examples are running with a memory limit of 64GB).
0) Converting the `samtools sort` `.bam` output to `.cram` succeeds
1) Writing `markdup` output to `.cram` still fails:
```
[E::cram_flush_thread] Call to cram_encode_container failed
[E::cram_flush_thread] Call to cram_encode_container failed
[E::cram_flush_thread] Call to cram_encode_container failed
[E::cram_flush_thread] Call to cram_encode_container failed
[E::libdeflate_deflate] Call to libdeflate_alloc_compressor failed
[E::cram_flush_thread] Call to cram_encode_container failed
samtools markdup: error, writing output failed.
[E::libdeflate_deflate] Call to libdeflate_alloc_compressor failed
[E::cram_flush_thread] Call to cram_encode_container failed
[E::cram_flush_thread] Call to cram_encode_container failed
samtools markdup: error closing output file.
```
```
JobID         Elapsed   AllocCPUS  AveCPU    ReqMem  MaxVMSize  MaxRSS  MaxDiskRead  MaxDiskWrite
140883.batch  00:02:17  8  00:04:00  36389772K  34599790K  20117.32M  552.41M
```
2) Writing `markdup` output to `.bam` instead of `.cram` fails as well, but with a different error:
```
.command.sh: line 4: 19081 Segmentation fault ${CMD_SAMTOOLS} markdup -@ "8" --reference "GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz" --write-index samtools_sort_out.bam samtools_sort_markdup_out.bam
```
```
JobID         Elapsed   AllocCPUS  AveCPU    ReqMem  MaxVMSize  MaxRSS  MaxDiskRead  MaxDiskWrite
140882.batch  00:01:55  8  00:03:44  30117576K  28294062K  16622.90M  950.76M
```
Both calls fail after about ~4 minutes and both stay well within their memory limit (~32 GB < 64 GB).
```
 59G  samtools_sort_out.bam
7.1G  samtools_sort_out.cram
1.5G  samtools_sort_markdup_out.bam    <-- amount of data written before process exits with error
609M  samtools_sort_markdup_out.cram   <-- amount of data written before process exits with error
```
The input R1 and R2 fastq.gz files are ~4.5GB each.
Since .bam input/output fails as well I guess that #1613 is not relevant for this issue?
Any suggestions on how to proceed with pinpointing this issue are highly appreciated.
Greetings, @dennishendriksen
The segmentation fault when writing BAM suggests that the root cause may be a memory bug in `markdup`. If your run produced a `core` file, you could inspect it with `gdb` to see where it crashed. You should just need to run `gdb /path/to/samtools core`, then type `bt` to get a backtrace.
If you don't have a core file, you could try running it under `gdb` by hand. You'd want to run:

```
gdb -args /path/to/samtools markdup -@ "8" --reference "GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz" --write-index samtools_sort_out.bam samtools_sort_markdup_out.bam
```
Then type `r` to make gdb start running the program, and `bt` after it catches the segmentation fault.
Even better, if you could do it, would be to try building a copy of samtools with address sanitizer turned on. The quickest way to do this is to build it using `make CC='cc -fsanitize=address'`. You may need to set the environment variable `ASAN_OPTIONS=allow_addr2line=true` before running `samtools` to make it print source locations in its stack traces. The only problem is that SLURM may not like the resulting program, because it will try to use about 1 TB of virtual memory (the real requirement is actually much less, but bigger than for a copy of samtools without the extra debugging). But if you can find a machine that it works on, it should stop and print a stack trace at the exact point where the illegal memory access happens.
If you could try one of these and let us know what it prints, it'll hopefully give more clues to work on.
While trying to run `gdb` I ran into this issue:

```
Reading symbols from /usr/bin/apptainer...(no debugging symbols found)...done.
Missing separate debuginfos, use: debuginfo-install apptainer-1.1.3-1.el7.x86_64
```
because we are running samtools using a Singularity/Apptainer image (source: here, binary: here).
Installing the debuginfos unfortunately fails:

```
debuginfo-install apptainer-1.1.3-1.el7.x86_64
Traceback (most recent call last):
  File "/usr/bin/debuginfo-install", line 21, in <module>
    import yum
<cut>
  File "/usr/lib/python2.7/site-packages/urlgrabber/grabber.py", line 550, in <module>
    import pycurl
ImportError: /usr/lib64/python2.7/site-packages/pycurl.so: undefined symbol: CRYPTO_num_locks
```
and I expect it to fail, even after resolving this issue, due to a permissions issue.
I decided to switch to an environment module available on the same system, built using EasyBuild (source: here), which unfortunately can't be used as a workaround for this issue in our workflows, and to my surprise I could not reproduce the issue: `markdup` finishes successfully.
Does this give you any more clues on the cause of this issue?
In the meantime I'll try to build a custom copy of samtools with address sanitizer turned on and report back the results here.
I can't reproduce the issue with a custom build of samtools with address sanitizer either; it seems to be specific to the version built as part of the Singularity image, which is based on Alpine Linux. I've switched the base image from Alpine to Ubuntu and `markdup` exits successfully on the data that originally caused this issue.
I can do some more analysis if you would like to pinpoint the issue to possibly fix something in SAMtools, but from my side the case can be closed.
Thank you for your support, quick replies and great work on your tools, highly appreciated!
We do use Alpine Linux when we test releases, but I don't think we put anything large through `markdup`. It might be worth us taking a look.
My tests with Alpine Linux have not displayed any memory issues yet, even with a large cram file.
I have tested markdup with the provided singularity image and still no obvious errors. I don't think we can reproduce this bug.
Hello @whitwham,
Unfortunately I'm not allowed to share the data to reproduce the issue. I've tried reproducing the issue with sharable data but without success. I'll report here when I find time to debug samtools using the details provided by @daviesrob. Feel free to close the issue.
Greetings, @dennishendriksen
Closing as we are unable to reproduce error.
Hello htslib team,
Our analysis pipeline fails for specific inputs (which unfortunately I cannot share due to its sensitive nature):
The error seems to be in the `markdup` command and to be `cram`-specific. We're running Samtools 1.17. I'm not sure if this is the right repository to post this issue, since the error seems to originate from this library as used by Samtools.
Any thoughts?