samtools / htslib

C library for high-throughput sequencing data formats

markdup `[E::cram_flush_thread] Call to cram_encode_container failed` error #1621

Closed: dennishendriksen closed this issue 1 year ago

dennishendriksen commented 1 year ago

Hello htslib team,

Our analysis pipeline fails for specific inputs (which unfortunately I cannot share due to their sensitive nature):

minimap2 -t 8 -a -x sr GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz.mmi R1.fastq.gz R2.fastq.gz \
| samtools fixmate -u -m - - \
| samtools sort -u -@ 8 - \
| samtools markdup -@ 8 --reference GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz --write-index - my.cram

The error seems to come from the markdup command and to be CRAM specific:

[M::main] Version: 2.24-r1122
[M::main] CMD: /usr/local/bin/minimap2 -t 8 -a -x sr GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz.mmi R1.fastq.gz R2.fastq.gz
[M::main] Real time: 26286.986 sec; CPU: 207449.276 sec; Peak RSS: 8.171 GB
[bam_sort_core] merging from 8 files and 8 in-memory blocks...
[E::cram_flush_thread] Call to cram_encode_container failed
[E::libdeflate_deflate] Call to libdeflate_alloc_compressor failed
[E::cram_flush_thread] Call to cram_encode_container failed
[E::libdeflate_deflate] Call to libdeflate_alloc_compressor failed
[E::cram_flush_thread] Call to cram_encode_container failed
[E::libdeflate_deflate] Call to libdeflate_alloc_compressor failed
[E::cram_flush_thread] Call to cram_encode_container failed
samtools markdup: error, writing output failed.

[E::cram_flush_thread] Call to cram_encode_container failed
[E::cram_flush_thread] Call to cram_encode_container failed
[E::libdeflate_deflate] Call to libdeflate_alloc_compressor failed
[E::cram_flush_thread] Call to cram_encode_container failed
samtools markdup: error closing output file.

We're running Samtools 1.17. I'm not sure whether this is the right repository for this issue, since the error seems to originate from this library, which Samtools uses.

Any thoughts?

jkbonfield commented 1 year ago

Although it's markdup that is failing, it sounds like it may not be markdup specific.

The error states that the libdeflate_alloc_compressor function failed, and looking at the libdeflate source this happens either because the compression level is < 0 or > 12 (unlikely) or because the libdeflate_aligned_malloc function it calls fails, which most likely boils down to a failed malloc.

Do you know if you have any memory limits set in this environment, either within the shell (ulimit -a will report this) or as options of a batch submission job? It's possible the failure here is just going over a system-imposed memory limit.
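
For example (a sketch; which limits matter varies by system):

# report all shell-imposed limits; watch "max memory size" and "virtual memory"
ulimit -a
# print just the virtual memory limit, in KB ("unlimited" if none is set)
ulimit -v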

If memory limits are out of your control, then using fewer threads will reduce memory, as will reducing the number of records per CRAM slice with an option such as -O cram,seqs_per_slice=3000 (see the sketch below). Note there was also a bug found recently where high numbers of secondary alignments of long-read data with SEQ "*" and long aux. tags would lead to very large CRAM containers, as they don't trigger the default maximum base-count limits. That was fixed in develop: https://github.com/samtools/htslib/pull/1613
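
As a sketch, the final step of the pipeline above might then become (the thread count and slice size here are illustrative, not recommendations):

# fewer worker threads and smaller CRAM slices both reduce peak memory
samtools markdup -@ 2 \
    --reference GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz \
    -O cram,seqs_per_slice=3000 \
    --write-index - my.cram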

dennishendriksen commented 1 year ago

Hi @jkbonfield,

Thank you for the quick reply. I'm indeed running with a memory limit defined in a SLURM batch job. Rerunning with a limit of 32GB instead of 16GB results in a slightly different error:

[M::main] Version: 2.24-r1122                                                                                         
[M::main] CMD: /usr/local/bin/minimap2 -t 8 -a -x sr GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz.mmi R1.fastq.gz R2.fastq.gz                                                                      
[M::main] Real time: 26938.966 sec; CPU: 212805.794 sec; Peak RSS: 8.148 GB                                           
[bam_sort_core] merging from 8 files and 8 in-memory blocks...                                                        
[E::cram_flush_thread] Call to cram_encode_container failed                                                           
[E::cram_flush_thread] Call to cram_encode_container failed                                                           
[E::cram_flush_thread] Call to cram_encode_container failed                                                           
samtools markdup: error, writing output failed.                                                                       

[E::cram_flush_thread] Call to cram_encode_container failed                                                           
[E::cram_flush_thread] Call to cram_encode_container failed                                                           
samtools markdup: error closing output file.

The data is paired-end Illumina short-read data (two ~4GB .fastq.gz files).

Do you have any additional thoughts based on this slightly different error? In the meantime I will try rerunning with 64GB.

jkbonfield commented 1 year ago

The fact it dies somewhere else implies it's still a memory issue. Does SLURM not give you any diagnostics at the end as to why the job failed, or a memory usage report?
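
For example, something along these lines should pull a per-step memory report out of SLURM's accounting (a sketch; <jobid> is a placeholder):

# peak virtual and resident memory per job step, as recorded by SLURM
sacct -j <jobid> --format=JobID,Elapsed,ReqMem,MaxVMSize,MaxRSS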

I'm also wondering quite why the memory usage is so high. It's possible you're hitting the bug we fixed in #1613, although given how long it's been that way without complaint it's pretty unlikely. It's also possible we have a memory leak somewhere, but again it'd need to be something very specific to your data as we haven't detected it elsewhere.

For debugging it may be worth outputting a BAM file from the pipeline and then testing the memory usage of a BAM-to-CRAM conversion. That would also cut out nearly all of the time required, permitting faster turnaround of experiments. We need to determine whether a specific region of the alignment is triggering the high memory usage (in which case adjusting the parameters may resolve it), or whether memory just grows gradually over time, indicating a memory leak (something that can be inspected manually with "top" or similar).
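
A minimal sketch of that experiment, with hypothetical file names; the "Maximum resident set size" line in GNU time's report gives the peak memory:

# measure the memory of just the BAM-to-CRAM conversion
/usr/bin/time -v samtools view -@ 8 -C \
    -T GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz \
    -o test.cram sorted.bam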

dennishendriksen commented 1 year ago

Thank you again for the quick reply. I regularly experience out-of-memory issues with SLURM (though this is the first time in the minimap/samtools area); usually the OS oom-killer kicks in, which is reflected in the logging and the process exit code. That doesn't happen in this case. We've processed many samples with samtools 1.17 and this is the only dataset causing an issue.

I'll try to provide you with as much debugging information as possible based on your comments and come back afterwards.

dennishendriksen commented 1 year ago

Hi @jkbonfield,

Apologies for the late reply. I'm able to reproduce the issue using the output of samtools sort, thus cutting out some steps. Some additional findings (all examples run with a memory limit of 64GB):

0) Converting the samtools sort .bam output to .cram succeeds

1) Writing markdup output to .cram still fails

[E::cram_flush_thread] Call to cram_encode_container failed
[E::cram_flush_thread] Call to cram_encode_container failed
[E::cram_flush_thread] Call to cram_encode_container failed
[E::cram_flush_thread] Call to cram_encode_container failed
[E::libdeflate_deflate] Call to libdeflate_alloc_compressor failed
[E::cram_flush_thread] Call to cram_encode_container failed
samtools markdup: error, writing output failed.

[E::libdeflate_deflate] Call to libdeflate_alloc_compressor failed
[E::cram_flush_thread] Call to cram_encode_container failed
[E::cram_flush_thread] Call to cram_encode_container failed
samtools markdup: error closing output file.

================================================================================================================
JobID         Elapsed   AllocCPUS  AveCPU    ReqMem  MaxVMSize  MaxRSS     MaxDiskRead  MaxDiskWrite
140883.batch  00:02:17  8          00:04:00          36389772K  34599790K  20117.32M    552.41M

2) Writing markdup output to .bam instead of .cram fails as well but with a different error:

.command.sh: line 4: 19081 Segmentation fault      ${CMD_SAMTOOLS} markdup -@ "8" --reference "GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz" --write-index samtools_sort_out.bam samtools_sort_markdup_out.bam
================================================================================================================
JobID         Elapsed   AllocCPUS  AveCPU    ReqMem  MaxVMSize  MaxRSS     MaxDiskRead  MaxDiskWrite
140882.batch  00:01:55  8          00:03:44          30117576K  28294062K  16622.90M    950.76M

Both calls fail after about 4 minutes, and both stay well within their memory limit (~32GB < 64GB).

59G samtools_sort_out.bam
7.1G samtools_sort_out.cram
1.5G samtools_sort_markdup_out.bam <-- amount of data written before process exits with error
609M samtools_sort_markdup_out.cram <-- amount of data written before process exits with error

The input R1 and R2 fastq.gz files are ~4.5GB each.

Since .bam input/output fails as well, I guess #1613 is not relevant for this issue?

Any suggestions on how to proceed with pinpointing this issue are highly appreciated.

Greetings, @dennishendriksen

daviesrob commented 1 year ago

The segmentation fault when writing bam suggests that the root cause may be a memory bug in markdup. If your run produced a core file, you could inspect it with gdb to see where it crashed. You should just need to run gdb /path/to/samtools core, then type bt to get a backtrace.

If you don't have a core file, you could try running it under gdb by hand. You'd want to run:

gdb -args /path/to/samtools markdup -@ "8" --reference "GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz" --write-index samtools_sort_out.bam samtools_sort_markdup_out.bam

Then type r to make gdb start running the program, and bt after it catches the segmentation fault.
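
Put together, a sketch of the session (the (gdb) lines are typed at gdb's prompt):

gdb -args /path/to/samtools markdup -@ "8" \
    --reference "GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz" \
    --write-index samtools_sort_out.bam samtools_sort_markdup_out.bam
(gdb) r     # run until the segmentation fault is caught
(gdb) bt    # print the backtrace at the crash site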

Even better, if you could do it, would be to build a copy of samtools with address sanitizer turned on. The quickest way to do this is to build it using make CC='cc -fsanitize=address'. You may need to set the environment variable ASAN_OPTIONS=allow_addr2line=true before running samtools to make it print source locations in its stack traces. The only problem is that SLURM may not like the resulting program, because it'll try to use about 1TB of virtual memory (the real requirement is actually much less, but bigger than for a copy of samtools without the extra debugging). But if you can find a machine where it works, it should stop and print a stack trace at the exact point where the illegal memory access happens.
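
A sketch of that build, assuming a samtools release tarball (which bundles HTSlib):

./configure
make CC='cc -fsanitize=address'
# then rerun the failing command under the sanitizer:
ASAN_OPTIONS=allow_addr2line=true ./samtools markdup ...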

If you could try one of these and let us know what it prints, it'll hopefully give more clues to work on.

dennishendriksen commented 1 year ago

While trying to run gdb I ran into this issue:

Reading symbols from /usr/bin/apptainer...Reading symbols from /usr/bin/apptainer...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Missing separate debuginfos, use: debuginfo-install apptainer-1.1.3-1.el7.x86_64

because we are running samtools using a Singularity/Apptainer image (source: here, binary: here).

Unfortunately, installing the debuginfos fails:

debuginfo-install apptainer-1.1.3-1.el7.x86_64
Traceback (most recent call last):
  File "/usr/bin/debuginfo-install", line 21, in <module>
    import yum
<cut>
  File "/usr/lib/python2.7/site-packages/urlgrabber/grabber.py", line 550, in <module>
    import pycurl
ImportError: /usr/lib64/python2.7/site-packages/pycurl.so: undefined symbol: CRYPTO_num_locks

and I expect it to fail even after resolving this issue, due to a permissions issue.

I decided to switch to an environment module built using EasyBuild (source: here) that is available on the same system (but unfortunately can't be used as a workaround for this issue in our workflows), and to my surprise I could not reproduce the issue: markdup finishes successfully.

Does this give you any more clues on the cause of this issue?

In the meantime I'll try to build a custom copy of samtools with address sanitizer turned on and report the results back here.

dennishendriksen commented 1 year ago

I can't reproduce the issue with a custom build of samtools with address sanitizer either; it seems to be specific to the version built as part of the singularity image, which is based on Alpine Linux. I've switched the base image from Alpine to Ubuntu and markdup exits successfully on the data that originally caused this issue.

I can do some more analysis if you would like to pinpoint the issue and possibly fix something in SAMtools, but from my side the case can be closed.

Thank you for your support, quick replies and great work on your tools, highly appreciated!

whitwham commented 1 year ago

We do use Alpine Linux when we test releases but I don't think we put anything large through markdup. It might be worth us taking a look.

whitwham commented 1 year ago

My tests with Alpine Linux have not displayed any memory issues yet, even with a large cram file.

whitwham commented 1 year ago

I have tested markdup with the provided singularity image and still see no obvious errors. I don't think we can reproduce this bug.

dennishendriksen commented 1 year ago

Hello @whitwham,

Unfortunately I'm not allowed to share the data that reproduces the issue. I've tried reproducing it with shareable data, but without success. I'll report here when I find time to debug samtools using the details provided by @daviesrob. Feel free to close the issue.

Greetings, @dennishendriksen

whitwham commented 1 year ago

Closing as we are unable to reproduce the error.