samtools / htslib

C library for high-throughput sequencing data formats

segfault during `cram_generate_reference` #1691

Closed OctavioGalland closed 7 months ago

OctavioGalland commented 7 months ago

Summary

Segfault (read from a NULL address) in cram_generate_reference when converting a crafted SAM file to CRAM against a FASTA reference.

Environment

Built using LLVM 14 with ASAN on Ubuntu 22.04

How to reproduce

Build htslib and samtools with ASAN from the latest commit, like so:

git clone --recursive https://github.com/samtools/htslib
cd htslib
autoreconf -i
CC=clang-14 CXX=clang++-14 CFLAGS="-fsanitize=address -g" CXXFLAGS="-fsanitize=address -g" LDFLAGS="-fsanitize=address -g" ./configure
make -j$(nproc)

cd ..
git clone --recursive https://github.com/samtools/samtools
cd samtools
autoheader
autoconf -Wno-syntax
CC=clang-14 CXX=clang++-14 CFLAGS="-fsanitize=address -g -I$(pwd)/../htslib" CXXFLAGS="-fsanitize=address -g -I$(pwd)/../htslib" LDFLAGS="-fsanitize=address -g -L$(pwd)/../htslib" ./configure
make -j$(nproc)

From within the samtools directory, create the PoC file and reproduce with:

echo -ne "QFNRCVNOOmMxCUxOOjEUCnMwCTAJUDEwMdPT0zAx09PT09MQ08LT09PT09PTENPC09PT09MJMAkJ
ME0JKgkwCQkJCgkJQUMpKskqCg==" | base64 -d > poc
./samtools view -C -T ../htslib/test/c2.fa poc

Which on my setup outputs:

[W::cram_get_ref] Reference file given, but ref 'c1' not present
[W::cram_get_ref] Failed to populate reference for id 0
[W::cram_write_SAM_hdr] No M5 tags present and could not find reference
[W::cram_write_SAM_hdr] Enabling embed_ref=2 option
[W::cram_write_SAM_hdr] NOTE: the CRAM file will be bigger than using an external reference
CRAM-�\��vvr@SQ SN:c1   LN:1
@PG ID:samtools PN:samtools VN:1.18-21-g528e1b2 CL:./samtools view -C -T ../htslib/test/c2.fa poc
X�}�??}�Y�[W::sam_parse1] unrecognized reference name "P101\xD3\xD3\xD301\xD3\xD3\xD3\xD3"...; treated as unmapped
[W::sam_parse1] empty query name
[W::sam_read1_sam] Parse error at line 3
AddressSanitizer:DEADLYSIGNAL
=================================================================
==11136==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x5647d007e38a bp 0x7ffe49270f70 sp 0x7ffe49270c60 T0)
==11136==The signal is caused by a READ memory access.
==11136==Hint: address points to the zero page.
    #0 0x5647d007e38a in cram_generate_reference /home/octavio/htslib/cram/cram_encode.c:1675:14
    #1 0x5647d0071ee5 in cram_encode_container /home/octavio/htslib/cram/cram_encode.c:1876:17
    #2 0x5647d00e496c in cram_flush_container /home/octavio/htslib/cram/cram_io.c:4128:14
    #3 0x5647d00e5795 in cram_flush_container_mt /home/octavio/htslib/cram/cram_io.c:4280:16
    #4 0x5647d00f0742 in cram_flush /home/octavio/htslib/cram/cram_io.c:5431:19
    #5 0x5647cff7f2e7 in hts_flush /home/octavio/htslib/hts.c:1667:16
    #6 0x5647cfef8084 in vprint_error_core /home/octavio/samtools/sam_utils.c:48:26
    #7 0x5647cfef845c in print_error_errno /home/octavio/samtools/sam_utils.c:71:5
    #8 0x5647cfd01472 in stream_view /home/octavio/samtools/sam_view.c:762:9
    #9 0x5647cfcfcbe8 in main_samview /home/octavio/samtools/sam_view.c:1363:15
    #10 0x5647cfd7beed in main /home/octavio/samtools/bamtk.c:244:55
    #11 0x7fee11629d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #12 0x7fee11629e3f in __libc_start_main csu/../csu/libc-start.c:392:3
    #13 0x5647cfc20b24 in _start (/home/octavio/samtools/samtools+0xb0b24) (BuildId: 7078ea94d4e08689f85e1df47e2d609c021d2440)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /home/octavio/htslib/cram/cram_encode.c:1675:14 in cram_generate_reference
==11136==ABORTING
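
To see what sam_parse1 is complaining about (the unrecognized reference name and the parse error at line 3), the PoC from the reproduction steps can be decoded and dumped byte by byte. This is a minimal sketch, assuming a system with the usual `base64`, `wc` and `od` utilities; the base64 payload is copied unchanged from the command above and it writes the same `poc` file.

```shell
# Decode the PoC. The payload is the one from the reproduction steps;
# base64 -d ignores the newline between the two chunks.
printf '%s\n' \
  "QFNRCVNOOmMxCUxOOjEUCnMwCTAJUDEwMdPT0zAx09PT09MQ08LT09PT09PTENPC09PT09MJMAkJ" \
  "ME0JKgkwCQkJCgkJQUMpKskqCg==" | base64 -d > poc

# Size of the decoded file in bytes.
wc -c < poc

# Character-level dump: tabs and newlines appear as \t and \n,
# and non-printable bytes appear as octal escapes.
od -c poc
```

The dump makes the SAM field boundaries (tabs) and line boundaries visible, which helps match the input against the sam_parse1 warnings in the log.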

jkbonfield commented 7 months ago

Thank you for these and the other fuzz testing issues. It raises an interesting point about the robustness of our own fuzzing too.

I thought our fuzzing already did read-write testing. It does, but while the input can be fuzzed as any suitable data format, the output was always SAM, so the CRAM writer has never been fuzzed. That's an oversight, and something we can fix in the fuzz testing harness.
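
As a rough command-line illustration of that gap (not the project's actual fuzz harness, which is not shown in this thread), one input can be pushed through each output format in turn so that the BAM and CRAM writers are exercised as well as the SAM one. The function name `try_all_output_formats` is hypothetical; the sketch assumes a `samtools` binary on the PATH and uses only the `view` options `-O`, `-T` and `-o`.

```shell
# Hypothetical helper: write one input in every output format and report
# whether samtools exited cleanly. Output data is discarded; stderr is
# left untouched so that any sanitizer report stays visible.
try_all_output_formats() {
    input=$1   # SAM/BAM/CRAM file to read, e.g. a fuzzer-generated case
    ref=$2     # reference FASTA, needed for CRAM output
    for fmt in sam bam cram; do
        if samtools view -O "$fmt" -T "$ref" -o /dev/null "$input"; then
            echo "$fmt: ok"
        else
            echo "$fmt: failed (exit status $?)"
        fi
    done
}
```

With the files from this report it would be called as `try_all_output_formats poc ../htslib/test/c2.fa`. Going by the trace above, the expectation is that only the CRAM case reaches cram_generate_reference, though that has not been checked here.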

I've made a single branch for all the CRAM-related issues. I still have a few memory leaks to tidy up, but I hope to have a PR tomorrow.