samtools / htslib

C library for high-throughput sequencing data formats

Large memory leak involving libcurl when attempting to use htslib C library on remote BAM #1678

Status: Closed (rlorigro closed this issue 9 months ago)

rlorigro commented 9 months ago

Hi,

I am attempting to use the htslib C interface in my project, and both AddressSanitizer and Valgrind report a large leak from htslib. Here is the more detailed output, from Valgrind:

==10353== 
==10353== HEAP SUMMARY:
==10353==     in use at exit: 5,611,520 bytes in 1 blocks
==10353==   total heap usage: 451,489 allocs, 451,488 frees, 70,065,150 bytes allocated
==10353== 
==10353== 5,611,520 bytes in 1 blocks are possibly lost in loss record 1 of 1
==10353==    at 0x484DCD3: realloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==10353==    by 0x48C0401: ??? (in /usr/lib/x86_64-linux-gnu/libcurl.so.4.7.0)
==10353==    by 0x48EC5B7: ??? (in /usr/lib/x86_64-linux-gnu/libcurl.so.4.7.0)
==10353==    by 0x4906B00: ??? (in /usr/lib/x86_64-linux-gnu/libcurl.so.4.7.0)
==10353==    by 0x48E8193: ??? (in /usr/lib/x86_64-linux-gnu/libcurl.so.4.7.0)
==10353==    by 0x48EB3BD: curl_multi_perform (in /usr/lib/x86_64-linux-gnu/libcurl.so.4.7.0)
==10353==    by 0x1AB631: wait_perform (hfile_libcurl.c:733)
==10353==    by 0x1ACD67: restart_from_position (hfile_libcurl.c:1059)
==10353==    by 0x1AE103: libcurl_seek (hfile_libcurl.c:977)
==10353==    by 0x19F7AD: hseek (hfile.c:469)
==10353==    by 0x18E6D0: bgzf_check_EOF_common (bgzf.c:1536)
==10353==    by 0x190167: bgzf_check_EOF (bgzf.c:2149)
==10353== 
==10353== LEAK SUMMARY:
==10353==    definitely lost: 0 bytes in 0 blocks
==10353==    indirectly lost: 0 bytes in 0 blocks
==10353==      possibly lost: 5,611,520 bytes in 1 blocks
==10353==    still reachable: 0 bytes in 0 blocks
==10353==         suppressed: 0 bytes in 0 blocks
==10353== 
==10353== For lists of detected and suppressed errors, rerun with: -s
==10353== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

For this test I have used this public GS URI: gs://genomics-public-data/platinum-genomes/bam/NA12889_S1.bam

When I run the same code on a local BAM path (stored on my disk) I get a clean result from valgrind:

==11976== 
==11976== HEAP SUMMARY:
==11976==     in use at exit: 0 bytes in 0 blocks
==11976==   total heap usage: 1,367 allocs, 1,367 frees, 551,570 bytes allocated
==11976== 
==11976== All heap blocks were freed -- no leaks are possible
==11976== 
==11976== For lists of detected and suppressed errors, rerun with: -s
==11976== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

It appears to be coming from libcurl, so I asked the curl authors, and they are pretty confident that it is the result of a misuse of the library. However, it could easily be something I have done incorrectly, as I am not an expert in C-style memory management. My code is here:

#include <htslib/sam.h>   // hts_open, sam_hdr_read, sam_read1, ...

#include <filesystem>
#include <iostream>
#include <stdexcept>
#include <string>

using namespace std;
using std::filesystem::path;

void read_bam(path bam_path){
    samFile* bam_file = nullptr;
    bam_hdr_t* bam_header = nullptr;
    hts_idx_t* bam_index = nullptr;
    bam1_t* alignment = bam_init1();

    if ((bam_file = hts_open(bam_path.string().c_str(), "r")) == nullptr) {
        bam_destroy1(alignment);
        throw runtime_error("ERROR: Cannot open bam file: " + bam_path.string());
    }

    // Load the BAM index. Note that sam_read1 below reads sequentially and
    // does not use the index; it is only needed for region queries.
    if ((bam_index = sam_index_load(bam_file, bam_path.string().c_str())) == nullptr) {
        bam_destroy1(alignment);
        hts_close(bam_file);
        throw runtime_error("ERROR: Cannot open index for bam file: " + bam_path.string() + "\n");
    }

    // Read the BAM header
    if ((bam_header = sam_hdr_read(bam_file)) == nullptr){
        hts_idx_destroy(bam_index);
        bam_destroy1(alignment);
        hts_close(bam_file);
        throw runtime_error("ERROR: Cannot open header for bam file: " + bam_path.string() + "\n");
    }

    int i = 0;
    while (sam_read1(bam_file, bam_header, alignment) >= 0){

        // Copy contents of the C objects into strings. core.tid is -1 for
        // unmapped reads, so guard before indexing target_name.
        string read_name = bam_get_qname(alignment);
        string ref_name = (alignment->core.tid >= 0)
                ? bam_header->target_name[alignment->core.tid] : "*";

        cerr << read_name << ' ' << ref_name << '\n';

        if (++i == 10){
            break;
        }
    }

    hts_close(bam_file);
    bam_hdr_destroy(bam_header);
    bam_destroy1(alignment);
    hts_idx_destroy(bam_index);
}

Could you let me know whether this is a simple mistake on my end or something more serious? I need to run a function like this repeatedly over many windows, so this leak would get out of hand pretty quickly.
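As an aside, manual cleanup across several exit paths like this is easy to get wrong; `std::unique_ptr` with a custom deleter runs the matching destroy function automatically, including during stack unwinding when a throw occurs. A minimal self-contained sketch of the pattern (the types and names are stand-ins for illustration, not htslib's own):

```cpp
#include <memory>
#include <stdexcept>

// Stand-in for an htslib handle such as samFile* or bam1_t*; in real code
// the deleters would be hts_close, bam_hdr_destroy, bam_destroy1, etc.
struct Handle { bool* closed; };

static void close_handle(Handle* h) {
    *h->closed = true;   // record that cleanup ran
    delete h;
}

// Returns true iff the handle was closed even though an exception was thrown.
bool cleanup_ran_on_throw() {
    bool closed = false;
    try {
        std::unique_ptr<Handle, void (*)(Handle*)> guard(new Handle{&closed},
                                                         close_handle);
        throw std::runtime_error("simulated open failure");
    } catch (const std::runtime_error&) {
        // guard went out of scope during stack unwinding,
        // so close_handle has already run
    }
    return closed;
}
```

With this pattern the early `throw`s in a function like `read_bam` cannot leak the handles that were already allocated.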

Thanks

jmarshall commented 9 months ago

You don't say what platform you're using.

Elsewhere, you say you are using Ubuntu 22.04. I was able to reproduce this on ARM Ubuntu 22.04; with the original 22.04 libcurl3-gnutls 7.81.0-1 (rather than the current 7.81.0-1ubuntu1.13), the remaining symbols in the stack trace are present:

==153381== 1,402,880 bytes in 1 blocks are definitely lost in loss record 1 of 1
==153381==    at 0x486A190: realloc (in /usr/libexec/valgrind/vgpreload_memcheck-arm64-linux.so)
==153381==    by 0x494B9E3: dyn_nappend.lto_priv.0 (dynbuf.c:98)
==153381==    by 0x49752F3: UnknownInlinedFun (dynbuf.c:164)
==153381==    by 0x49752F3: pausewrite (sendf.c:508)
==153381==    by 0x49897FF: UnknownInlinedFun (transfer.c:842)
==153381==    by 0x49897FF: Curl_readwrite (transfer.c:1226)
==153381==    by 0x497161B: multi_runsingle (multi.c:2360)
==153381==    by 0x49740DF: curl_multi_perform (multi.c:2638)
==153381==    by 0x19A55B: wait_perform (hfile_libcurl.c:735)
==153381==    by 0x19BA17: restart_from_position (hfile_libcurl.c:1061)
==153381==    by 0x19CA5B: libcurl_seek (hfile_libcurl.c:979)
==153381==    by 0x18E10B: hseek (hfile.c:469)
==153381==    by 0x17D483: bgzf_check_EOF_common (bgzf.c:1542)
==153381==    by 0x17EECB: bgzf_check_EOF (bgzf.c:2155)

This localises the unfreed allocation within libcurl. (Note also that the size of the problematic allocation changes each time you run it!)

Running with vanilla upstream libcurl 7.81.0 or with current upstream master, and without any other changes, there is no leak reported. (I was also unable to reproduce any leak on macOS.)

So the problem would appear to be caused by Ubuntu's patches to their libcurl package.

daviesrob commented 9 months ago

Here's the full (non-inlined) trace from an x86_64 build of libcurl 7.81.0 with the Ubuntu patches from 7.81.0-1ubuntu1.13:

==179979== HEAP SUMMARY:
==179979==     in use at exit: 6,312,960 bytes in 2 blocks
==179979==   total heap usage: 253,242 allocs, 253,240 frees, 117,189,900 bytes allocated
==179979== 
==179979== 6,312,960 bytes in 2 blocks are definitely lost in loss record 1 of 1
==179979==    at 0x484DCD3: realloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==179979==    by 0x49CA5AD: dyn_nappend (dynbuf.c:98)
==179979==    by 0x49CA6CC: Curl_dyn_addn (dynbuf.c:164)
==179979==    by 0x4A0F1F2: pausewrite (sendf.c:508)
==179979==    by 0x4A0F297: chop_write (sendf.c:539)
==179979==    by 0x4A0F591: Curl_client_write (sendf.c:645)
==179979==    by 0x4A2D746: readwrite_data (transfer.c:842)
==179979==    by 0x4A2E209: Curl_readwrite (transfer.c:1226)
==179979==    by 0x4A02F86: multi_runsingle (multi.c:2360)
==179979==    by 0x4A03796: curl_multi_perform (multi.c:2638)
==179979==    by 0x216894: wait_perform (hfile_libcurl.c:733)
==179979==    by 0x217738: restart_from_position (hfile_libcurl.c:1059)
==179979== 
==179979== LEAK SUMMARY:
==179979==    definitely lost: 6,312,960 bytes in 2 blocks
==179979==    indirectly lost: 0 bytes in 0 blocks
==179979==      possibly lost: 0 bytes in 0 blocks
==179979==    still reachable: 0 bytes in 0 blocks
==179979==         suppressed: 0 bytes in 0 blocks

This was built with the Ubuntu configure options, notably --with-nghttp2. The issue in fact also affects the unpatched 7.81.0, but not libcurl master. The problem is that when using HTTP/2, libcurl buffers up a bit of data while HTSlib is seeking to the end of the file for the EOF block check, and that data never gets cleaned up.

After hunting around a bit, I found issue curl/curl#10971 (Memory leak when cancelling paused http 2 transfer) which was fixed by the commit https://github.com/curl/curl/commit/81b2b577df. The scenario in that issue is pretty much exactly what HTSlib's restart_from_position() does, and the GCS server is dishing out HTTP/2.

So it's actually a libcurl problem, which has been fixed and will eventually make it into Ubuntu (although probably not 22.04).

rlorigro commented 9 months ago

I see, thank you for finding this so quickly. I should have known the leak wasn't from htslib :)

Is there a way to configure HTSlib to use a specific libcurl (one that I have compiled from source)? My build is driven by CMake, so it looks like this:

    # HTSLIB configure
    add_custom_target(
            BUILD_HTS
            ALL
            WORKING_DIRECTORY ${HTS_DIR}
            COMMAND pwd
            COMMAND autoreconf -i
            COMMAND ./configure --disable-lzma --without-libdeflate
            COMMAND $(MAKE) print-config
            COMMAND $(MAKE) prefix=${CMAKE_SOURCE_DIR}/external/htslib/ install
    )
jmarshall commented 9 months ago

Nice spotting, Rob — I didn't think to look at the configuration options. One advantage of RPMs is the single spec file makes it much easier for third parties to recreate distributions' packages…

daviesrob commented 9 months ago

You can point HTSlib at a specific libcurl by setting CPPFLAGS and LDFLAGS on the configure command line. See the INSTALL file for an example.
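A sketch of what that can look like, assuming the self-built libcurl was installed under an illustrative prefix (the `-Wl,-rpath` flag assumes a GNU-style linker; it makes the runtime loader pick up the same libcurl that was linked against):

```sh
# Illustrative prefix; point this at wherever your self-built curl was installed
CURL_PREFIX=$HOME/opt/curl

./configure --disable-lzma --without-libdeflate \
    CPPFLAGS="-I$CURL_PREFIX/include" \
    LDFLAGS="-L$CURL_PREFIX/lib -Wl,-rpath,$CURL_PREFIX/lib"
```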

rlorigro commented 9 months ago

Thanks

I was able to get this to work in CMake: it downloads and compiles curl 8.3.0, links HTSlib against it, and the remote BAM query no longer leaks when tested with Valgrind.

In case anyone else has this very niche problem: the working CMakeLists.txt is here
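For anyone sketching the same approach from scratch, the shape is roughly this (illustrative only; the version, URL, and target names are assumptions, not the poster's actual file): build curl with ExternalProject first, then feed its install prefix to HTSlib's configure through CPPFLAGS and LDFLAGS.

```cmake
include(ExternalProject)

# Illustrative only: build a known-good curl, then point HTSlib at it
set(CURL_PREFIX ${CMAKE_BINARY_DIR}/external/curl)

ExternalProject_Add(
        BUILD_CURL
        URL https://curl.se/download/curl-8.3.0.tar.gz
        PREFIX ${CURL_PREFIX}
        CONFIGURE_COMMAND <SOURCE_DIR>/configure --prefix=${CURL_PREFIX} --with-openssl
)

add_custom_target(
        BUILD_HTS
        ALL
        WORKING_DIRECTORY ${HTS_DIR}
        COMMAND autoreconf -i
        COMMAND ./configure --disable-lzma --without-libdeflate
                "CPPFLAGS=-I${CURL_PREFIX}/include"
                "LDFLAGS=-L${CURL_PREFIX}/lib -Wl,-rpath,${CURL_PREFIX}/lib"
        COMMAND $(MAKE) prefix=${CMAKE_SOURCE_DIR}/external/htslib/ install
)
add_dependencies(BUILD_HTS BUILD_CURL)
```

The rpath flag keeps the runtime loader from silently falling back to the system libcurl, which would reintroduce the leak.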