samtools / htslib

C library for high-throughput sequencing data formats

max_unpack BCF_UN_STR SIGSEGV, other strange behavior #848

Open jblachly opened 5 years ago

jblachly commented 5 years ago

Hi, thanks for htslib and for how responsive the developers have always been. Apologies in advance for the long text below.

While exploring potential speedups from setting bcf1_t.max_unpack as documented in vcf.h:

 int max_unpack; // Set to BCF_UN_STR, BCF_UN_FLT, or BCF_UN_INFO to boost performance of vcf_parse when some of the fields won't be needed

I found that by setting it to BCF_UN_STR (in contrast to all other values, which seem to work as expected), I can: A. get wrong behavior in some fields (in particular, with multiple FILTER values); B. generate occasional-to-guaranteed segfaults (with an invalid bcf1_t->d.n_flt value).

Since BCF_UN_STR by definition does not include the FILTER field, and the segfaults are related to n_flt/FILTER, I do suspect this is a bug, but perhaps I am making calls in the wrong order, or using max_unpack incorrectly -- please let me know! Below is a minimal reproducible example.


// copy one vcf file to another
#include <stdio.h>

#include <htslib/vcf.h>

#define MAX_UNPACK BCF_UN_STR

int main(void)
{
    vcfFile *infile = bcf_open("resources/gnomad.chrY.vcf", "r");
    vcfFile *outfile = bcf_open("/tmp/out.vcf", "w");

    bcf_hdr_t* infile_hdr = bcf_hdr_read(infile);
    bcf_hdr_t* outfile_hdr = bcf_hdr_dup(infile_hdr);

    // the line below increases the frequency of the segfault; likewise you can add a second header line to increase the probability further
    bcf_hdr_append(outfile_hdr, "##contig=<ID=chrZ,length=57227415,source=whatever>");

    bcf_hdr_write(outfile, outfile_hdr);

    bcf1_t *b = bcf_init1();
    b->max_unpack = MAX_UNPACK;

    while( bcf_read(infile, infile_hdr, b) >= 0 )
    {
        bcf1_t *bnew = bcf_dup(b);
        int ret = bcf_unpack(bnew, MAX_UNPACK);

        bcf_translate(outfile_hdr, infile_hdr, bnew);

        if (bnew->d.n_flt > 1) break;

        bcf_write(outfile, outfile_hdr, bnew);    // SIGSEGV in this call tree

        bcf_destroy(bnew);
        bcf_empty(b);
    }

    bcf_close(infile);
    bcf_close(outfile);

    return 0;
}

By removing bnew and only operating on b (without even calling bcf_empty), the problem disappears entirely and the program works as expected. bnew is included because this C code exactly recapitulates the calls made by an htslib wrapper in a higher-level OOP language, for which I do need to copy the bcf1_t.

Thanks again, and if I am calling the API in the wrong way or out of order, please let me know!

daviesrob commented 5 years ago

Setting bcf1_t.max_unpack makes bcf_read() skip over parts of the data, which means you will be missing some of the information needed by bcf_write(). It's intended to be used by programs that are not interested in the per-sample information and are not going to output the data in VCF or BCF format.

So don't set bcf1_t.max_unpack if you want to call bcf_write().

jblachly commented 5 years ago

@daviesrob Agreed, and I understand it is explicitly for cases where one wants to ignore parts of the record as a trade-off for speed.

My reading of bcf_read(), max_unpack, bcf1_t initialization, and the surrounding code is that they will leave those parts of bcf1_t uninitialized, but not in an invalid state. Thus writing the record out again should (and usually does) output '.' for FILTER, INFO, and FORMAT (if applicable).

Indeed, whereas the parent bcf1_t returned by bcf_read() remains valid with empty entries for anything beyond the unpacked values, the bnew returned from bcf_dup() is what's being corrupted at some point in the call stack.

Simply accessing the FILTER field (independent of calling bcf_write) is enough to cause a segmentation fault, because the memory within the copy of the bcf1_t is somehow invalid.

daviesrob commented 5 years ago

It could be a bug in bcf_dup(). Unfortunately I haven't been able to get a SEGV out of it yet, just other assorted bad behaviour. Would it be possible to come up with a small file that reproduces the crash that you get?

pd3 commented 5 years ago

@jblachly The max_unpack feature is intended for faster reading of VCFs with many samples, in the special case where the program does not need to access the FORMAT fields; see vcf_parse(). It is possible to make it work with BCFs, but the trade-off there is not significant and not worth the effort.

I suggest we update the documentation to include a comment saying that the max_unpack feature is a read-only feature which cannot be used for writing.

jblachly commented 5 years ago

Dear Petr:

We are not reading from BCFs. In addition, max_unpack has special-case code not only for ignoring FORMAT fields, but also for other elements of the record:

https://github.com/samtools/htslib/blob/ce7daf765c9748bbbe07e738542ed4d345f5c8fd/htslib/vcf.h#L351-L365

https://github.com/samtools/htslib/blob/ce7daf765c9748bbbe07e738542ed4d345f5c8fd/vcf.c#L1519

https://github.com/samtools/htslib/blob/ce7daf765c9748bbbe07e738542ed4d345f5c8fd/vcf.c#L1543

https://github.com/samtools/htslib/blob/ce7daf765c9748bbbe07e738542ed4d345f5c8fd/vcf.c#L1572

As far as I can tell, the duplicated bcf1_t becomes invalid only when we use BCF_UN_STR, not with other values, including BCF_UN_FLT, which is the next level up. Intriguingly, the corrupted element is the n_flt field.

@daviesrob I will work on a more comprehensive report and/or debugging session this weekend, and a minimal reproducible file for you.