Open jblachly opened 5 years ago
Setting bcf1_t.max_unpack
makes bcf_read()
skip over parts of the data, which means you will be missing some of the information needed by bcf_write()
. It's intended to be used by programs that are not interested in the per-sample information, and are not going to output the data in vcf or bcf format.
So don't set bcf1_t.max_unpack
if you want to call bcf_write()
.
@daviesrob Agree and understand it is explicitly for when one wants to ignore bits of the record in tradeoff for speed.
My reading of bcf_read()
max_unpack
, bcf1_t
init, and the surrounding code is that it will leave those bits of bcf1_t
uninitialized, but not in an invalid state. Thus writing the record out again should (and usually does) output '.' for FILTER, INFO, FORMAT (if applicable).
Indeed, whereas the parent bcf1_t
returned by bcf_read()
remains valid with empty entries for anything beyond the unpacked values, the bnew
returned from bcf_dup()
is what's being corrupted at some point in the call stack.
Simply accessing the filter field (independent of calling bcf_write
) is enough to cause a segmentation fault because the memory within the (copy of) bcf1_t
is invalid somehow.
It could be a bug in bcf_dup()
. Unfortunately I haven't been able to get a SEGV out of it yet, just other assorted bad behaviour. Would it be possible to come up with a small file that reproduces the crash that you get?
@jblachly The max_unpack feature is intended for faster reading of VCFs with many samples in the special case when the program does not need to access the FORMAT fields, see vcf_parse()
.
It is possible to make it work with BCFs, but the trade off there is not significant and not worth the effort.
I suggest we update the documentation to include a comment saying that the max_unpack feature is a read-only feature which cannot be used for writing.
Dear Petr:
We are not reading from BCFs. In addition, max_unpack has special case code not only for ignoring FORMAT fields, but other elements of the record:
https://github.com/samtools/htslib/blob/ce7daf765c9748bbbe07e738542ed4d345f5c8fd/vcf.c#L1519
https://github.com/samtools/htslib/blob/ce7daf765c9748bbbe07e738542ed4d345f5c8fd/vcf.c#L1543
https://github.com/samtools/htslib/blob/ce7daf765c9748bbbe07e738542ed4d345f5c8fd/vcf.c#L1572
As far as I can tell, the (duplicated) bcf1_t becomes invalid only in the case when we use BCF_UN_STR
, but not other values including BCF_UN_FLT
which is the next level up. Intriguingly, the corrupted element is the n_flt
field.
@daviesrob I will work on a more comprehensive report and/or debugging session this week-end and minimal reproducible example for you.
Hi, thanks for htslib and how responsive developers have always been. Apologies in advance for the long text below.
While exploring potential speedups setting
bcf1_t.max_unpack
as documented in vcf.h:I found that by setting to
BCF_UN_STR
(in contrast to all other values, which seem to work as expected), I can: A. get wrong behavior in some fields (in particular, multiple FILTER results) B. generate occasional to guaranteed segfaults ( with invalid bcf1_t->d.n_flt value )Since
BCF_UN_STR
by definition does not include the FILTER field, and segfaults are related to n_flt/FILTER, I do suspect this is bug, but perhaps I am making calls in wrong order, or usingmax_unpack
incorrectly -- please let me know! Below is a minimal reproducible example.Notes:
b->max_unpack = <whatever>
prior to callingbcf_read
prevents segfaultbcf_unpack(bnew, MAX_UNPACK)
inside the loop increases probability of segfault to 100%: will sigsegv at vcf.c:vcf_format:2739 when trying to kputs a filter that DNE (looking at stack, n_flt is always some larger number when it should be 0 or 1)bcf_unpack
is called, the FILTER column exhibits strange artifacts (multiple when there should not be, mostly)bnew
and reusing onlyb
eliminates issue, see comment at bottomBy removing
bnew
and only operating onb
(without even callingbcf_empty
) the problem disappears entirely and program works as expected.bnew
is included because this C code recapitulates exactly the calls made by an htslib wrapper in a higher level OOP language, for which I do need to copy thebcf1_t
.Thanks again, and if I am calling the API in wrong way or out of order , please, let me know!