samtools / htslib

C library for high-throughput sequencing data formats
Other
809 stars 446 forks source link

bcf_hdr_remove leaves hash keys behind #1533

Open pd3 opened 1 year ago

pd3 commented 1 year ago

Somehow this issue https://github.com/samtools/htslib/issues/842 has not been fully addressed. The following code shows the problem:

$ echo '#include <htslib/vcf.h>

int main() {
  htsFile *in = hts_open("-", "r");
  bcf_hdr_t *hdr = bcf_hdr_read(in);
  fprintf(stderr, "before: rid=%d n_contigs=%d\n", bcf_hdr_id2int(hdr, BCF_DT_CTG, "chr22"), hdr->n[BCF_DT_CTG]);
  // remove all contigs
  bcf_hdr_remove(hdr, BCF_HL_CTG, NULL);
  // add a new contig
  bcf_hdr_printf(hdr, "##contig=<ID=chr22,length=50818468>");
  bcf_hdr_sync(hdr);
  // show that the number of contigs is increasing
  fprintf(stderr, "after: rid=%d n_contigs=%d\n", bcf_hdr_id2int(hdr, BCF_DT_CTG, "chr22"), hdr->n[BCF_DT_CTG]);
  htsFile *out = hts_open("-", "wb");
  bcf_hdr_write(out, hdr);
  bcf_hdr_destroy(hdr);
  hts_close(in);
  hts_close(out);
  return 0;
}' > /tmp/main.c

Then compling:

$ gcc /tmp/main.c -o /tmp/main -lhts

Create a toy VCF file:

$ (echo -e "##fileformat=VCFv4.2\n##contig=<ID=chr22,length=51304566>\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO") > /tmp/main.vcf

Then running a first test:

$ cat /tmp/main.vcf | /tmp/main > /dev/null
before: rid=0 n_contigs=1
after: rid=1 n_contigs=2

Shows that the number of contigs is increasing even if the code removed all contigs with the bcf_hdr_remove() function

The number of contigs increases at each iteration:

$ cat /tmp/main.vcf | /tmp/main | /tmp/main > /dev/null
before: rid=0 n_contigs=1
after: rid=1 n_contigs=2
before: rid=1 n_contigs=2
after: rid=2 n_contigs=3

Unless the VCF is converted to text file in between:

$ cat /tmp/main.vcf | /tmp/main | bcftools view | /tmp/main > /dev/null
before: rid=0 n_contigs=1
after: rid=1 n_contigs=2
before: rid=0 n_contigs=1
after: rid=1 n_contigs=2

This must not be the intended behavior as it leads to unwanted scenarios:

$ cat /tmp/main.vcf | /tmp/main > /tmp/main.bcf && bcftools index -f /tmp/main.bcf
$ bcftools isec -C /tmp/main.bcf /tmp/main.bcf
bcftools: vcf.c:2221: bcf_hdr_seqnames: Assertion `tid<m' failed.
Aborted (core dumped)

While it works fine if converted to text file:

$ cat /tmp/main.vcf | /tmp/main | bcftools view -Oz > /tmp/main.vcf.gz && bcftools index -f -t /tmp/main.vcf.gz
$ bcftools isec -C /tmp/main.vcf.gz /tmp/main.vcf.gz
Note: -w option not given, printing list of sites...

Notice that in htslib/vcf.h the function bcf_hdr_remove() is explained as follows:

    /**
     *  bcf_hdr_remove() - remove VCF header tag
     *  @param type:      one of BCF_HL_*
     *  @param key:       tag name or NULL to remove all tags of the given type
     */
    HTSLIB_EXPORT
    void bcf_hdr_remove(bcf_hdr_t *h, int type, const char *key);

Originally posted by @freeseek in https://github.com/samtools/htslib/issues/842#issuecomment-1338682524

freeseek commented 1 year ago

For now I am replacing the code:

bcf_hdr_remove(hdr, BCF_HL_CTG, NULL);

with:

bcf_hdr_remove(hdr, BCF_HL_CTG, NULL);
kh_clear(vdict, hdr->dict[BCF_DT_CTG]);
for (int i=0; i<out->n[BCF_DT_CTG]; i++)
  free((void *)out->id[BCF_DT_CTG][i].key);
hdr->n[BCF_DT_CTG] = 0;

though it requires to redefine the kh_clear_vdict(h) function with:

KHASH_MAP_INIT_STR(vdict, bcf_idinfo_t)

while the hash tables should be opaque to the end users

pd3 commented 1 year ago

This is trickier than it seems. The bcf_hdr_remove() code is not cleaning the header entirely on purpose, because that would make it impossible to parse subsequent data records.

So one should not be accessing the hdr->n fields directly, they are not informative if the header has been edited. Instead, bcf_hdr_seqnames() should be used, I just pushed a commit that makes the function work in this scenario as well.

However, the bcftools isec case is still failing regardless of this at a different place, as described in this related issue https://github.com/samtools/htslib/issues/1534

freeseek commented 1 year ago

I think I understand the problem a bit better now. A simpler way to see what is going on is also to run:

$ cat /tmp/main.vcf | /tmp/main | zcat
before: rid=0 n_contigs=1
after: rid=1 n_contigs=2
BCF�##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed",IDX=0>
##contig=<ID=chr22,length=50818468,IDX=1>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO

Or again to run:

$ cat /tmp/main.vcf | /tmp/main | /tmp/main | zcat
before: rid=0 n_contigs=1
after: rid=1 n_contigs=2
before: rid=1 n_contigs=2
after: rid=2 n_contigs=3
BCF�##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed",IDX=0>
##contig=<ID=chr22,length=50818468,IDX=2>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO

The IDX keeps increasing for the contigs and skipping the initial values. Converting to text format resets IDX:

$ cat /tmp/main.vcf | /tmp/main | bcftools view --no-version | bcftools view --no-version -Ou
before: rid=0 n_contigs=1
after: rid=1 n_contigs=2
BCF�##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed",IDX=0>
##contig=<ID=chr22,length=50818468,IDX=0>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO

If the intended behavior of bcf_hdr_remove(hdr, BCF_HL_CTG, NULL) is not to reset IDX to still be able to parse subsequent data records, is there an alternative way to do so? For reproducibility purposes, I would prefer to have control into what kind of information goes into the output binary VCF

pd3 commented 1 year ago

I am not sure what you are after exactly. The header and the body must stay consistent, the IDX field was introduced specifically for cases like this. It should be a considered a bug that htslib is not handling it properly.

freeseek commented 1 year ago

In my specific case I am writing a BCFtools plugin, so I have one in header for reading the input VCF and one out header for writing the output VCF, so I don't need to parse data records with the out header. I would prefer to have an option to output a VCF without the missing IDX numbers. I would prefer the output VCF to have no unneeded memory of what was in the input VCF

pd3 commented 1 year ago

Tags and contigs in BCF body are identified by their id which is defined implicitly by the order of their definitions in the header. This brings a problem: if, say, the first tag is removed from the header, ids of the remaining tags change by -1, and the entire BCF has to be recoded. The IDX field was introduced to preserve tag ids even when some are removed or reordered. This is part of the BCF specification and all readers must support it, you will not gain anything by doing this extra work.

freeseek commented 1 year ago

It seems to me like you are thinking about protecting the end users by preserving the contig table in the header. But what are the end users expected to use the modified header with the now missing header records for? The old contigs' rid's can be interpreted while reading the VCF but if the modified header was the one written in the output the old rid's should not be output as the corresponding table's entries would be missing from the printed header and will be discarded from memory once the executable is over.

To be specific, my use case is a plugin that lifts over a VCF from one reference to another. The in header is used to interpret the old rid's and the out header has the contig table first cleaned up with:

bcf_hdr_remove(hdr, BCF_HL_CTG, NULL);
kh_clear(vdict, hdr->dict[BCF_DT_CTG]);
for (int i=0; i<out->n[BCF_DT_CTG]; i++)
  free((void *)out->id[BCF_DT_CTG][i].key);
hdr->n[BCF_DT_CTG] = 0;

and then filled with a new contig table from an index FASTA structure. This approach effectively resets the IDX field and solves the issue on my side. What would be a use case for using bcf_hdr_remove(hdr, BCF_HL_CTG, NULL) while keeping the contig table in memory?

I am mostly curious. I am not necessarily advocating for one way or another and I was trying to understand if there was a canonical way to remove both the dictionary and the contig table from a header structure.

pd3 commented 1 year ago

It is not about protecting end users, but about preserving the integrity of the BCF, about programming convenience and about the speed of processing. Also it is not only about contigs, but also about FILTER, INFO, and FORMAT tags. For FILTER,INFO,FORMAT this has been used for quite a while and it works, for contigs you happened to test a combination of steps that revealed a bug.

Say the VCF header looks like this

##INFO=<ID=TAG0,...>
##INFO=<ID=TAG1,...>
##INFO=<ID=TAG2,...>
chr1 ... TAG0=0;TAG1=1;TAG2=2

then the BCF body encodes the tags as this (shown here in a simplified form)

chr1 ... 0=0;1=1;2=2

If we remove the TAG1 from the header and from the body, the modified BCF looks like this

##INFO=<ID=TAG1,...,IDX=1>
##INFO=<ID=TAG2,...,IDX=2>
chr1 ... 1=1;2=2

If we accept your suggestion and discard the indexes, we'd have to manually recode all subsequent tags (IDX -> IDX-1) in the entire BCF and make it look like this

##INFO=<ID=TAG1,...>
##INFO=<ID=TAG2,...>
chr1 ... 0=1;1=2

Even if we decided we wanted to do it this way, at this point it's too late, this would be a major breaking change. Luckily, there is no need for that, we can just fix bcf_hdr_seqnames (https://github.com/pd3/htslib/commit/d64e710e891be3e5ecd89490d039228075172c9c) and the indexing code (https://github.com/samtools/htslib/issues/1534).

freeseek commented 1 year ago

Another way for what I was trying to say is that you would not use bcf_hdr_write(fp, hdr) after you run bcf_hdr_remove(hdr, BCF_HL_CTG, NULL) as the header contig table would not output anymore and the output VCF would be malformed. In any case, I am happy with the current solution and explanation. This issue can be closed

daviesrob commented 1 year ago

Fixed by #1535