Open pd3 opened 1 year ago
For now I am replacing the code:
bcf_hdr_remove(hdr, BCF_HL_CTG, NULL);
with:
bcf_hdr_remove(hdr, BCF_HL_CTG, NULL);
kh_clear(vdict, hdr->dict[BCF_DT_CTG]);
for (int i=0; i<out->n[BCF_DT_CTG]; i++)
free((void *)out->id[BCF_DT_CTG][i].key);
hdr->n[BCF_DT_CTG] = 0;
though it requires to redefine the kh_clear_vdict(h)
function with:
KHASH_MAP_INIT_STR(vdict, bcf_idinfo_t)
while the hash tables should be opaque to the end users
This is trickier than it seems. The bcf_hdr_remove()
code is not cleaning the header entirely on purpose, because that would make it impossible to parse subsequent data records.
So one should not be accessing the hdr->n
fields directly, they are not informative if the header has been edited. Instead, bcf_hdr_seqnames()
should be used, I just pushed a commit that makes the function work in this scenario as well.
However, the bcftools isec
case is still failing regardless of this at a different place, as described in this related issue https://github.com/samtools/htslib/issues/1534
I think I understand the problem a bit better now. A simpler way to see what is going on is also to run:
$ cat /tmp/main.vcf | /tmp/main | zcat
before: rid=0 n_contigs=1
after: rid=1 n_contigs=2
BCF�##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed",IDX=0>
##contig=<ID=chr22,length=50818468,IDX=1>
#CHROM POS ID REF ALT QUAL FILTER INFO
Or again to run:
$ cat /tmp/main.vcf | /tmp/main | /tmp/main | zcat
before: rid=0 n_contigs=1
after: rid=1 n_contigs=2
before: rid=1 n_contigs=2
after: rid=2 n_contigs=3
BCF�##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed",IDX=0>
##contig=<ID=chr22,length=50818468,IDX=2>
#CHROM POS ID REF ALT QUAL FILTER INFO
The IDX
keeps increasing for the contigs and skipping the initial values. Converting to text format resets IDX
:
$ cat /tmp/main.vcf | /tmp/main | bcftools view --no-version | bcftools view --no-version -Ou
before: rid=0 n_contigs=1
after: rid=1 n_contigs=2
BCF�##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed",IDX=0>
##contig=<ID=chr22,length=50818468,IDX=0>
#CHROM POS ID REF ALT QUAL FILTER INFO
If the intended behavior of bcf_hdr_remove(hdr, BCF_HL_CTG, NULL)
is not to reset IDX
to still be able to parse subsequent data records, is there an alternative way to do so? For reproducibility purposes, I would prefer to have control into what kind of information goes into the output binary VCF
I am not sure what you are after exactly. The header and the body must stay consistent, the IDX field was introduced specifically for cases like this. It should be a considered a bug that htslib is not handling it properly.
In my specific case I am writing a BCFtools plugin, so I have one in
header for reading the input VCF and one out
header for writing the output VCF, so I don't need to parse data records with the out
header. I would prefer to have an option to output a VCF without the missing IDX numbers. I would prefer the output VCF to have no unneeded memory of what was in the input VCF
Tags and contigs in BCF body are identified by their id which is defined implicitly by the order of their definitions in the header. This brings a problem: if, say, the first tag is removed from the header, ids of the remaining tags change by -1, and the entire BCF has to be recoded. The IDX field was introduced to preserve tag ids even when some are removed or reordered. This is part of the BCF specification and all readers must support it, you will not gain anything by doing this extra work.
It seems to me like you are thinking about protecting the end users by preserving the contig table in the header. But what are the end users expected to use the modified header with the now missing header records for? The old contigs' rid's can be interpreted while reading the VCF but if the modified header was the one written in the output the old rid's should not be output as the corresponding table's entries would be missing from the printed header and will be discarded from memory once the executable is over.
To be specific, my use case is a plugin that lifts over a VCF from one reference to another. The in
header is used to interpret the old rid's and the out
header has the contig table first cleaned up with:
bcf_hdr_remove(hdr, BCF_HL_CTG, NULL);
kh_clear(vdict, hdr->dict[BCF_DT_CTG]);
for (int i=0; i<out->n[BCF_DT_CTG]; i++)
free((void *)out->id[BCF_DT_CTG][i].key);
hdr->n[BCF_DT_CTG] = 0;
and then filled with a new contig table from an index FASTA structure. This approach effectively resets the IDX field and solves the issue on my side. What would be a use case for using bcf_hdr_remove(hdr, BCF_HL_CTG, NULL)
while keeping the contig table in memory?
I am mostly curious. I am not necessarily advocating for one way or another and I was trying to understand if there was a canonical way to remove both the dictionary and the contig table from a header structure.
It is not about protecting end users, but about preserving the integrity of the BCF, about programming convenience and about the speed of processing. Also it is not only about contigs, but also about FILTER, INFO, and FORMAT tags. For FILTER,INFO,FORMAT this has been used for quite a while and it works, for contigs you happened to test a combination of steps that revealed a bug.
Say the VCF header looks like this
##INFO=<ID=TAG0,...>
##INFO=<ID=TAG1,...>
##INFO=<ID=TAG2,...>
chr1 ... TAG0=0;TAG1=1;TAG2=2
then the BCF body encodes the tags as this (shown here in a simplified form)
chr1 ... 0=0;1=1;2=2
If we remove the TAG1 from the header and from the body, the modified BCF looks like this
##INFO=<ID=TAG1,...,IDX=1>
##INFO=<ID=TAG2,...,IDX=2>
chr1 ... 1=1;2=2
If we accept your suggestion and discard the indexes, we'd have to manually recode all subsequent tags (IDX -> IDX-1) in the entire BCF and make it look like this
##INFO=<ID=TAG1,...>
##INFO=<ID=TAG2,...>
chr1 ... 0=1;1=2
Even if we decided we wanted to do it this way, at this point it's too late, this would be a major breaking change. Luckily, there is no need for that, we can just fix bcf_hdr_seqnames
(https://github.com/pd3/htslib/commit/d64e710e891be3e5ecd89490d039228075172c9c) and the indexing code (https://github.com/samtools/htslib/issues/1534).
Another way for what I was trying to say is that you would not use bcf_hdr_write(fp, hdr)
after you run bcf_hdr_remove(hdr, BCF_HL_CTG, NULL)
as the header contig table would not output anymore and the output VCF would be malformed. In any case, I am happy with the current solution and explanation. This issue can be closed
Fixed by #1535
Somehow this issue https://github.com/samtools/htslib/issues/842 has not been fully addressed. The following code shows the problem:
Then compling:
Create a toy VCF file:
Then running a first test:
Shows that the number of contigs is increasing even if the code removed all contigs with the
bcf_hdr_remove()
functionThe number of contigs increases at each iteration:
Unless the VCF is converted to text file in between:
This must not be the intended behavior as it leads to unwanted scenarios:
While it works fine if converted to text file:
Notice that in
htslib/vcf.h
the functionbcf_hdr_remove()
is explained as follows:Originally posted by @freeseek in https://github.com/samtools/htslib/issues/842#issuecomment-1338682524