Closed: freeseek closed this issue 1 year ago
Indeed, there was already a comment in htslib/vcf.c
related to this issue:
```c
/*
 * Note that while querying of FLT,INFO,FMT,CTG lines is fast (the keys are hashed),
 * the STR,GEN lines are searched for linearly in a linked list of all header lines.
 * This may become a problem for VCFs with huge headers, we might need to build a
 * dictionary for these lines as well.
 */
bcf_hrec_t *bcf_hdr_get_hrec(const bcf_hdr_t *hdr, int type, const char *key, const char *value, const char *str_class)
```
That's right. This was not considered a big problem because in practice the time spent on parsing the header constitutes only a fraction of the overall time spent on parsing the whole VCF.
Nevertheless, there are possible solutions, one of which is to extend `bcf_hdr_t` to include a hash dictionary. This is an ABI-breaking change, and if we should go this way, it would be good to do so by adding a pointer to opaque auxiliary data for future modifications, similarly to `bcf_srs_t.aux` in `synced_bcf_reader.h`. Only this first solution also helps to improve the performance of `bcf_hdr_get_hrec`, but all of them would help to address this specific issue.
Is this a theoretical problem, or one you've hit on a real file? If theoretical, then I think it might be best to ignore it. If it's real, then adding a dictionary would be the easiest fix. Happily I note that `bcf_hdr_t::dict` is a `void *`, which gives a fairly easy way of stashing extra data without changing the ABI. We'd just need to make a struct in `vcf.c` with a `vdict_t` as the first element, and then use that for `dict[0]`. A simple accessor function could be used to hide the messy details, and we'd be able to store as much extra header data as we want.
I encountered the problem with Genozip DVCF, which can easily create headers with >1M `##primary_only=` header lines following the DVCF specification. I think the main problem is that the VCF specification does not impose a limit on the number of generic header lines, so this could happen again. Either the future VCF specification should be updated, or the implementation should be updated to handle these corner cases.
Hmm, looks like it would be worth making it more efficient then.
The following bash script takes a very long time to run, but the following two bash scripts run almost instantaneously.
I believe the problem is in `bcf_hdr_add_hrec()` from `htslib/vcf.c`: for generic header lines, in the `for (i=0; i<hdr->nhrec; i++)` loop the header line gets compared to all previous generic header lines until a duplicate is found. This means the header parser has O(n^2) complexity in the number of generic header lines. Not sure what to suggest, as I don't understand what the goal of avoiding duplicate generic header lines is.