samtools / htslib

C library for high-throughput sequencing data formats
Other
799 stars 445 forks source link

VCF header parser has O(n^2) complexity in the number of generic header lines #1543

Closed freeseek closed 1 year ago

freeseek commented 1 year ago

The following bash script:

$ (echo "##fileformat=VCFv4.2"
for i in {1..65536}; do echo "##key=$i"; done
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO") | \
  bcftools view -Ou -o /dev/null

Takes a very long time to run.

But the following bash script:

$ (echo "##fileformat=VCFv4.2"
for i in {1..65536}; do echo "##key=0"; done
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO") | \
  bcftools view -Ou -o /dev/null

And the following bash script:

$ (echo "##fileformat=VCFv4.2"
for i in {1..65536}; do echo "##key=<ID=$i>"; done
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO") | \
  bcftools view -Ou -o /dev/null

Run almost instantaneously.

I believe the problem is in bcf_hdr_add_hrec() from htslib/vcf.c:

int bcf_hdr_add_hrec(bcf_hdr_t *hdr, bcf_hrec_t *hrec)
{
    int res;
    if ( !hrec ) return 0;

    bcf_hrec_check(hrec);   // todo: check return status and propagate errors up

    res = bcf_hdr_register_hrec(hdr,hrec);
    if (res < 0) return -1;
    if ( !res )
    {
        // If one of the hashed field, then it is already present
        if ( hrec->type != BCF_HL_GEN )
        {
            bcf_hrec_destroy(hrec);
            return 0;
        }

        // Is one of the generic fields and already present?
        int i;
        for (i=0; i<hdr->nhrec; i++)
        {
            if ( hdr->hrec[i]->type!=BCF_HL_GEN ) continue;
            if ( !strcmp(hdr->hrec[i]->key,hrec->key) && !strcmp(hrec->key,"fileformat") ) break;
            if ( !strcmp(hdr->hrec[i]->key,hrec->key) && !strcmp(hdr->hrec[i]->value,hrec->value) ) break;
        }
        if ( i<hdr->nhrec )
        {
            bcf_hrec_destroy(hrec);
            return 0;
        }
    }

    // New record, needs to be added
    int n = hdr->nhrec + 1;
    bcf_hrec_t **new_hrec = realloc(hdr->hrec, n*sizeof(bcf_hrec_t*));
    if (!new_hrec) return -1;
    hdr->hrec = new_hrec;
    hdr->hrec[hdr->nhrec] = hrec;
    hdr->dirty = 1;
    hdr->nhrec = n;

    return hrec->type==BCF_HL_GEN ? 0 : 1;
}

For generic header lines in the for (i=0; i<hdr->nhrec; i++) loop the header line gets compared to the all previous generic header lines until a duplicate is found. This means the header parser has O(n^2) complexity in the number of generic header lines.

Not sure what to suggest as I don't understand what the goal of avoiding duplicate generic header lines is.

freeseek commented 1 year ago

Indeed, there was already a comment in htslib/vcf.c related to this issue:

/*
 *  Note that while querying of FLT,INFO,FMT,CTG lines is fast (the keys are hashed),
 *  the STR,GEN lines are searched for linearly in a linked list of all header lines.
 *  This may become a problem for VCFs with huge headers, we might need to build a
 *  dictionary for these lines as well.
 */
bcf_hrec_t *bcf_hdr_get_hrec(const bcf_hdr_t *hdr, int type, const char *key, const char *value, const char *str_class)
pd3 commented 1 year ago

That's right. This was not considered a big problem because in practice the time spent on parsing the header constitutes only a fraction of the overall time spent on parsing the whole VCF.

Nevertheless, possible solutions:

Only the first solution also helps to improve performance of bcf_hdr_get_hrec, but all would help to address this specific issue.

daviesrob commented 1 year ago

Is this a theoretical problem, or one you've hit on a real file? If theoretical then I think it might be best to ignore it. If it's real then adding a dictionary would be the easiest fix. Happily I note that bcf_hdr_t::dict is a void *, which gives a fairly easy way of stashing extra data without changing the ABI. We'd just need to make a struct in vcf.c with a vdict_t as the first element, and then use that for dict[0]. A simple accessor function could be used to hide the messy details, and we'd be able to store as much extra header data as we want.

freeseek commented 1 year ago

I encoutered the problem with Genozip DVCF which can easily create headers with >1M ##primary_only= header lines following the DVCF specification. I think the main problem is that the VCF specification does not force a limit on the number of generic header lines, so this could happen again. Either the future VCF specification should be updated or the implementation should be updated to handle these corner cases.

daviesrob commented 1 year ago

Hmm, looks like it would be worth making it more efficient then.