samtools / htslib

C library for high-throughput sequencing data formats
Other
799 stars 445 forks source link

Initial test framework for #1543 (to be continued) #1548

Closed pd3 closed 1 year ago

pd3 commented 1 year ago

This is an initial test framework as suggested by https://github.com/samtools/htslib/issues/1543#issuecomment-1373575003

@daviesrob is this what you had in mind?

daviesrob commented 1 year ago

Yes, something like that, but you currently leak a hash table. The allocation code should I think look like this:

    h->dict[0] = calloc(1,sizeof(bcf_hdr_aux_t));
    if (!h->dict[0]) goto fail;
    for (i = 1; i < 3; ++i)
        if ((h->dict[i] = kh_init(vdict)) == NULL) goto fail;

And we'd need to take care that the code to free it up does the right thing with the franken-structure.

We should also add a static inline function to get the bcf_hdr_aux_t out of the header, so we can hide away the grotty details, e.g.:

static inline bcf_hdr_aux_t * get_hdr_aux(bcf_hdr_t *h) {
    return (bcf_hdr_aux_t *) h->dict[0];
}
pd3 commented 1 year ago

Yes, something like that, but you currently leak a hash table. The allocation code should I think look like this:

    h->dict[0] = calloc(1,sizeof(bcf_hdr_aux_t));
    if (!h->dict[0]) goto fail;
    for (i = 1; i < 3; ++i)
        if ((h->dict[i] = kh_init(vdict)) == NULL) goto fail;

And we'd need to take care that the code to free it up does the right thing with the franken-structure.

I think it just needs this single line

free(h->dict[0]);

Right now the aux structure does not allocate anything is freed by free(h->id[i]); in bcf_hdr_destroy. Later it will need to free any custom data

if ( i==0 ) { /* free custom aux data */ }
daviesrob commented 1 year ago

Yes, but now it allocates something just to instantly throw it away. It's easier just to make the thing you wanted in the first place.

pd3 commented 1 year ago

Yes, but does not require knowing what is happening inside kh_init, so arguably it's cleaner and easier to understand.

pd3 commented 1 year ago

I forced pushed an updated version that is resolving the issue #1543 in my tests

jkbonfield commented 1 year ago

This is a somewhat philosophical question perhaps, but why are we looking for and removing duplicates in the first place? As far as I can see, the VCF spec does not disallow duplicate unstructured header lines. I don't think this is an error by omission either, as it does explicitly state structured header lines must have unique IDs.

I don't know what the effect would be though. FWIW Picard reorders the lines (now alphabetical sorted) and removes duplicates. I assume that also means the ordering of header lines is irrelevant too?

If duplicates are an error, maybe we should raise an issue with the VCF spec?

pd3 commented 1 year ago

It is not an error to have duplicate lines but it makes sense to prevent them. VCF does not require a specific order, with the exception of fileformat being first and PASS as the first filter.

jkbonfield commented 1 year ago

I was just wondering if there was a simpler alternative, which was enforcing uniqueness checking at all and just regurgitating whatever lines the user had in their input data. No uniqueness checks on unstructured data does obviously solve the O(N^2) issue, but yes it's probably a thing worth doing.

So if we feel it's something that needs doing, I would recommend adding it to the VCF spec as it's not really good to be enforcing our own undocumented rules on top of the public specification. (And it appears Picard is already in agreement with our strategy too.)

PS. Hmm, are you sure on PASS being required to be the first filter? I couldn't find this in the VCF spec, and Picard doesn't honour it.

@ seq4c[samtools.../htslib]; head /tmp/b.vcf
##fileformat=VCFv4.1
##fileDate=20150126
##reference=hs37d5
##phasing=partial
##contig=<ID=1,length=123456789>
##FILTER=<ID=PASSS,Description="pass">
##FILTER=<ID=VQSRTrancheSNP99.00to99.90,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -6.6778 <= x < -0.6832">
##FILTER=<ID=INDEL_SPECIFIC_FILTERS,Description="QD < 2.0 || ReadPosRankSum < -20.0 || InbreedingCoeff < -0.8 || FS > 200.0">
##FILTER=<ID=LowQual,Description="Low quality">
##FILTER=<ID=VQSRTrancheSNP99.90to100.00+,Description="Truth sensitivity tranche level for SNP model at VQS Lod < -36469.5723">

@ seq4c[samtools.../htslib]; \java -jar ~/work/picard-2.26.11.jar FilterVcf I=/tmp/b.vcf O=/tmp/_.vcf
INFO    2023-01-30 16:51:30 FilterVcf   
...

@ seq4c[samtools.../htslib]; head /tmp/_.vcf
##fileformat=VCFv4.2
##FILTER=<ID=AllGtsFiltered,Description="Site filtered out because all genotypes are filtered out.">
##FILTER=<ID=AlleleBalance,Description="Heterozygote allele balance below required threshold.">
##FILTER=<ID=INDEL_SPECIFIC_FILTERS,Description="QD < 2.0 || ReadPosRankSum < -20.0 || InbreedingCoeff < -0.8 || FS > 200.0">
##FILTER=<ID=LowQD,Description="Site exhibits QD value below a hard limit.">
##FILTER=<ID=LowQual,Description="Low quality">
##FILTER=<ID=PASSS,Description="pass">
##FILTER=<ID=StrandBias,Description="Site exhibits excessive allele/strand correlation.">
##FILTER=<ID=VQSRTrancheSNP99.00to99.90,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -6.6778 <= x < -0.6832">
##FILTER=<ID=VQSRTrancheSNP99.90to100.00+,Description="Truth sensitivity tranche level for SNP model at VQS Lod < -36469.5723">

However if it's an htslib bug then it's a diversion from this PR and should be opened as a new issue.

Edit: Oh I see a bit more now. Picard actually removed the PASS line. So it wasn't just a case of it reordering things. Checking the spec again though I see an example file using the PASS filter without a FILTER= line to define it, so it's clearly a legal thing to do. Sadly this is another case of specification-via-example as there's nothing in the spec to explicitly say that PASS can be omitted from the header.