samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
634 stars 241 forks source link

FeatureRequest: filling in INFO/QD and INFO/FS attributes #2164

Closed WimSpee closed 2 months ago

WimSpee commented 2 months ago

Dear BCFTools developers,

Some modern very large scale multi-sample variant calling methods opt for a very efficient merge of gVCF files to create the multi-sample VCF file.

Calculation of the INFO/QD and INFO/FS attributes has been dropped for performance reasons.

These INFO/QD and INFO/FS attributes have proven useful to us for setting filter labels on likely false positive variants. INFO/QD was used the most to set filter labels on variants in our experience, followed by INFO/FS.

I am wondering if it would be possible to fill these tags via bcftools. Similar to the current bcftools fill-tags plugin functionality.

INFO/QD

The formula for QD is as straight forward as QUAL/DP if I understand it correct.
The INFO/QUAL and INFO/DP values are present in the input multi-sample VCF file.

image

INFO/FS

INFO/FS seems to be based on the FMT/SB values via some formula. I am not sure of the formula, maybe you know, or I could search for it. The per sample FMT/SB values are present in the input multi-sample VCF file.

##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias">

Thank you for considering this request.

pd3 commented 2 months ago

It very depends on the information stored in the VCF. For QD there needs to be INFO/DP and QUAL. These are usually present and such filtering can be done on the fly with something like

bcftools view -i 'QUAL/DP>0.1'

With FS, there the VCF would have to have coverage for reference and alternate reads on both strands, typically FORMAT/ADF and FORMAT/ADR. Is this information present in your VCFs?

WimSpee commented 2 months ago

INFO/QD It looks like QD can already be filled via this bcftools fill-tags command. bcftools +fill-tags ON_amplicon.vcf.gz -- -t 'INFO/QD:1=QUAL/INFO/DP

So nothing new needed for QD.

INFO/FS Our input multi-sample VCF files don't have FORMAT/ADF or FORMAT/ADR.

FORMAT/SB. is present in the multi-sample VCF.

##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias">

According to GATK this is the order of the SB values. The software we are using is originally derived from GATK, so I think the order is the same. Loading the CRAM and VCF file in IGV confirms this order of the values for the SB attribute.

/**
 * Number of forward and reverse reads that support REF and ALT alleles
 *
 * <p>Strand bias is a type of sequencing bias in which one DNA strand is favored over the other, which can result in incorrect evaluation of the amount of evidence observed for one allele vs. the other. The StrandBiasBySample annotation produces read counts per allele and per strand that are used by other annotation modules (FisherStrand and StrandOddsRatio) to estimate strand bias using statistical approaches.
 *
 * <p>This annotation produces 4 values, corresponding to the number of reads that support the following (in that order):</p>
 * <ul>
 *     <li>the reference allele on the forward strand</li>
 *     <li>the reference allele on the reverse strand</li>
 *     <li>the alternate allele on the forward strand</li>
 *     <li>the alternate allele on the reverse strand</li>
 * </ul>
 *
 * <h3>Example</h3>
 * <pre>GT:AD:GQ:PL:SB  0/1:53,51:99:1758,0,1835:23,30,33,18</pre>
 * <p>In this example, the reference allele is supported by 23 forward reads and 30 reverse reads, the alternate allele is supported by 33 forward reads and 18 reverse reads.</p>
 *
 * <h3>Caveats</h3>
 * <ul>
 *     <li>This annotation can only be generated by HaplotypeCaller (it will not work when called from VariantAnnotator).</li>
 *     <li>StrandBiasBySample is an intermediate annotation used for calculating FisherStrand (FS) and StrandOddsRatio (SOR), so it will not show up in VCFs after using the GenotypeGVCFs tool
            unless `--keep-combined-raw-annotations` is specified.</li>
 * </ul>
 *
 * <h3>Related annotations</h3>
 * <ul>
 *     <li><b>FisherStrand</b> uses Fisher's Exact Test to evaluate strand bias.</li>
 *     <li><b>StrandOddsRatio</b> is an updated form of FisherStrand that uses a symmetric odds ratio calculation.</li>
 * </ul>

Source of the above information: https://github.com/broadinstitute/gatk/blob/aed8b1b213e66dc1327bcfda27ce85cc53cbf47a/src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator/StrandBiasBySample.java#L31

Here is how the SB attribute looks for an input multi-sample VCF file with 6 samples.

bcftools view -i "INFO/DP>100" input.vcf.gz |  bcftools query -f '[\t%SB]\n' | grep -v "\."   | head  -n 20
        0,0,0,0 0,0,0,0 0,0,0,0 0,0,0,0 70,34,7,8       56,32,9,5
        0,0,0,0 0,0,0,0 0,0,0,0 0,0,0,0 26,56,2,9       23,51,1,10
        0,0,0,0 0,0,0,0 0,0,0,0 0,0,0,0 0,0,30,62       0,0,26,58
        0,0,1,1 0,0,1,1 0,0,41,76       0,0,37,55       0,0,9,13        0,0,9,12
        0,0,1,1 0,0,1,1 0,0,41,76       0,0,37,55       0,0,9,13        0,0,9,12
        0,0,1,2 0,0,1,2 0,0,28,71       0,0,26,55       0,0,7,14        0,0,7,13
        0,0,1,2 0,0,1,2 0,0,20,56       0,0,18,44       0,0,10,13       0,0,10,12
        0,0,1,2 0,0,1,2 0,0,16,50       0,0,15,42       0,0,10,10       0,0,10,10
        0,0,1,2 0,0,1,2 0,0,16,22       0,0,16,20       0,0,6,7 0,0,6,7
        0,0,2,1 0,0,1,1 0,0,17,10       0,0,16,8        0,0,8,6 0,0,6,7
        0,0,1,2 0,0,1,1 0,0,15,21       0,0,13,20       0,0,4,10        0,0,6,7
        0,0,1,2 0,0,0,2 0,0,19,20       0,0,16,18       0,0,6,10        0,0,5,9
        0,0,4,0 0,0,2,0 0,0,4,0 0,0,4,0 0,0,3,0 0,0,3,0
        0,0,11,0        0,0,7,1 0,0,11,0        0,0,11,0        0,0,8,0 0,0,8,0
        41,6,14,3       74,17,6,4       0,0,0,0 0,0,0,0 0,0,0,0 0,0,0,0
        13,4,44,7       6,5,79,17       1,0,163,27      0,0,96,31       0,0,175,29      0,1,88,35
        6,7,132,225     6,6,165,179     0,5,227,176     3,0,173,173     2,3,229,168     2,5,165,165
        38,151,41,68    46,95,39,47     30,86,43,60     36,62,40,58     22,76,44,58     42,75,33,51
        0,0,0,0 0,0,0,0 61,90,8,53      58,78,13,34     0,0,0,0 55,74,13,45
        0,0,0,0 0,0,0,0 61,90,5,50      59,79,8,27      0,0,0,0 55,77,11,38
pd3 commented 2 months ago

The header line claims it computes Fisher's exact test and the code fragment supports that view https://github.com/broadinstitute/gatk/blob/aed8b1b213e66dc1327bcfda27ce85cc53cbf47a/src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator/StrandBiasBySample.java#L91-L93. SB is probably a phred-scale p-value of it. In any case, there does not seem any work for bcftools here. In order to compute it, we'd need to have the reference and alternate read counts for both strands.

I believe this resolves the issue.