samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
652 stars 240 forks source link

binom(AD) filtering best practice #1049

Open ad2n15 opened 5 years ago

ad2n15 commented 5 years ago

I am filtering a multi VCF file of > 1000 sapmples, to identify somatic variants. So, I want to include variants which achieve this criteria:

1) SNPs 2) Observed once or twice (minor allele frequency less than 0.01%) in the cohort 3) Allelic fraction above 10% 4) Failed the hypothesis that the alternate allelic count was distributed as a binomial process with mean 45% with a designed false positive rate of 10-5

The last point is tricky for me.

Kind Regards

pd3 commented 5 years ago

Unfortunately, such feature is not available in bcftools. You could do the following though.

The binom() function only calculates the binomial probability and assumes the mean 0.5. If you want to correct for multiple testing in order to meet the desired FDR rate, you could modify the expected mean here https://github.com/samtools/bcftools/blob/develop/convert.c#L1176 and then use something like

bcftools query -i'GT="het"' -f'[%CHROM:%POS %SAMPLE %GT %pbinom(AD)\n]' file.vcf

to get the list of probabilities for each tested heterozygous genotype. Then it is possible to determine the Benjamini-Hochberg correction for filtering. The expected mean would need to be adjusted here https://github.com/samtools/bcftools/blob/develop/filter.c#L1360.

A possible improvement to bcftools would be to allow passing the expected mean as a parameter to binom() so that you wouldn't have to hack the code.