samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
662 stars 240 forks source link

Problem with binomial calculation #2186

Closed chundruv closed 2 months ago

chundruv commented 4 months ago

Hi Petr,

I noticed that when calculating the binomial of allele depth there is an issue when there is a multiallelic variant, which is not a problem when normalising first.

I had one position which was set to missing with the setGT when done before splitting, and not after splitting. It was a 0/6 genotype, the LAD was 19,21, and the GQ was 15. When you print out the binom before splitting it is giving a “.”, whereas after splitting it is giving the actual probabilities. However, when I manually change it to 0/1, it gives the correct probability without splitting so it’s a problem with the non-0/1 genotypes, it doesn’t like that.

Also, is there an option when splitting multiallelics to set the non-ref calls on other alleles to missing? Just thinking that if you set it to homref it might not be the best for some analyses. Also, is there a way to update the DP and other format fields when splitting?

Thanks, Karitk

pd3 commented 4 months ago

Any chance you could provide a concrete test case? Since you are quoting LAD, it is not clear what exactly the input data is like.

Also, is there an option when splitting multiallelics to set the non-ref calls on other alleles to missing? Just thinking that if you set it to homref it might not be the best for some analyses.

The command bcftools norm has the option

--multi-overlaps 0|.        Fill in the reference (0) or missing (.) allele when splitting multiallelics [0]

Also, is there a way to update the DP and other format fields when splitting?

The program does not attempt to recalculate DP from the split AD values, if that's what you mean. However, tere is the plugin +fill-tags which allows to do that

bcftools +fill-tags in.bcf -Ob -o out.bcf -- -t 'FORMAT/DP:1=int(smpl_sum(FORMAT/AD))'
IdoBar commented 3 months ago

I'm having a similar problem - trying to filter "potential heterozygote" genotype calls from haploid-called VCF. The site I'm looking at has genotypes call like the following - 0:12:7,5:7:280:5:184:0,-7.69402 (with this FORMAT: GT:DP:AD:RO:QR:AO:QA:GL) I've tried to reset it to missing with bcftools filter -S . -e 'binom(FMT/AD)>0.2', but it still keeps it in. I also tried to print the values of the binomial test (according to #1049), but it returns ., not a p-value:

$ bcftools query  -f'[%CHROM:%POS %SAMPLE %GT %PBINOM(AD)\n]' example_het_sites_for_binomial_test.recode.vcf
ArME14_ctg_22:331265 Ar22438 1 .
ArME14_ctg_22:331265 Ar22466 . .
ArME14_ctg_22:331265 Ar22548 1 .
ArME14_ctg_22:331265 Ar22039 1 .
ArME14_ctg_22:331265 Ar22025 1 .
ArME14_ctg_22:331265 Ar22588 1 .
ArME14_ctg_22:331265 Ar22019 1 .
ArME14_ctg_22:331265 Ar22182 . .
ArME14_ctg_22:331265 Ar22348 1 .
ArME14_ctg_22:331265 Ar22109 1 .
ArME14_ctg_22:331265 Ar22636 . .

Thanks, Ido

(You can try on this toy set that includes just one site) example_het_sites_for_binomial_test.recode.zip

chundruv commented 3 months ago

Sorry for the long delay @pd3 Attaching a test vcf test.vcf.zip

The test commands are: filter before splitting: bcftools +setGT -Ou test.vcf -- -t q -i 'binom(FMT/LAD)<=0.001' -n . | bcftools norm -m - | bcftools view -o before_split.vcf filter after splitting: bcftools norm -Ou -m- test.vcf | bcftools +setGT -Ou -- -t q -i 'binom(FMT/LAD)<=0.001' -n . | bcftools view -o after_split.vcf

Hope this helps, let me know if I can do anything else

pd3 commented 2 months ago

@chundruv There is a problem with the input VCF: the tag LAA is defined as Type=String but it must be defined as Type=Integer. I just pushed a commit which prints more informative error message.

After the VCF is fixed, one can debug the binomial probabilities with bcftools query. For example, we can run this on your test file:

$ bcftools +tag2tag test.vcf -- --LXX-to-XX | bcftools query -f '[%GT   %AD   %PBINOM(AD)]'
0/6   19,.,.,.,.,.,21,.,.,.,.   0.58176

The phred-scaled binomial probability in this case is 0.58176, which is the same value as computed with R

> -10*log10(binom.test(21,19+21,a="t")$p.value)
[1] 0.5817597
pd3 commented 2 months ago

@IdoBar The problem is that the VCF has haploid genotypes: the binom(AD) function selects the input values based on the GT tag, so in this case it has nothing to work with. I am thinking whether it's worthwhile to make the program automatically take AD[0] for haploid genotypes, or if this is something the user should take care of.

chundruv commented 2 months ago

Hi Petr,

Thanks for looking at this. I tried the two commands I pasted in my last comment after changing LAA to integer in the header and the problem is still there. Even when I convert to AD instead of LAD.

Best, Kartik

chundruv commented 2 months ago

Sorry! It does work after switching to AD from LAD, I forgot to use AD instead of LAD. But it still is an issue when using LAD even with the correct LAA header

pd3 commented 2 months ago

As described in the documentation, the function can take one argument (binom(FMT/AD)) and in that case the indices of values to use are determined from GT. Or one can define the fields explicitly

binom(FMT/AD)                .. GT can be used to determine the correct index
binom(AD[0],AD[1])           .. or the fields can be given explicitly

so in your case you would do for example

$ bcftools query test.vcf -f '[%GT   %LAD]' -i 'phred(binom(LAD[:0],LAD[:1]))>0.58175'
0/6   19,21
IdoBar commented 2 months ago

@IdoBar The problem is that the VCF has haploid genotypes: the binom(AD) function selects the input values based on the GT tag, so in this case it has nothing to work with. I am thinking whether it's worthwhile to make the program automatically take AD[0] for haploid genotypes, or if this is something the user should take care of.

Thanks for the explanation @pd3, I thought that binom(AD) takes the information directly from the AD tag, which has the counts for both alleles. I guess I can try to specify the fields explicitly, like in your example above (binom(AD[0],AD[1])) and see if it works.

Cheers, Ido

pd3 commented 2 months ago

AD is defined as Number=R, there can be more than two values. That's why GT step is required.

pd3 commented 2 months ago

I believe this is resolved now, please shout if not.