Closed chundruv closed 2 months ago
Any chance you could provide a concrete test case? Since you are quoting LAD, it is not clear what exactly the input data is like.
Also, is there an option when splitting multiallelics to set the non-ref calls on other alleles to missing? Just thinking that if you set it to homref it might not be the best for some analyses.
The command bcftools norm
has the option
--multi-overlaps 0|. Fill in the reference (0) or missing (.) allele when splitting multiallelics [0]
Also, is there a way to update the DP and other format fields when splitting?
The program does not attempt to recalculate DP from the split AD values, if that's what you mean. However, tere is the plugin +fill-tags
which allows to do that
bcftools +fill-tags in.bcf -Ob -o out.bcf -- -t 'FORMAT/DP:1=int(smpl_sum(FORMAT/AD))'
I'm having a similar problem - trying to filter "potential heterozygote" genotype calls from haploid-called VCF.
The site I'm looking at has genotypes call like the following - 0:12:7,5:7:280:5:184:0,-7.69402
(with this FORMAT: GT:DP:AD:RO:QR:AO:QA:GL
)
I've tried to reset it to missing with bcftools filter -S . -e 'binom(FMT/AD)>0.2'
, but it still keeps it in.
I also tried to print the values of the binomial test (according to #1049), but it returns .
, not a p-value:
$ bcftools query -f'[%CHROM:%POS %SAMPLE %GT %PBINOM(AD)\n]' example_het_sites_for_binomial_test.recode.vcf
ArME14_ctg_22:331265 Ar22438 1 .
ArME14_ctg_22:331265 Ar22466 . .
ArME14_ctg_22:331265 Ar22548 1 .
ArME14_ctg_22:331265 Ar22039 1 .
ArME14_ctg_22:331265 Ar22025 1 .
ArME14_ctg_22:331265 Ar22588 1 .
ArME14_ctg_22:331265 Ar22019 1 .
ArME14_ctg_22:331265 Ar22182 . .
ArME14_ctg_22:331265 Ar22348 1 .
ArME14_ctg_22:331265 Ar22109 1 .
ArME14_ctg_22:331265 Ar22636 . .
Thanks, Ido
(You can try on this toy set that includes just one site) example_het_sites_for_binomial_test.recode.zip
Sorry for the long delay @pd3 Attaching a test vcf test.vcf.zip
The test commands are:
filter before splitting: bcftools +setGT -Ou test.vcf -- -t q -i 'binom(FMT/LAD)<=0.001' -n . | bcftools norm -m - | bcftools view -o before_split.vcf
filter after splitting: bcftools norm -Ou -m- test.vcf | bcftools +setGT -Ou -- -t q -i 'binom(FMT/LAD)<=0.001' -n . | bcftools view -o after_split.vcf
Hope this helps, let me know if I can do anything else
@chundruv There is a problem with the input VCF: the tag LAA is defined as Type=String
but it must be defined as Type=Integer
. I just pushed a commit which prints more informative error message.
After the VCF is fixed, one can debug the binomial probabilities with bcftools query
. For example, we can run this on your test file:
$ bcftools +tag2tag test.vcf -- --LXX-to-XX | bcftools query -f '[%GT %AD %PBINOM(AD)]'
0/6 19,.,.,.,.,.,21,.,.,.,. 0.58176
The phred-scaled binomial probability in this case is 0.58176, which is the same value as computed with R
> -10*log10(binom.test(21,19+21,a="t")$p.value)
[1] 0.5817597
@IdoBar The problem is that the VCF has haploid genotypes: the binom(AD)
function selects the input values based on the GT tag, so in this case it has nothing to work with. I am thinking whether it's worthwhile to make the program automatically take AD[0] for haploid genotypes, or if this is something the user should take care of.
Hi Petr,
Thanks for looking at this. I tried the two commands I pasted in my last comment after changing LAA to integer in the header and the problem is still there. Even when I convert to AD instead of LAD.
Best, Kartik
Sorry! It does work after switching to AD from LAD, I forgot to use AD instead of LAD. But it still is an issue when using LAD even with the correct LAA header
As described in the documentation, the function can take one argument (binom(FMT/AD)
) and in that case the indices of values to use are determined from GT. Or one can define the fields explicitly
binom(FMT/AD) .. GT can be used to determine the correct index
binom(AD[0],AD[1]) .. or the fields can be given explicitly
so in your case you would do for example
$ bcftools query test.vcf -f '[%GT %LAD]' -i 'phred(binom(LAD[:0],LAD[:1]))>0.58175'
0/6 19,21
@IdoBar The problem is that the VCF has haploid genotypes: the
binom(AD)
function selects the input values based on the GT tag, so in this case it has nothing to work with. I am thinking whether it's worthwhile to make the program automatically take AD[0] for haploid genotypes, or if this is something the user should take care of.
Thanks for the explanation @pd3, I thought that binom(AD)
takes the information directly from the AD
tag, which has the counts for both alleles. I guess I can try to specify the fields explicitly, like in your example above (binom(AD[0],AD[1])
) and see if it works.
Cheers, Ido
AD is defined as Number=R
, there can be more than two values. That's why GT step is required.
I believe this is resolved now, please shout if not.
Hi Petr,
I noticed that when calculating the binomial of allele depth there is an issue when there is a multiallelic variant, which is not a problem when normalising first.
I had one position which was set to missing with the setGT when done before splitting, and not after splitting. It was a 0/6 genotype, the LAD was 19,21, and the GQ was 15. When you print out the binom before splitting it is giving a “.”, whereas after splitting it is giving the actual probabilities. However, when I manually change it to 0/1, it gives the correct probability without splitting so it’s a problem with the non-0/1 genotypes, it doesn’t like that.
Also, is there an option when splitting multiallelics to set the non-ref calls on other alleles to missing? Just thinking that if you set it to homref it might not be the best for some analyses. Also, is there a way to update the DP and other format fields when splitting?
Thanks, Karitk