Closed sleibovitz closed 1 week ago
I don't understand what is the question really, but it seems the filtering expression is formatted incorrectly. For example, GQ<20 & GQ="."
can never be true at the same. Perhaps you meant GQ<20 | GQ="."
?
I want to set 0/0 GT to ./. or . and keep only genotypes that have GQ>=20, DP >=10 and 0.2<AB<0.8 (and change the rest missing) but even when I change the filtering from &
to |
it still doesn't work. And the genotypes it keeps are 0/0 with FORMAT[AD] missing .
.
Do you know what is the right command to do that? and should I use bcftools +setGT
instead of bcftools filter
?
+setGT
gives more flexibility than filter
, but for this task both will do.
I recommend building the expression gradually. Start with simple rules, e.g. -i 'GQ>=20
', then add -i 'GQ>=20 & DP>=10'
, and run it on a small test case to observe if it behaves as expected.
Note the plugin +fill-tags can be used to annotate with FORMAT/VAF, the equivalent of AB, which allows to simplify the expression.
Hello @pd3 , thank you for your replies. @sleibovitz arrived at the following command that seems to work for most of her purposes:
bcftools +fill-tags "${file_nodash}" -Ou -- -t FORMAT/VAF | \
bcftools +setGT -O u --threads 4 -- -t q -n . -i 'GQ<20 | DP<10 | VAF < 0.2 | VAF > 0.8' | \
bcftools annotate -x ^FORMAT/GT -Oz -o "${vcf_name}.reduced.vcf.gz"
It annotate the file with VAF, removes entries with GQ < 20 or DP < 10 or VAF < 0.2 or VAF > 0.8., and then hard calls the variant.
However, the VAF filter should only apply for heterozygous variants; in some data sets, AD is provided, so VAF is non-zero for homozygous variants; in other data sets, AD is NA for homs, and thus, VAF is NA as well.
Thanks!
This should work to check for hets
-i 'GT="het" & (...)'
and for missing values
-i'VAF!="." & (...)'
http://samtools.github.io/bcftools/bcftools.html#expressions
For example
$ echo -e '##fileformat=VCFv4.2
##contig=<ID=22>
##FORMAT=<ID=GT,Number=1,Type=String>
##FORMAT=<ID=VAF,Number=1,Type=Float>
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsmpl
22\t1\t.\tC\tG\t.\t.\t.\tGT:VAF\t0/1:0.1
22\t1\t.\tC\tG\t.\t.\t.\tGT:VAF\t0/1:.
22\t1\t.\tC\tG\t.\t.\t.\tGT:VAF\t0/0:0.1
22\t1\t.\tC\tG\t.\t.\t.\tGT:VAF\t0/0:.' > test.vcf
$ cat test.vcf
##fileformat=VCFv4.2
##contig=<ID=22>
##FORMAT=<ID=GT,Number=1,Type=String>
##FORMAT=<ID=VAF,Number=1,Type=Float>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT smpl
22 1 . C G . . . GT:VAF 0/1:0.1
22 1 . C G . . . GT:VAF 0/0:0.1
$ bcftools view test.vcf -Hi 'GT="het" & VAF>0'
22 1 . C G . . . GT:VAF 0/1:0.1
$ bcftools view test.vcf -Hi 'GT="hom" & VAF="."'
22 1 . C G . . . GT:VAF 0/0:.
Make sure you work with a recent version of bcftools, there were some bugs in handling of missing values in past.
I hope this helps!
Hi! I was trying to filter (or set to missing ".") genotypes using the following command: My goal here is to keep genotypes that pass: GQ>=20, DP >=10 and 0.2<AB<0.8
When I printed genotypes to check it seems that there were only GT 0/0 and AD was always missing "."
I was thinking that maybe a better approach is to use
bcftools +setGT
like in the example in this page:bcftools +setGT in.vcf.bgz -- -t q -n . -i 'FMT/DP>=10 & FMT/GQ>=20'
but not sure how to set it up properly to keep GQ>=20, DP >=10 and 0.2<AB<0.8.I'll appreciate your help!
Best, Shaked