samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
633 stars 241 forks source link

Genotype Hard Calling- keeping GQ>=20, DP >=10 and 0.2<AB<0.8 #2203

Closed sleibovitz closed 1 week ago

sleibovitz commented 1 month ago

Hi! I was trying to filter (or set to missing ".") genotypes using the following command: My goal here is to keep genotypes that pass: GQ>=20, DP >=10 and 0.2<AB<0.8

bcftools filter \
    -e '(FORMAT/GQ < 20 & FORMAT/GQ=".") | 
        (FORMAT/DP < 10 & FORMAT/DP=".") |
        ((FORMAT/AD[*:1]/(FORMAT/AD[*:0]+FORMAT/AD[*:1]) < 0.2) | 
        (FORMAT/AD[*:1]/(FORMAT/AD[*:0]+FORMAT/AD[*:1]) > 0.8))' \
    -S . "${INPUT_VCF}" -Oz -o "${INTERMEDIATE_OUTPUT_VCF}"

When I printed genotypes to check it seems that there were only GT 0/0 and AD was always missing "."

I was thinking that maybe a better approach is to use bcftools +setGT like in the example in this page: bcftools +setGT in.vcf.bgz -- -t q -n . -i 'FMT/DP>=10 & FMT/GQ>=20' but not sure how to set it up properly to keep GQ>=20, DP >=10 and 0.2<AB<0.8.

I'll appreciate your help!

Best, Shaked

pd3 commented 1 month ago

I don't understand what is the question really, but it seems the filtering expression is formatted incorrectly. For example, GQ<20 & GQ="." can never be true at the same. Perhaps you meant GQ<20 | GQ="."?

sleibovitz commented 1 month ago

I want to set 0/0 GT to ./. or . and keep only genotypes that have GQ>=20, DP >=10 and 0.2<AB<0.8 (and change the rest missing) but even when I change the filtering from & to | it still doesn't work. And the genotypes it keeps are 0/0 with FORMAT[AD] missing . .

Do you know what is the right command to do that? and should I use bcftools +setGT instead of bcftools filter?

pd3 commented 4 weeks ago

+setGT gives more flexibility than filter, but for this task both will do.

I recommend building the expression gradually. Start with simple rules, e.g. -i 'GQ>=20', then add -i 'GQ>=20 & DP>=10', and run it on a small test case to observe if it behaves as expected.

Note the plugin +fill-tags can be used to annotate with FORMAT/VAF, the equivalent of AB, which allows to simplify the expression.

TarjinderSingh commented 3 weeks ago

Hello @pd3 , thank you for your replies. @sleibovitz arrived at the following command that seems to work for most of her purposes:

bcftools +fill-tags "${file_nodash}" -Ou -- -t FORMAT/VAF | \
    bcftools +setGT -O u --threads 4 -- -t q -n . -i 'GQ<20 | DP<10 | VAF < 0.2 | VAF > 0.8' | \
    bcftools annotate -x ^FORMAT/GT -Oz -o "${vcf_name}.reduced.vcf.gz"

It annotate the file with VAF, removes entries with GQ < 20 or DP < 10 or VAF < 0.2 or VAF > 0.8., and then hard calls the variant.

However, the VAF filter should only apply for heterozygous variants; in some data sets, AD is provided, so VAF is non-zero for homozygous variants; in other data sets, AD is NA for homs, and thus, VAF is NA as well.

  1. How can we modify that expression to specifically consider the VAF thresholds for hets only? 2 Alternatively, we can add an additional check for if VAF is missing. But VAF != '.' & (VAF < 0.2 | VAF > 0.8)' doesn't seem to work.
  2. Finally, how should we deal with checking missing values in the expression? ie if GQ is missing, we might want to do something else.

Thanks!

pd3 commented 1 week ago

This should work to check for hets

-i 'GT="het" & (...)'

and for missing values

-i'VAF!="." & (...)'

http://samtools.github.io/bcftools/bcftools.html#expressions

For example

$ echo -e '##fileformat=VCFv4.2
##contig=<ID=22>
##FORMAT=<ID=GT,Number=1,Type=String>
##FORMAT=<ID=VAF,Number=1,Type=Float>
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsmpl
22\t1\t.\tC\tG\t.\t.\t.\tGT:VAF\t0/1:0.1
22\t1\t.\tC\tG\t.\t.\t.\tGT:VAF\t0/1:.
22\t1\t.\tC\tG\t.\t.\t.\tGT:VAF\t0/0:0.1
22\t1\t.\tC\tG\t.\t.\t.\tGT:VAF\t0/0:.' > test.vcf

$ cat test.vcf
##fileformat=VCFv4.2
##contig=<ID=22>
##FORMAT=<ID=GT,Number=1,Type=String>
##FORMAT=<ID=VAF,Number=1,Type=Float>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  smpl
22  1   .   C   G   .   .   .   GT:VAF  0/1:0.1
22  1   .   C   G   .   .   .   GT:VAF  0/0:0.1

$ bcftools view test.vcf -Hi 'GT="het" & VAF>0'
22  1   .   C   G   .   .   .   GT:VAF  0/1:0.1

$ bcftools view test.vcf -Hi 'GT="hom" & VAF="."'
22  1   .   C   G   .   .   .   GT:VAF  0/0:.

Make sure you work with a recent version of bcftools, there were some bugs in handling of missing values in past.

I hope this helps!