samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
680 stars 240 forks source link

How to populate Fisher test annotation #2287

Open pd3 opened 1 month ago

pd3 commented 1 month ago

Hello, I'm revisiting this comment as I've run into a similar observation in my dataset- the FS field (fisher test for strand bias) does not seem to be correctly populated — it is coming up as zero for all genotypes so everything is passing. I've read through this thread and am a bit confused- is this something that needs to be explicitly specified (ie, accounting for this is not the default)?

We've just spent a long time calling reads for a very large dataset and would like to avoid doing it all over again. Is there a way to populate this field after the bcftools mpileup? We have retained the read depth for for/rev reads supporting alt/ref so is it possible to calculate this without recalling genotypes? Thank you so much!

Originally posted by @paigeduffin in https://github.com/samtools/bcftools/issues/1834#issuecomment-2354477466

pd3 commented 1 month ago

Depending on the version used, recent versions of mpileup have to be run with -a INFO/FS in order for the field to be populated. Run bcftools mpileup -a \? to see which annotations are filled in by default and which need to be given explicitly.

However, what to do now, if recalling is not an option? I see three possibilities:

1) Extract the necessary information using bcftools query, compute the test using a third-party software, then reannotate with bcftools annotate.

2) Use the bcftools +ad-bias plugin, it can filter by Fisher test on FMT/AD

3) Perhaps best would be to extend the plugin +fill-tags, currently it supports the following tags, adding Fisher test would be trivial

$ bcftools +fill-tags -- -l  
INFO/AC        Number:A  Type:Integer  ..  Allele count in genotypes
INFO/AC_Hom    Number:A  Type:Integer  ..  Allele counts in homozygous genotypes
INFO/AC_Het    Number:A  Type:Integer  ..  Allele counts in heterozygous genotypes
INFO/AC_Hemi   Number:A  Type:Integer  ..  Allele counts in hemizygous genotypes
INFO/AF        Number:A  Type:Float    ..  Allele frequency from FMT/GT or AC,AN if FMT/GT is not present
INFO/AN        Number:1  Type:Integer  ..  Total number of alleles in called genotypes
INFO/ExcHet    Number:A  Type:Float    ..  Test excess heterozygosity; 1=good, 0=bad
INFO/END       Number:1  Type:Integer  ..  End position of the variant
INFO/F_MISSING Number:1  Type:Float    ..  Fraction of missing genotypes (all samples, experimental)
INFO/HWE       Number:A  Type:Float    ..  HWE test (PMID:15789306); 1=good, 0=bad
INFO/MAF       Number:1  Type:Float    ..  Frequency of the second most common allele
INFO/NS        Number:1  Type:Integer  ..  Number of samples with data
INFO/TYPE      Number:.  Type:String   ..  The record type (REF,SNP,MNP,INDEL,etc)
FORMAT/VAF     Number:A  Type:Float    ..  The fraction of reads with the alternate allele, requires FORMAT/AD or ADF+ADR
FORMAT/VAF1    Number:1  Type:Float    ..  The same as FORMAT/VAF but for all alternate alleles cumulatively
TAG:Number=Type(EXPR)                  ..  Experimental support for user expressions such as DP:1=int(sum(DP))
               If Number and Type are not given (e.g. DP=sum(DP)), variable number (Number=.) of floating point
               values (Type=Float) will be used.