vibansal / crisp

Code for multi-sample variant calling from sequence data of pooled or unpooled DNA samples
MIT License
19 stars 8 forks source link

Number of reads, allele count and allele frequency #9

Closed abofinal closed 5 years ago

abofinal commented 5 years ago

Hello, I'm having trouble making sense of some of the information fields from the vcf output. In this example:

2L 4602 . G A 182 LowDepth NP=24;DP=45,0,0;VT=SNV;CT=-0.0;VP=0;VF=EMpass;AC=651;AF=0.17108;EMstats=18.22:-0.30;HWEstats=-0.0;MQS=0,0,6,39;FLANKSEQ=gacagaggaa:G:cagaacagat AC:GQ:DP:ADf:ADr:ADb .:0:0:0,0:0,0:0,0 .:0:1:1,0:0,0:0,0 .:0:0:0,0:0,0:0,0 .:0:1:1,0:0,0:0,0 .:0:0:0,0:0,0:0,0 .:0:0:0,0:0,0:0,0 .:0:2:1,1:0,0:0,0 .:0:0:0,0:0,0:0,0 .:0:1:1,0:0,0:0,0 .:0:0:0,0:0,0:0,0 .:0:1:1,0:0,0:0,0 .:0:0:0,0:0,0:0,0 .:0:2:1,1:0,0:0,0 .:0:1:1,0:0,0:0,0 .:0:1:1,0:0,0:0,0 .:0:0:0,0:0,0:0,0 .:0:1:1,0:0,0:0,0 .:0:0:0,0:0,0:0,0 .:0:0:0,0:0,0:0,0 .:0:0:0,0:0,0:0,0 .:0:3:3,0:0,0:0,0 .:0:5:2,3:0,0:0,0 .:0:13:11,2:0,0:0,0 .:0:13:12,1:0,0:0,0

The number of reads is 45 (field 'DP', there are only forward reads in this example). The allele count is 651 (field 'AC') And the allele frequency is 0.17108 (field 'AF')

The thing is, I counted the reference and alternate alleles from all 24 pools in my example and there are 37 reference and 8 alternate, which would mean the frequency of alternate alleles would be 8/37=0.216216...

So why is allele frequency 0.17108? And where does that huge 651 value in allele count come from? I'm assuming my understanding of how allele frequency is calculated here is deeply flawed (I guess AC is used in AF's calculation, but I don't know how) and that's why I can't understand this output, would you mind explaining this for me?

Thanks a lot! Awesome tool, by the way.

vibansal commented 5 years ago

The 'AF' value is calculated jointly using all pools using an EM algorithm that sums over the likelihood of all possible allele counts. When coverage is low (such as in this case), the AC values for each pool are not reliable. To calculate the allele frequency using the AC counts, you need to sum over the AC values and divide by the product of the pool-size and the number of pools.

abofinal commented 5 years ago

Ah, thanks for your quick response! That clears my doubts away, you were a great help!