iobio / vcf.iobio.io

MIT License
27 stars 11 forks source link

AF calculation #29

Open tonydisera opened 8 years ago

tonydisera commented 8 years ago

From Al: I just looked at the vcf from 10X and the AF plot shows that everything is at 50%. I looked at the vcf file and there are many genotypes that are not 0/1 with depths supporting only alternate observations, for example. The AF in the info fields are also not 0.5.

Can we check on the AF calculation to be sure we are calculating this correctly?

http://vcf.iobio.io/?vcf=https://s3.amazonaws.com/iobio/app_testing/vcf_files/NA12878_10X.vcf.gz

screen shot 2016-06-08 at 8 20 18 am
tonydisera commented 8 years ago

From Yi - code review welcomed. here are the lines regarding AF calculation

https://github.com/yiq/vcfstatsalive/blob/master/BasicStatsCollector.cpp#L148-L156

One confession is that right now vcfstatsalive does not handle multi-allelic sites. It simple grabs the first value available.

tonydisera commented 8 years ago

From Yi - two revelation while talking to @tonydisera

1) my code actually does multi-allelic. It's just that it's not doing multi-allelic when it comes to AF 2) I probably should change the code that calculates allele frequency based on read counts to sample ratio, because AO / DP is not really what people would come to expect what allele frequency is. That's more of a cancer concept

tonydisera commented 8 years ago

From Al - Looking at the 10X vcf, the AF field is present in the INFO string, so vcfstatsalive should be pulling this value. There are many cases where AF=1.0. It is a single sample vcf, so there are only instances of AF=0.5 and AF=1.0, so I would expect the distribution to show values at 1.0.

And yes, we should probably change the calculation to reflect the number of alleles out of the total showing the alternate, so AF = AC / AN. I'm guessing that if AF is not present in the vcf though, neither will AC and AN, so this might be moot. In that case, we may have no choice but to default to the depth calculation you already have. The other alternative would be to parse all of the genotypes and count AC and AN, but that would be violently expensive for vcf files with many samples, so probably not feasible.

tonydisera commented 8 years ago

Looking at only multi-allelics (chr1) of this file, all AFs are set to .5, with a few occasional 1 or 0:

AF=0.5,0.5 AF=0.5,0.5 AF=0.5,0.5 AF=0.5,0.5 AF=0.5,0.5 AF=0.5,0.5 AF=0.5,0.5 AF=0.5,0.5 AF=0.5,0.5 AF=0.5,0.5 AF=0.5,0.5 AF=0.0,0.0,0.5

tonydisera commented 8 years ago

violently expensive? word well chosen. …

On Wed, Jun 8, 2016 at 10:53 AM, Alistair Ward _@_.***> wrote: Looking at the 10X vcf, the AF field is present in the INFO string, so vcfstatsalive should be pulling this value. There are many cases where AF=1.0. It is a single sample vcf, so there are only instances of AF=0.5 and AF=1.0, so I would expect the distribution to show values at 1.0.

And yes, we should probably change the calculation to reflect the number of alleles out of the total showing the alternate, so AF = AC / AN. I'm guessing that if AF is not present in the vcf though, neither will AC and AN, so this might be moot. In that case, we may have no choice but to default to the depth calculation you already have. The other alternative would be to parse all of the genotypes and count AC and AN, but that would be violently expensive for vcf files with many samples, so probably not feasible.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub <#314 (comment)>, or mute the thread https://github.com/notifications/unsubscribe/AAWIAB7gBjlSHrIR6tik7t7iFMuJrFoAks5qJvN3gaJpZM4IxFQV .

tonydisera commented 8 years ago

See https://github.com/tonydisera/vcf.iobio.io/issues/29