iqbal-lab / cortex

reference free variant assembly
32 stars 13 forks source link

Fix known bugs in genotyping and look at confidence distribution #11

Open iqbal-lab opened 8 years ago

iqbal-lab commented 8 years ago

To fix:

  1. poisson error rate uses estimates of total read count - prefer to use median depth
  2. improve gt confidence
iqbal-lab commented 8 years ago

At this commit on branch fix_poisson:

commit 85c17ccffaeb14c691d489dbaaf11e144e310cb6 Author: Zamin Iqbal zam@well.ox.ac.uk Date: Tue Nov 17 13:51:44 2015 +0000

Add script for getting confusion matrices

Results are looking pretty interesting.

I've taken a binary of NA12878 (this is a human sample, and have used Platinum genome data, cleaning with threshold 7), and intersected with 5000 random OMNI sites. I then compare with the 1000g calls at those sites as truth

I then calculate confusion matrices of genotyping for v1.0.5.21 of Cortex and for this commit, and compare them. To clarify the signal, I produce confusion matrices for calls with genotype confidence between 0 and 9, and for those between 10 and 19, and... Here's some results. Left column is v1.0.5.21 and right is tip of fix_poisson

comp

Notice that we now put more stuff in the low confidence bin, but by the time confidence>10, in the new commit, we lose much fewer hets, but we do call more as missing (which is bizarre). So apart from the extra missingness, I am v happy

iqbal-lab commented 8 years ago

Oh no it's OK! There are a bunch of sites which we call as missing - in the old commit they have low confidence and now they have higher. I think the missingness is likely because these OMNI sites are actually adjacent to indels. There is a separate question as to what the confidence of a ./. call is - will look into this

iqbal-lab commented 8 years ago

Merged into master here

https://github.com/iqbal-lab/cortex/commit/85c17ccffaeb14c691d489dbaaf11e144e310cb6

iqbal-lab commented 8 years ago

I just want to check the genotyping of longer alleles and then ok to close. All unit tests pass including with valgrind.

hcdenbakker commented 8 years ago

'There is a separate question as to what the confidence of a ./. call is - will look into this' Would indeed be great if we could distinguish between truly missing (deletion) and uncertainty in calling. Back to my lectures now, just had to add this.

iqbal-lab commented 8 years ago

OK, so in two parts.

Why does a ./. call have a confidence?

When Cortex genotypes a bubble, you pass it a model. That model is diploid or haploid at the moment, and allows you to be either hom-ref, het, or hom-alt (for diploid, haploid is either hom-ref or hom-alt). Now suppose one allele is longer than another. And suppose we have zero covg on both alleles. Well then, the most likely model is homozygous for the shorter allele, as more chance of having zero covg on two short alleles. So, Cortex types it thus. Then, a wrapper script converts the cortex output to a VCF, and forcibly sets genotype to ./. if there is zero coverage on both alleles, as that is what most consumers want, but leaves the confidence unchanged.

Can we distinguish truly missing from uncertainty in calling If, Henk, you mean to ask us to support the possibility, at any site, of the whole locus being deleted - I guess we could do that, but currently do not. One reason for going back over all samples and genotyping at all sites is to be able to make a call on SNPs/whatever that were not called in person-X, to see if we think they are hom-ref, or actually het/hom-alt, but had not been called previously due to low confidence.