iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License

Error rate vs. genotyping error rate #232

Open mbhall88 opened 4 years ago

mbhall88 commented 4 years ago

I have a slight concern/misunderstanding regarding error rate.

We have two separate error rates in pandora.

  1. -e,--error_rate which defaults to 0.11 for nanopore and 0.001 for Illumina.
  2. --genotyping-error-rate (which was hidden prior to my CLI PR) which is set to 0.01.

The genotyping error rate is used for computing likelihood in the VCF whereas the first error rate is used for alignment-related tasks, estimating parameters for the kmer graph model, and de novo discovery.

Why are these error rates different? From Rachel's thesis (p65 - 5.2.2 Genotyping)

We model the mean k-mer coverage on the true allele as Poisson distributed with rate equal to the expected depth of coverage d. To account for noise, we assume that coverage on alternative alleles arises as a result of an error process with rate e.
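The model in that quote is easy to sketch numerically. A minimal illustration (not pandora's actual code) scoring an observed coverage under a Poisson with rate d for the true allele versus rate e·d for coverage explained by the error process; d = 30 and e = 0.01 are illustrative values:

```python
import math

def log_poisson(c, rate):
    """Log of the Poisson pmf: P(c | rate) = rate^c * exp(-rate) / c!"""
    return c * math.log(rate) - rate - math.lgamma(c + 1)

d = 30      # expected depth of coverage (illustrative)
e = 0.01    # error rate
c = 30      # observed mean k-mer coverage on an allele

true_ll = log_poisson(c, d)       # coverage explained by the true allele
error_ll = log_poisson(c, e * d)  # same coverage explained by errors alone
print(true_ll, error_ll)
```

As expected, 30x coverage is vastly more plausible under the true-allele model than under the error model, which is what lets genotyping separate the two.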

Shouldn't this error rate also be a function of the sequencing technology used?
I.e. from p45 4.3.2 Quasi-mapping to the Index (Setting the minimum cluster size)

To account for sequencing errors, we estimate that if errors occur independently as a Poisson distribution with rate e along a sequence, the probability that a k-mer has been sequenced with no errors is exp(−ke).
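That expression can be checked numerically. Assuming a k-mer size of 15 (an illustrative value, not necessarily pandora's default), the two default error rates give very different probabilities of an error-free k-mer:

```python
import math

def p_kmer_error_free(k, e):
    """Probability a length-k k-mer contains no sequencing errors,
    assuming errors arrive independently at per-base rate e: exp(-k*e)."""
    return math.exp(-k * e)

print(round(p_kmer_error_free(15, 0.11), 3))   # nanopore default e -> 0.192
print(round(p_kmer_error_free(15, 0.001), 3))  # Illumina default e -> 0.985
```

This is why the error rate feeds into the minimum cluster size: with nanopore error rates, only about a fifth of k-mers are expected to be sequenced error-free.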

leoisl commented 4 years ago

I remember being confused by these two different error rates when refactoring some stuff, but I did not dig into it as deeply as you did, as I was focused on ensuring the refactoring would produce the same results.

To complement this, here is how we compute the log likelihood:

[image: log likelihood formula]

As Michael previously stated, this e should be the error rate, which is currently always fixed at 0.01. From my understanding, this is a bug, and it is easy to fix.

If we assume the ONT error rate is around 0.11, ln(0.11) = -2.20. But we have been using the fixed ln(0.01) = -4.60. So we have been giving more weight than we should to c_b (the total mean coverage of all other alleles), effectively pushing likelihoods down. But this has a larger effect on the likelihood of lowly covered alleles than on highly covered alleles (which are likely the ones chosen), so I think for ONT data this is leading to overestimated genotype confidences. I think for Illumina the effect is the opposite: we are underestimating genotype confidences. I could be wrong on these...
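The size of the effect is easy to see in isolation. A minimal sketch of just the c_b·ln(e) term described above (not pandora's full likelihood), comparing the hard-coded 0.01 with an ONT-appropriate rate:

```python
import math

def error_term(c_b, e):
    """Contribution of coverage on the other alleles to the log
    likelihood, modelled here simply as c_b * ln(e)."""
    return c_b * math.log(e)

c_b = 10  # 10x of stray coverage on alternative alleles (illustrative)
print(round(error_term(c_b, 0.11), 2))  # ONT-appropriate rate -> -22.07
print(round(error_term(c_b, 0.01), 2))  # hard-coded genotyping rate -> -46.05
```

With the hard-coded rate, the same stray coverage drags the log likelihood down roughly twice as hard as it would with the ONT rate, and the gap grows linearly with c_b.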

Anyway, it is hard to predict what the impact of this fix would be on the paper results. My blind guess is that it won't change much, but it really is just a blind guess. It is indeed easy to fix, but it would require us to reproduce the results for the paper once again. The fix itself should take no more than 10 minutes (I guess), and putting everything to run another 5 minutes, but then we have to wait for the results to be produced.

Maybe we should confirm this is indeed a bug first...

iqbal-lab commented 4 years ago

We definitely shouldn't touch this lightly. This is exactly where the difference between our mental models and reality is most clear. Definitely do not fix this, Leandro, until you have weeks available for debugging. I propose we add this (Leandro's bug) to the list of things to return to. Returning to Michael's issue: I'd like to understand both the intention and the current implementation, and how it has been evaluated. But I want to be free of the current super-heavyweight pipelines first and evaluate on something small and controlled post-paper. I've chatted a bit with Rachel this week; maybe we can have a brief 4-way call next week.

mbhall88 commented 4 years ago

There is always the possibility of evaluating this with my TB dataset instead? Much faster turnaround time, and I am doing the evaluation anyway...

rmcolq commented 4 years ago

Interesting. I think this must be a bug. I made the commit here https://github.com/rmcolq/pandora/commit/062e4ebe84424619241c59a14582536000422700#diff-23632e736a3562808e4b2d5bf5dc59c7 to add the genotyping_error_rate parameter. It was just a constant number before this. However, I did not carry it through any further. I have no idea why the error rate there should be different from the error rate used for alignment things (although I would hesitate to say that they should be the same without more brain power to think it through).

mbhall88 commented 4 years ago

I have no idea why the error rate there should be different from the error rate used for alignment things (although I would hesitate to say that they should be the same without more brain power to think it through)

Yeah, speaking with Zam we came to this conclusion too. Requires some careful thinking...

mbhall88 commented 3 years ago

I have been playing around with these two error rates today.

The data I am parameter sweeping with is Nanopore Mtb data. The samples have "truth" assemblies from PacBio CCS, and the evaluation is done with varifier.

The x-axis labels indicate the filters/parameters used. The COMPASS (left-most) box shows SNP calls from the sample's Illumina data using the pipeline of the same name, and the bcftools box shows SNP calls made from the nanopore data with that tool.

Filters/parameters key:

For a more comprehensive examination of a filter sweep, see here.

Recall - SNPs only

[image: recall plot, SNPs only]

Precision - SNPs only

[image: precision plot, SNPs only]

Recall - all variants

[image: recall plot, all variants]

Precision - all variants

[image: precision plot, all variants]


I've been staring at these sorts of plots all day, so it would be nice to have some fresh eyes to interpret.

iqbal-lab commented 3 years ago

what are the default values for e and E?

iqbal-lab commented 3 years ago

Is it possible to replot these only for samples with >40x depth, so we can disentangle coverage?

iqbal-lab commented 3 years ago

Just deleted a comment. So: we can achieve the same precision as COMPASS for SNPs if we lose 15% recall (some of this will be due to low-coverage samples and not having GCP implemented). Instead of SNPs/all, could we see SNPs/indels?

mbhall88 commented 3 years ago

what are the default values for e and E?

e = 0.11 (nanopore) and E = 0.01 (always)

I'll put together those other plots today

mbhall88 commented 3 years ago

Is it possible to replot these only for samples with >40x depth, so we can disentangle coverage?

Here are the plots for samples that have coverage over 40x (which is 5/7). Note: there are a few extra filters in there now, as I was concurrently working on https://github.com/mbhall88/head_to_head_pipeline/issues/48

Recall - SNPs only

[image: recall plot, SNPs only]

Precision - SNPs only

[image: precision plot, SNPs only]

Recall - all variants

[image: recall plot, all variants]

Precision - all variants

[image: precision plot, all variants]


Instead of snps/All could we see SNPs/indels?

This is a lot harder and will take a bit longer. I will have to create indel-only VCFs and then run varifier on them.