mbhall88 opened this issue 4 years ago
I remember being confused by these two different error rates when refactoring some code, but I did not investigate as deeply as you did, as I was worried about ensuring the refactoring would produce the same results.
Just to complement that, this is how we compute the log likelihood:
As Michael previously stated, this e should be the error rate, which is currently always fixed at 0.01. From my understanding, this is a bug, and it is easy to fix.
If we assume the ONT error rate is around 0.11, then ln(0.11) = -2.20, but we have been using the fixed ln(0.01) = -4.60. So we have been giving more weight than we should to c_b (the total mean coverage of all other alleles), effectively pushing likelihoods down. This has a larger effect on the likelihood of lowly covered alleles than on highly covered alleles (which are likely the ones chosen), so I think for ONT data this is leading to overestimated genotype confidences. For Illumina I think the effect is the opposite: we are underestimating genotype confidences. I could be wrong on these...
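The scale of this effect is easy to sketch. Assuming the log-likelihood penalises the other-allele coverage c_b with a term of roughly c_b * ln(e) (my reading of the discussion above; the exact pandora expression may differ), the gap between the fixed e = 0.01 and the ONT rate e = 0.11 grows linearly with c_b:

```python
import math

# Hedged sketch: assume the log-likelihood contains a term c_b * ln(e),
# where c_b is the mean coverage on the other alleles and e is the error
# rate. The exact pandora formula may differ; this only illustrates scale.
def error_penalty(c_b: float, e: float) -> float:
    return c_b * math.log(e)

for c_b in (2, 20):
    fixed = error_penalty(c_b, 0.01)  # the hard-coded genotyping error rate
    ont = error_penalty(c_b, 0.11)    # the ONT error rate arguably intended
    print(f"c_b={c_b}: fixed={fixed:.2f}, ont={ont:.2f}, gap={fixed - ont:.2f}")
```

Each unit of c_b is penalised by about 2.4 extra log-units under the fixed rate (ln(0.01) - ln(0.11) ≈ -2.40), so the distortion is small for lowly covered sites and large for highly covered ones.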
Anyway, it is hard to predict the impact of this fix on the paper results. My blind guess is that it won't change much, but it really is just a blind guess. It is indeed easy to fix, but it would require us to reproduce the results for the paper once again. The fix itself should take no more than 10 minutes (I guess), and putting everything to run should take 5 minutes, but then we have to wait for the results to be produced.
Maybe we should confirm this is indeed a bug first...
We definitely shouldn't touch this lightly. This is exactly where the difference between our mental models and reality is most clear. Definitely do not fix this, Leandro, until you have weeks available for debugging. I propose we add this (Leandro's bug) to the list of things to return to. Returning to Michael's issue: I'd like to understand both the intention and the current implementation, and how it has been evaluated. But I want to be free of the current super-heavyweight pipelines first and evaluate on something small and controlled post-paper. I've chatted a bit with Rachel this week; maybe we can have a brief 4-way call next week.
There is always the possibility of evaluating this with my TB dataset instead? Much faster turnaround time, and I am doing the evaluation anyway...
Interesting. I think this must be a bug. I made the commit here https://github.com/rmcolq/pandora/commit/062e4ebe84424619241c59a14582536000422700#diff-23632e736a3562808e4b2d5bf5dc59c7 to add the genotyping_error_rate parameter. It was just a constant number before this. However, I did not carry it through any further. I have no idea why the error rate there should be different from the error rate used for alignment-related things (although I would hesitate to say that they should be the same without more brainpower to think it through).
Yeah, speaking with Zam we came to this conclusion too. Requires some careful thinking...
I have been playing around with these two error rates today.
The data I am parameter-sweeping with is Nanopore Mtb data. The samples have "truth" assemblies from PacBio CCS, and the evaluation is done with varifier.
The x-axis labels indicate the filters/parameters used. The COMPASS (left-most) box shows SNP calls from the sample's Illumina data using the pipeline of the same name, and bcftools shows SNP calls made from the Nanopore data with that tool.
Filters/parameters key:
For a more comprehensive examination of a filter sweep, see here.
I've been staring at these sorts of plots all day, so it would be nice to have some fresh eyes to interpret.
what are the default values for e and E?
Is it possible to replot these only for samples with >40x depth, so we can disentangle coverage?
Just deleted a comment. So: we can achieve the same precision as COMPASS for SNPs if we lose 15% recall (some of this will be due to low-coverage samples and not having GCP implemented). Instead of SNPs/all, could we see SNPs/indels?
what are the default values for e and E?
e = 0.11 (nanopore) and E = 0.01 (always)
I'll put together those other plots today
Is it possible to replot these only for samples with >40x depth, so we can disentangle coverage?
Here are the plots for samples with coverage over 40x (which is 5/7). Note: there are a few extra filters in there now, as I was concurrently working on https://github.com/mbhall88/head_to_head_pipeline/issues/48
Instead of snps/All could we see SNPs/indels?
This is a lot harder and will take a bit longer. I will have to create indel-only VCFs and then run varifier on them.
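For reference, one way to produce the indel-only VCFs would be with bcftools' type filtering (filenames here are placeholders, not the actual pipeline paths):

```shell
# Keep only indel records from a per-sample VCF (hypothetical filenames)
bcftools view --types indels sample.vcf.gz -Oz -o sample.indels.vcf.gz
bcftools index sample.indels.vcf.gz
```

The resulting indel-only VCFs could then be passed to varifier in the same way as the full call sets.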
I have a slight concern/misunderstanding regarding error rate.
We have two separate error rates in pandora:

- `-e,--error_rate`, which defaults to 0.11 for Nanopore and 0.001 for Illumina.
- `--genotyping-error-rate` (which was hidden prior to my CLI PR), which is set to 0.01.

The genotyping error rate is used for computing likelihoods in the VCF, whereas the first error rate is used for alignment-related tasks, estimating parameters for the kmer graph model, and de novo discovery.
Why are these error rates different? From Rachel's thesis (p65, 5.2.2 Genotyping):
Shouldn't this error rate also be a function of the sequencing technology used?
I.e. from p45 4.3.2 Quasi-mapping to the Index (Setting the minimum cluster size)