Closed mbhall88 closed 5 years ago
A true positive (TP) is defined as an entry in pandora's VCF which, when taken with flanking sequence, maps over a variant position in the reference panel and has the correct base call at the expected position.
Mismatches are allowed in the flanks.
A false positive (FP) is defined as an entry in pandora's VCF which, when taken with flanking sequence, maps over a variant position in the reference panel and has the incorrect base call at the expected position.
Mismatches are allowed in the flanks.
A false negative (FN) is defined as an entry in the reference panel which does not have any variants from pandora's VCF that map across its variant position.
Note: a pandora variant call can map to an entry in the reference panel without mapping across the middle position (which is the variant position).
A true negative (TN) is basically any position that we have correctly left the same as the reference. True negatives would only really be relevant if we decide to apply the variant calls from pandora onto the consensus sequence from pandora and do a base-by-base comparison of that to the reference sequence.
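As a rough illustration of the TP, FP and FN definitions above, here is a minimal sketch of how the counts could be tallied. It assumes each mapped VCF probe has already been reduced to a small record; the `ProbeMapping` fields and the `count_calls` function are hypothetical names made up for this example and are not part of pandora.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ProbeMapping:
    """One VCF entry (with flanks) mapped to a reference panel entry.

    All field names are hypothetical and exist only for this sketch.
    """
    panel_id: str                # reference panel entry the probe mapped to
    spans_variant: bool          # does the mapping cover the variant position?
    called_base: Optional[str]   # base called at the variant position, if spanned
    expected_base: str           # base the panel expects at that position


def count_calls(mappings: Iterable[ProbeMapping], panel_ids: Iterable[str]) -> Counter:
    """Tally TP, FP and FN following the definitions in this thread."""
    counts = Counter()
    covered = set()
    for m in mappings:
        if not m.spans_variant:
            # Maps to the entry but not across its variant position: ignored.
            continue
        covered.add(m.panel_id)
        if m.called_base == m.expected_base:
            counts["TP"] += 1
        else:
            counts["FP"] += 1
    # Panel entries with no call mapping across their variant position.
    counts["FN"] = len(set(panel_ids) - covered)
    return counts


mappings = [
    ProbeMapping("p1", True, "A", "A"),
    ProbeMapping("p2", True, "C", "G"),
    ProbeMapping("p3", False, None, "T"),
]
counts = count_calls(mappings, ["p1", "p2", "p3", "p4"])
```

Mismatches in the flanks play no role here because only the base at the variant position is compared; TNs are not counted at all in this sketch.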
Accuracy calculation: (TP+TN)/(TP+FP+TN+FN)
The discussion point here is whether or not to include TN.
Specificity calculation: TN/(TN+FP)
If we decide to "ignore" TNs then this metric would not be relevant.
Recall (sensitivity) calculation: TP/(TP+FN)
A very important metric for us, and it should be quite straightforward to calculate.
Precision calculation: TP/(TP+FP)
Another very important metric for us. However, we first need to decide on our definition of FP.
False positive rate calculation: FP/(TN+FP)
Not relevant if we decide to ignore TNs.
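To make the calculations above concrete, here is a small sketch that computes them from raw counts. The function name `evaluation_metrics` and the choice to pass `tn=None` when TNs are ignored are assumptions for illustration only; in that case the TN-dependent metrics are returned as `None` rather than computed.

```python
from typing import Dict, Optional


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def evaluation_metrics(tp: int, fp: int, fn: int,
                       tn: Optional[int] = None) -> Dict[str, Optional[float]]:
    """Compute the metrics discussed above; tn=None means TNs are ignored."""
    metrics: Dict[str, Optional[float]] = {
        "recall": _ratio(tp, tp + fn),
        "precision": _ratio(tp, tp + fp),
        "accuracy": None,
        "specificity": None,
        "false_positive_rate": None,
    }
    if tn is not None:
        metrics["accuracy"] = _ratio(tp + tn, tp + fp + tn + fn)
        metrics["specificity"] = _ratio(tn, tn + fp)
        metrics["false_positive_rate"] = _ratio(fp, tn + fp)
    return metrics


without_tn = evaluation_metrics(tp=8, fp=2, fn=2)
with_tn = evaluation_metrics(tp=8, fp=2, fn=2, tn=88)
```

With TNs left out, only recall and precision are populated, which matches the point that specificity and false positive rate stop being meaningful if TNs are ignored.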
I think there is some ambiguity around the false positive definition. It could conceivably be defined as the total number of mismatches from mapping all pandora variant calls onto the original reference (not just the reference panel). The catch, I guess, is that it becomes messy to tell whether de novo is at fault or whether pandora's genotype model is causing any given mismatch.
small typo in definition of accuracy: This Calculation: TP+TN/TP+FP+TN+FN should be Calculation: TP/TP+FP+TN+FN i think
I think it's easy to separate the impact of de novo as follows (making the assumption that we have ensured that all simulated SNPs are outside the PRG). Define candidates to mean the list of alleles that de novo generates.
For de novo, define recall = % of simulated mutations where the mutant allele was included in the candidates; precision = % of slices where we perform de novo that include a simulated mutation within.
These measure whether de novo is doing its job.
Then, over and above these, measure sens/spec etc for the VCF as you've mentioned above - these measure how well pandora performs when including de novo, and this is our bottom line, combining de novo and genotyping
Would it be possible to include a plot with:
y axis = recall: what % of sim-mutations are correctly present and genotyped in the VCF
x axis = error rate: what % of calls in the VCF are wrong
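One possible way to get the points for such a plot is sketched below. A single VCF gives only one (error rate, recall) point, so this sketch assumes each call carries some per-call confidence score and sweeps a threshold over it; that sweep, the `(confidence, is_correct)` tuples and the name `recall_error_points` are all assumptions for illustration and not something decided in this thread.

```python
from typing import List, Sequence, Tuple


def recall_error_points(calls: Sequence[Tuple[float, bool]],
                        n_simulated: int,
                        thresholds: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (error_rate, recall) pairs, one per confidence threshold.

    calls: (confidence, is_correct) for every call in the VCF (hypothetical input).
    n_simulated: total number of simulated mutations (the recall denominator).
    """
    if n_simulated <= 0:
        raise ValueError("n_simulated must be positive")
    points = []
    for threshold in thresholds:
        kept = [is_correct for conf, is_correct in calls if conf >= threshold]
        if not kept:
            continue  # no calls survive this threshold, so no point to plot
        n_correct = sum(kept)
        recall = n_correct / n_simulated                  # y axis
        error_rate = (len(kept) - n_correct) / len(kept)  # x axis
        points.append((error_rate, recall))
    return points


calls = [(10, True), (20, True), (5, False), (30, True)]
points = recall_error_points(calls, n_simulated=4, thresholds=[0, 10, 40])
```

The resulting pairs can be handed to any plotting library, with error rate on the x axis and recall on the y axis.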
small typo in definition of accuracy: This Calculation: TP+TN/TP+FP+TN+FN should be Calculation: TP/TP+FP+TN+FN i think
Hmm, ok. I was just going by the definition in Table 1 of the paper I quoted in the first comment.
I think it's easy to separate the impact of de novo as follows (making the assumption that we have ensured that all simulated SNPs are outside the PRG). Define candidates to mean the list of alleles that de novo generates.
What do you mean by "ensured that all simulated SNPs are outside the PRG"?
For de novo, define recall = % of simulated mutations where the mutant allele was included in the candidates; precision = % of slices where we perform de novo that include a simulated mutation within.
So that would involve mapping each probe in the reference panel to each candidate paths fasta file produced by de novo and ensuring the base within the probe that is the mutation maps without mismatch?
small typo in definition of accuracy: This Calculation: TP+TN/TP+FP+TN+FN should be Calculation: TP/TP+FP+TN+FN i think
Hmm, ok. I was just going by the definition in Table 1 of the paper I quoted in the first comment.
argh!! my mistake!
in this
I think it's easy to separate the impact of de novo as follows (making the assumption that we have ensured that all simulated SNPs are outside the PRG). Define candidates to mean the list of alleles that de novo generates.
What do you mean by "ensured that all simulated SNPs are outside the PRG"?
I just mean that if you simulate a path in the PRG and insert a new SNP that happens to be in the PRG already, then there is no de novo to do.
for this
For de novo, define recall = % of simulated mutations where the mutant allele was included in the candidates; precision = % of slices where we perform de novo that include a simulated mutation within.
So that would involve mapping each probe in the reference panel to each candidate paths fasta file produced by de novo and ensuring the base within the probe that is the mutation maps without mismatch?
I've got confused about what the ref panel is. I just meant: check whether the thing you simulated was one of the candidates.
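A minimal sketch of the de novo recall and precision described in this thread, taking "check whether the thing you simulated was one of the candidates" literally. The `SimulatedMutation` and `Slice` records, their fields, and the half-open slice coordinates are hypothetical choices for this example; how the simulated allele and the candidate alleles are actually represented is left open here.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Sequence


@dataclass(frozen=True)
class SimulatedMutation:
    locus: str
    position: int        # position of the simulated mutation within the locus
    mutant_allele: str   # allele sequence expected among the candidates


@dataclass(frozen=True)
class Slice:
    locus: str
    start: int                   # inclusive
    end: int                     # exclusive
    candidates: FrozenSet[str]   # alleles that de novo generated for this slice


def de_novo_recall(mutations: Sequence[SimulatedMutation],
                   slices: Sequence[Slice]) -> Optional[float]:
    """Fraction of simulated mutations whose mutant allele is among the candidates."""
    if not mutations:
        return None
    found = sum(
        1 for m in mutations
        if any(s.locus == m.locus and m.mutant_allele in s.candidates for s in slices)
    )
    return found / len(mutations)


def de_novo_precision(mutations: Sequence[SimulatedMutation],
                      slices: Sequence[Slice]) -> Optional[float]:
    """Fraction of de novo slices that contain at least one simulated mutation."""
    if not slices:
        return None
    hits = sum(
        1 for s in slices
        if any(m.locus == s.locus and s.start <= m.position < s.end for m in mutations)
    )
    return hits / len(slices)


mutations = [
    SimulatedMutation("geneA", 50, "ACGTT"),
    SimulatedMutation("geneB", 10, "GGCCA"),
]
slices = [
    Slice("geneA", 40, 60, frozenset({"ACGTT", "ACGAT"})),
    Slice("geneA", 200, 220, frozenset({"TTTTT"})),
    Slice("geneB", 0, 20, frozenset({"GGCCC"})),
]
recall = de_novo_recall(mutations, slices)
precision = de_novo_precision(mutations, slices)
```

These two numbers only say whether de novo is doing its job; the VCF-level metrics still measure the combined effect of de novo and genotyping.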
It is useful to have definitions for certain evaluation metrics and terms in the context of this project.
Inspiration for these metrics-of-interest comes from the paper "Best practices for evaluating single nucleotide variant calling methods for microbial genomics". Of particular interest for this discussion is Table 1 and Figure 3.