Closed mbhall88 closed 4 years ago
In the TAC2 report we said that the way we will represent concordance is
The results for this subsection will show concordance of the basic Nanopore variant calling with the Illumina truth set. We will discuss how the different filters trialled affected the results and plot recall against concordance.
@iqbal-lab So, to clarify, recall will be Option 1 from above and concordance will be Option2?
Regarding recall, there is an outstanding question of exactly how to calculate this.
a = COMPASS
b = bcftools
Firstly, we need to calculate the denominator in recall. T. We take all positions where a is not masked or fail-filter and is ALT. QUESTION: do we exclude positions where b is NULL, FAIL-FILTER, HET, and MISSING?
After T has been determined. the numerator is easy enough to determine: number of positions in T where b agrees with a.
A similar question also remains for the calculation of concordance. i.e what sites do we exclude from contribution to concordance?
Answers
So, to clarify, recall will be Option 1 from above and concordance will be Option2?
Yes
For the recall denominator do we exclude positions where bcftools is NULL, FAIL-FILTER, HET, and MISSING?
yes to NULL, FAIL-FILTER, and MISSING yes to HET, but good to notice if it is there but het.
for concordance denominator
same
yes to HET, but good to notice if it is there but het.
I slightly altered the output now so that I output the classification as before, but I also output the classification for each VCF at each position too. So we can shift our definitions of concordance/recall pretty easily now.
For example
34,REF,REF,TRUE_REF
35,REF,NULL,FALSE_NULL
Here is an initial idea of concordance for different bcftools
parameters and coverage (150x and 60x). This is without any bcftools filtering.
recall | concordance | |
---|---|---|
mada_112.60x.noindels.nobaq | 99.7230% | 99.9996% |
mada_112.noindels.nobaq | 99.7230% | 99.9996% |
mada_112.60x.noindels | 99.8615% | 99.9989% |
mada_112.nobaq | 99.7230% | 99.9996% |
mada_112.60x.nobaq | 99.7230% | 99.9996% |
mada_112.60x | 99.8615% | 99.9989% |
mada_112 | 100.0000% | 99.9998% |
mada_112.noindels | 100.0000% | 99.9998% |
To be explicit, I link the lines of code that currently calculate recall and concordance and describe the exact way I have calculated them:
a = compass VCF b = bcftools VCF F = Dataframe containing 4 columns: position (F_p), a-classification (F_a), b-classification (F_b), overall-classification (F_o)
Recall:
alt_calls = F where F_a is ALT
valid_alts = alt_calls where F_o is not in {Both Fail Filter, A Fails Filter, Masked}
called_alts = valid_alts where F_o is TrueAlt
recall = |called_alts| / |valid_alts|
Concordance:
total_calls = F where F_a is in {REF, ALT}
valid_calls = total_calls where F_o is not in {Both Fail Filter, A Fails Filter, Masked}
correct_calls = valid_calls where F_o is in {True Alt, True Ref}
concordance = |correct_calls| / |valid_calls|
A more thorough understanding can also be gained from looking through the unit tests that test these two functions.
@iqbal-lab these definitions do differ slightly to what you suggested above. In particular, I penalise bcftools in the following scenarios:
--ploidy 1
so we won't actually get het calls from bcftoolsThe results should also answer the quesiton of which parameters we should use for variant calling all of the nanopore data [#18]. Whilst, the full dataset with default parameters and no-indels option scored perfect recall and near-perfect concordance, defaults took ~35 hours to run and used ~58Gb RAM, whilst the no-indel option (i.e. with BAQ) took ~3 hours and used ~58Gb RAM also. The no-BAQ option took ~20 hours and 3Gb RAM, whilst the no-BAQ and no-indel combination took ~30 mins and 70Mb RAM.
Those are pretty amazing results. I'll look more carefully over the day, but for now
No, compass is on illumina, bcftools is on nanopore
That's why you have two metrics
this will be seen in call rate/recall
ok
That's why you have two metrics
call rate - what % of the truth sites does bcftools say anything concordance - how often is it right when it says something
Those definitions are quite different to https://github.com/mbhall88/head_to_head_pipeline/issues/38#issue-659929150 ...
Yes, you are right. I've just written and deleted a complicated comment comparing all these definitions. Let me suggest the following definitions which are an extension of what we have, and also have the benefit of being standard terms
what % of true ALTs does bcftools make a REF/ALT call - anything except NULL and FILTER
what % of true ALTs does bcftools genotype agree with truth (excludes NULL, FILTER, and HET)
what % of true REF or ALT positions does bcftools make a REF/ALT call - anything except NULL and FILTER
what % of true REF or ALT positions does bcftools genotype agree with truth (excludes NULL, FILTER, and HET)
This includes the previous definitions, and (I think) clarifies and extends. Concordance for true ALTs could be considered a "recall" of non-ref variants.
So, if bcftools is calling lots of NULLs we will see it in the call rate, and if it is getting genotypes wrong, we see it in concordance. The first concordance is about the 1000 or so non-ref positions; the second is dominated by loads of REF calls.
FYI - COMPASS aims to have a genomewide error rate of <1 per genome, and is willing to pay the price in call rate - they are happy to make calls in a small % of the genome at the price of ~all calls being right. So for bcftools, one would be aiming to have a super-high genome-wide concordance.
Great! Thanks, this makes a lot more sense now. I am assuming missing positions are classed the same as NULL/FILTER?
Yes, missing equals Null in my book. What else could it be? Oh...not in the vcf? Yrs, that too
Yes, missing equals Null in my book. What else could it be? Oh...not in the vcf? Yrs, that too
Yeah, not in VCF. I.e. I guess minimap2 couldn't produce any primary alignments for that part of the genome.
Updated results with new definitions (no filtering):
call_rate | concordance | gw_call_rate | gw_concordance | |
---|---|---|---|---|
mada_112.60x.noindels.nobaq | 100.000000% | 99.722992% | 99.999975% | 99.999582% |
mada_112.noindels.nobaq | 100.000000% | 99.722992% | 100.000000% | 99.999558% |
mada_112.60x.noindels | 100.000000% | 99.861496% | 99.999189% | 99.999730% |
mada_112.nobaq | 100.000000% | 99.722992% | 100.000000% | 99.999558% |
mada_112.60x.nobaq | 100.000000% | 99.722992% | 99.999975% | 99.999582% |
mada_112.60x | 100.000000% | 99.861496% | 99.999189% | 99.999730% |
mada_112 | 100.000000% | 100.000000% | 99.999975% | 99.999779% |
mada_112.noindels | 100.000000% | 100.000000% | 99.999975% | 99.999779% |
Current plots for concordance and call rate with depth are in this report: https://github.com/mbhall88/head_to_head_pipeline/blob/400a4c1ff81b06d22626f51224c6dd00df1b1a8a/analysis/baseline_variants/report.html
After filtering refinement using varifier
(see https://github.com/mbhall88/head_to_head_pipeline/issues/39#issuecomment-673898811) we have a good checkpoint for the concordance at c645cd8. The report is uploaded to that commit along with all of the plots. A couple are pasted below.
We will look at two different kinds of concordance:
bcftools
on nanopore agree?bcftools
on nanopore agree?