mbhall88 / head_to_head_pipeline

Snakemake pipelines to run the analysis for the Illumina vs. Nanopore comparison.
GNU General Public License v3.0
5 stars 2 forks source link

Concordance of baseline variant callers #38

Closed mbhall88 closed 4 years ago

mbhall88 commented 4 years ago

We will look at two different kinds of concordance:

  1. Positions where COMPASS says ALT, how often does bcftools on nanopore agree?
  2. At all positions where COMPASS "makes a statement", how often does bcftools on nanopore agree?
mbhall88 commented 4 years ago

In the TAC2 report we said that the way we will represent concordance is

The results for this subsection will show concordance of the basic Nanopore variant calling with the Illumina truth set. We will discuss how the different filters trialled affected the results and plot recall against concordance.

@iqbal-lab So, to clarify, recall will be Option 1 from above and concordance will be Option2?

mbhall88 commented 4 years ago

Regarding recall, there is an outstanding question of exactly how to calculate this.

a = COMPASS
b = bcftools

Firstly, we need to calculate the denominator in recall. T. We take all positions where a is not masked or fail-filter and is ALT. QUESTION: do we exclude positions where b is NULL, FAIL-FILTER, HET, and MISSING?

After T has been determined. the numerator is easy enough to determine: number of positions in T where b agrees with a.

mbhall88 commented 4 years ago

A similar question also remains for the calculation of concordance. i.e what sites do we exclude from contribution to concordance?

iqbal-lab commented 4 years ago

Answers

So, to clarify, recall will be Option 1 from above and concordance will be Option2?

Yes

For the recall denominator do we exclude positions where bcftools is NULL, FAIL-FILTER, HET, and MISSING?

yes to NULL, FAIL-FILTER, and MISSING yes to HET, but good to notice if it is there but het.

for concordance denominator

same

mbhall88 commented 4 years ago

yes to HET, but good to notice if it is there but het.

I slightly altered the output now so that I output the classification as before, but I also output the classification for each VCF at each position too. So we can shift our definitions of concordance/recall pretty easily now.

For example

34,REF,REF,TRUE_REF
35,REF,NULL,FALSE_NULL
mbhall88 commented 4 years ago

Here is an initial idea of concordance for different bcftools parameters and coverage (150x and 60x). This is without any bcftools filtering.

recall concordance
mada_112.60x.noindels.nobaq 99.7230% 99.9996%
mada_112.noindels.nobaq 99.7230% 99.9996%
mada_112.60x.noindels 99.8615% 99.9989%
mada_112.nobaq 99.7230% 99.9996%
mada_112.60x.nobaq 99.7230% 99.9996%
mada_112.60x 99.8615% 99.9989%
mada_112 100.0000% 99.9998%
mada_112.noindels 100.0000% 99.9998%

To be explicit, I link the lines of code that currently calculate recall and concordance and describe the exact way I have calculated them:

a = compass VCF b = bcftools VCF F = Dataframe containing 4 columns: position (F_p), a-classification (F_a), b-classification (F_b), overall-classification (F_o)

Recall:

alt_calls = F where F_a is ALT
valid_alts = alt_calls where F_o is not in {Both Fail Filter, A Fails Filter, Masked}
called_alts = valid_alts where F_o is TrueAlt
recall = |called_alts| / |valid_alts|

Concordance:

total_calls = F where F_a is in {REF, ALT}
valid_calls = total_calls where F_o is not in {Both Fail Filter, A Fails Filter, Masked}
correct_calls = valid_calls where F_o is in {True Alt, True Ref}
concordance = |correct_calls| / |valid_calls|

A more thorough understanding can also be gained from looking through the unit tests that test these two functions.


@iqbal-lab these definitions do differ slightly to what you suggested above. In particular, I penalise bcftools in the following scenarios:

The results should also answer the quesiton of which parameters we should use for variant calling all of the nanopore data [#18]. Whilst, the full dataset with default parameters and no-indels option scored perfect recall and near-perfect concordance, defaults took ~35 hours to run and used ~58Gb RAM, whilst the no-indel option (i.e. with BAQ) took ~3 hours and used ~58Gb RAM also. The no-BAQ option took ~20 hours and 3Gb RAM, whilst the no-BAQ and no-indel combination took ~30 mins and 70Mb RAM.

iqbal-lab commented 4 years ago

Those are pretty amazing results. I'll look more carefully over the day, but for now

compass makes a call; bcftools says NULL. If compass has made a call, then surely that suggests the data should be sufficient at this position to make a call?

No, compass is on illumina, bcftools is on nanopore

compass makes a call; bcftools fails FILTER. This one I was a little unsure about. In my mind, if we exempt positions where bcftools fails filtering, you could get perfect concordance by just filtering every position?

That's why you have two metrics

  1. call rate - what % of the truth sites does bcftools say anything
  2. concordance - how often is it right when it says something

compass makes a call; bcftools is missing the position. In the sample I've been working with, missing positions in bcftools are normally NULL in compass anyway

this will be seen in call rate/recall

compass makes a (non-het) call; bcftools makes a HET call. However, we use --ploidy 1 so we won't actually get het calls from bcftools

ok

mbhall88 commented 4 years ago

That's why you have two metrics

call rate - what % of the truth sites does bcftools say anything concordance - how often is it right when it says something

Those definitions are quite different to https://github.com/mbhall88/head_to_head_pipeline/issues/38#issue-659929150 ...

iqbal-lab commented 4 years ago

Yes, you are right. I've just written and deleted a complicated comment comparing all these definitions. Let me suggest the following definitions which are an extension of what we have, and also have the benefit of being standard terms

Call rate (for true ALTs)

what % of true ALTs does bcftools make a REF/ALT call - anything except NULL and FILTER

Concordance (for true ALTs)

what % of true ALTs does bcftools genotype agree with truth (excludes NULL, FILTER, and HET)

Genomewide Call rate

what % of true REF or ALT positions does bcftools make a REF/ALT call - anything except NULL and FILTER

Genomewide Concordance

what % of true REF or ALT positions does bcftools genotype agree with truth (excludes NULL, FILTER, and HET)

This includes the previous definitions, and (I think) clarifies and extends. Concordance for true ALTs could be considered a "recall" of non-ref variants.

So, if bcftools is calling lots of NULLs we will see it in the call rate, and if it is getting genotypes wrong, we see it in concordance. The first concordance is about the 1000 or so non-ref positions; the second is dominated by loads of REF calls.

FYI - COMPASS aims to have a genomewide error rate of <1 per genome, and is willing to pay the price in call rate - they are happy to make calls in a small % of the genome at the price of ~all calls being right. So for bcftools, one would be aiming to have a super-high genome-wide concordance.

mbhall88 commented 4 years ago

Great! Thanks, this makes a lot more sense now. I am assuming missing positions are classed the same as NULL/FILTER?

iqbal-lab commented 4 years ago

Yes, missing equals Null in my book. What else could it be? Oh...not in the vcf? Yrs, that too

mbhall88 commented 4 years ago

Yes, missing equals Null in my book. What else could it be? Oh...not in the vcf? Yrs, that too

Yeah, not in VCF. I.e. I guess minimap2 couldn't produce any primary alignments for that part of the genome.

mbhall88 commented 4 years ago

Updated results with new definitions (no filtering):

call_rate concordance gw_call_rate gw_concordance
mada_112.60x.noindels.nobaq 100.000000% 99.722992% 99.999975% 99.999582%
mada_112.noindels.nobaq 100.000000% 99.722992% 100.000000% 99.999558%
mada_112.60x.noindels 100.000000% 99.861496% 99.999189% 99.999730%
mada_112.nobaq 100.000000% 99.722992% 100.000000% 99.999558%
mada_112.60x.nobaq 100.000000% 99.722992% 99.999975% 99.999582%
mada_112.60x 100.000000% 99.861496% 99.999189% 99.999730%
mada_112 100.000000% 100.000000% 99.999975% 99.999779%
mada_112.noindels 100.000000% 100.000000% 99.999975% 99.999779%
mbhall88 commented 4 years ago

Current plots for concordance and call rate with depth are in this report: https://github.com/mbhall88/head_to_head_pipeline/blob/400a4c1ff81b06d22626f51224c6dd00df1b1a8a/analysis/baseline_variants/report.html

mbhall88 commented 4 years ago

After filtering refinement using varifier (see https://github.com/mbhall88/head_to_head_pipeline/issues/39#issuecomment-673898811) we have a good checkpoint for the concordance at c645cd8. The report is uploaded to that commit along with all of the plots. A couple are pasted below.

ALT concordance

alt_concordance

Genome-wide concordance

gw_concordance