mbhall88 / tbpore

Mycobacterium tuberculosis genomic analysis from Nanopore sequencing data
MIT License

Threshold for the phred-scaled VCF call quality #22

Closed · leoisl closed this issue 2 years ago

leoisl commented 2 years ago

Problem overview

In tbpore, we downsample the read depth to 150x and we have a lower bound of 85 for the phred-scaled VCF call quality, which accepts only about 1 error in 300 million (this is an oversimplification of the quality score, as other data also affect it, but it is a good approximation). This causes us to flag, and thus discard, all non-perfect calls as low quality. We define a perfect call as one where all reads agree with the consensus, and a non-perfect call as any other: even if a single read out of ~150 or even 1000 disagrees with the consensus, we flag the call as low quality, because our quality threshold accepts only ~1 error in 300 million. In practice, we see perfect calls with phred-scaled quality between 250 and 350, and quasi-perfect calls (the ones with a single read disagreeing with the consensus) between 20 and 30; the exact values depend on how many reads overlap the locus and other information. The issue is that hundreds of thousands of quasi-perfect calls get filtered out, and each becomes an N in the consensus sequence. When we cluster based on the consensus sequence, we might overcluster, as we consider that an N equals any base: if we compare two samples where one calls an alt and the other calls an N, but that N was due to a quasi-perfect ref call, we will not take this difference into consideration, whereas we should.
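As a quick reference for the thresholds discussed here (my own back-of-the-envelope check, not tbpore code): a phred-scaled quality Q corresponds to an error probability of 10^(-Q/10), so QUAL 20 accepts roughly 1 error in 100 and QUAL 85 roughly 1 error in 300 million:

```shell
# Convert phred-scaled qualities to error probabilities: P(error) = 10^(-Q/10)
for q in 20 25 85 250; do
    awk -v q="$q" 'BEGIN { printf "QUAL %3d -> P(error) = %g\n", q, 10^(-q/10) }'
done
```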

Data

I calculated the mean and max number of calls the H2H pipeline discards because they are tagged as low quality (the lq flag) while still having an "acceptable" quality (phred-scaled VCF call quality >= 20, i.e. accepting at most 1 error in 100).

For Nanopore, we have ~767k VCF calls (bases) on average per sample (and a max of 1158k calls) that are filtered out solely due to being lq while having quality >= 20. Note that these calls have no other flag, i.e. if we reduce the quality threshold from 85 to 20, we will fish back all of these calls:

$ for f in `find /hps/nobackup/iqbal/mbhall/tech_wars/analysis/baseline_variants/nanopore/filtered_snps/madagascar/ -name *.bcf`; do bcftools view $f | awk '$6>=20 && $7=="lq" ' | wc -l; done | datamash mean 1 max 1
767632.54945055 1158679

If we reduce the quality threshold down to 20, we would instead filter out just ~1.72 calls per sample solely due to being lq (157 calls across all 91 samples):

$ for f in `find /hps/nobackup/iqbal/mbhall/tech_wars/analysis/baseline_variants/nanopore/filtered_snps/madagascar/ -name *.bcf`; do bcftools view $f | awk '$6<20 && $7=="lq" ' | wc -l; done | datamash mean 1 max 1
1.7252747252747 34

These are the 3 lq calls with the highest quality that we would still ignore:

NC_000962.3 3711723 .   A   .   19.695  lq  DP=59;ADF=26;ADR=21;SCR=18;VDB=0.112489;SGB=-0.590765;RPBZ=-2.43914;MQBZ=4.21187;MQSBZ=1.66428;BQBZ=-1.38604;SCBZ=-0.576347;FS=0;MQ0F=0.610169;AN=1;DP4=26,21,1,4;MQ=4  GT:SP   0:7
NC_000962.3 2635661 .   G   .   19.6942 lq  DP=150;ADF=73;ADR=29;SCR=97;VDB=0.18;SGB=-0.453602;RPBZ=0.698446;MQBZ=2.87256;MQSBZ=-1.63911;BQBZ=1.45854;SCBZ=-1.15867;FS=0;MQ0F=0.233333;AN=1;DP4=73,29,1,1;MQ=2  GT:SP   0:3
NC_000962.3 3711721 .   T   .   19.377  lq  DP=75;ADF=33;ADR=23;SCR=20;VDB=0.116687;SGB=-0.511536;RPBZ=-2.82962;MQBZ=3.38282;MQSBZ=1.801;BQBZ=0.570723;SCBZ=1.42944;FS=0;MQ0F=0.48;AN=1;DP4=33,23,2,1;MQ=3  GT:SP   0:0

The second one has 102 reads calling ref and only 2 calling alt.
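Those read counts can be checked by parsing the DP4 field of the second record (a throwaway one-liner of mine; DP4 = ref-fwd,ref-rev,alt-fwd,alt-rev):

```shell
# Sum the forward and reverse strand counts from DP4 to get ref/alt read totals
echo "DP4=73,29,1,1" | awk -F'[=,]' '{ printf "ref reads = %d, alt reads = %d\n", $2+$3, $4+$5 }'
```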

We can't fairly compare the stats above with the Illumina calls. I don't know much about the H2H pipeline, but it seems the Illumina VCFs were not generated by us but by a third party using the COMPASS pipeline. Anyway, I did not find the same flags in the ONT and Illumina VCFs, but I think the Illumina VCFs are filtered with a phred-scaled VCF call quality >= 25, due to this header:

##FILTER=<ID=S25,Number=0,Type=Float,Description="Base not called because min SNP quality less than 25">

If we assume this, far fewer calls are filtered out (~255 per sample on average) solely due to being low quality (note that here we are not even requiring the quality to be >= 20, although we already know it is < 25):

$ for f in `find /hps/nobackup/iqbal/mbhall/tech_wars/analysis/baseline_variants/illumina/gvcfs/ -name mada*.vcf.gz`; do zcat $f | awk '$7=="S25"' | wc -l; done | datamash mean 1 max 1
255.61016949153 5413

Conclusion

Just by looking at the Nanopore stats, it seems we might be filtering way too much (on average ~17.4% of the genome). If, when comparing two samples, an alt call in one matches an imperfect ref call in the other, or we have an imperfect alt call in one of the samples, this difference will not be counted, and thus we might overcluster our samples. A possible solution is simply to decrease the lower bound for the phred-scaled VCF call quality from 85 to 20 and evaluate the result. The issue is that I don't know how to evaluate whether the results are better or not. 20 might also be too lenient a value, as we would then discard just ~2 calls per sample. Maybe we should use the same value that COMPASS uses, 25? We need a way to evaluate results and do a parameter sweep to find the best threshold. I am wondering whether rerunning the H2H pipeline would help us infer whether results are better.
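For reference, the ~17.4% figure follows from the numbers above (my arithmetic, assuming the H37Rv reference length of 4,411,532 bp):

```shell
# Mean discarded calls per sample divided by the H37Rv genome length
awk 'BEGIN { printf "%.1f%%\n", 100 * 767632.54945055 / 4411532 }'
```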

iqbal-lab commented 2 years ago

We can't use the same threshold as COMPASS; there's no reason to believe it will give the same behaviour, as the data is totally different.

iqbal-lab commented 2 years ago

Re how to evaluate: Leandro, we just treat the Illumina data as truth and want to reproduce its clusters.

There is some info here: https://github.com/mbhall88/tbpore/issues/16#issuecomment-1120057184. Michael talks there about matching the genetic distances. I think we should go straight to the clusters and regenerate figure 4 here: https://www.medrxiv.org/content/10.1101/2022.03.04.22271870v1.full-text

iqbal-lab commented 2 years ago

Basically, I would just reproduce this: https://github.com/mbhall88/head_to_head_pipeline/blob/master/analysis/transmission_clustering/eda/clustering.ipynb

It compares the Illumina clusters with H2H; just replace H2H with our data.

leoisl commented 2 years ago

I am back at this and will start adding data here. Some of it might not be useful, but it is better to document it than to lose it. I just counted how many Ns we have in the COMPASS consensus of the 151 samples, and in H2H (keeping in mind that H2H == tbpore now):

For compass:

(base) codon-login-03:/hps/nobackup/iqbal/mbhall/tech_wars/analysis/baseline_variants/consensus
$ assembly-stats compass/compass.consensus.fa
N_count = 55922954

For H2H:

(base) codon-login-03:/hps/nobackup/iqbal/mbhall/tech_wars/analysis/baseline_variants/consensus
$ assembly-stats bcftools/bcftools.consensus.fa 
N_count = 151393909

Sure enough, many of these Ns are due to the mask. The mask size is 324971 bp, and 324971 bp * 151 samples = 49070621 bases masked, which means:
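A continuation of that arithmetic (my own subtraction, assuming every masked base shows up as an N in every sample): removing the masked bases from each N_count leaves the Ns attributable to filtering rather than the mask.

```shell
# Subtract the masked bases from each consensus N_count
awk 'BEGIN {
    mask = 324971 * 151   # 49070621 masked bases across 151 samples
    printf "compass non-mask Ns: %d (%.0f per sample)\n", 55922954 - mask, (55922954 - mask) / 151
    printf "tbpore  non-mask Ns: %d (%.0f per sample)\n", 151393909 - mask, (151393909 - mask) / 151
}'
```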

leoisl commented 2 years ago

Back at this with a different approach: we will just change filters and see the impact on the clusters. Reducing the min_qual filter from 85 to 15 made us undercluster:

image

Will try changing the thresholding params to see what we get

leoisl commented 2 years ago

It gets much better when changing the thresholds to 6 and 12, but it is still not identical to Illumina:

image

iqbal-lab commented 2 years ago

Not identical to Illumina, but better than H2H on all dimensions! SACR, SACP, XCR all better!

iqbal-lab commented 2 years ago

Cc @mbhall88

mbhall88 commented 2 years ago

SACR for threshold 12 is worse than H2H (1.0). But otherwise fantastic!

What is the precision/recall change? i.e. Figure 2?

leoisl commented 2 years ago

I am almost managing to reproduce the baseline_variants H2H pipeline in my codon dir/env; only a single job failed, which I will check. Then I will be able to get precision/recall. This was just a first test, though... I think min_qual 15 is way too low... I will rebuild the clusters with a less lenient threshold.

PS: in the paper we have 1.0 SACR; we can get this with min_qual 15 by increasing the second ONT threshold from 12 to 15. The other metrics are also better than the ones in the paper, but just slightly. I will try to get more improvements by playing with the filters:

image

leoisl commented 2 years ago

I got identical clustering results by using min_depth 5 (the paper uses min_depth 0), min_qual 25 (the paper uses 85; the previous run used 15), and min_mq 30 (mapping quality; the paper uses 0). These are the filters I interpreted COMPASS to use (I know the data is different, etc., but these are very reasonable filters). I think I will dive into producing the other plots of the paper to better check whether these filter changes actually improve results on other metrics (e.g. precision/recall). I could also try to understand and debug the actual difference between the Illumina and Nanopore clusters (i.e. why is this sample S added to the Nanopore cluster, but not the Illumina one):

image

This debugging would be to locate this sample in the Illumina data, see what its distance to the cluster-2 samples is, and manually check the SNPs between each pair of samples to understand why Nanopore overclusters, etc.

What I would need from you is which direction to follow, as I am quite new to this work.

iqbal-lab commented 2 years ago

i agree, let's look at precision/recall - great work!

leoisl commented 2 years ago

> i agree, let's look at precision/recall - great work!

Ok!

leoisl commented 2 years ago

I got interested in analysing the previous case just after I posted that comment, so I will just document it here; it might not lead anywhere.

I got interested to know which sample this S is and why it got clustered in Nanopore: image

This is the illumina graph (2 samples, 0 distance between them): image

This is the ONT graph (3 samples):

image

The main difference is sample R21408 being added to this cluster. I checked the distance between R21408 and R20574 in the COMPASS matrix, and it is 13, while we have 6 with ONT. I then compared R21408, R20574 and H37Rv base by base:

1. For COMPASS: comparing R20574.consensus.compass.fa, R21408.consensus.compass.fa and h37rv.fa.gz.

   Bases taken into account:

   (alt, ref) = 6
   (ref, alt) = 7

   thus the distance of 13.

   Ignored bases (because one of them had an N):

   (N, alt) = 1
   (N, ref) = 1337
   (alt, N) = 14
   (ref, N) = 21155

2. For tbpore/H2H: comparing R20574.consensus.tbpore.fa, R21408.consensus.tbpore.fa and h37rv.fa.gz.

   Bases taken into account:

   (alt, ref) = 6

   thus the distance of 6; the 6 (alt, ref) differences are identical to COMPASS's.

   Ignored bases (because one of them had an N):

   (N, alt) = 4
   (N, ref) = 3577
   (alt, N) = 134
   (ref, N) = 4662

tbpore/H2H misses all 7 (ref, alt) differences between R20574 and R21408: those 7 alts in R21408 became Ns.
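The N-handling described above can be sketched with a toy example (hypothetical 10 bp sequences of mine, not the real consensus files): positions where either sample has an N are ignored, so alts that become Ns silently shrink the pairwise distance.

```shell
# Compare two consensus strings position by position, skipping any position
# where either sample has an N, and count the remaining disagreements
awk 'BEGIN {
    a = "ACGTNCGTAN"; b = "ACCTACGTTN"
    for (i = 1; i <= length(a); i++) {
        x = substr(a, i, 1); y = substr(b, i, 1)
        if (x == "N" || y == "N") { ignored++; continue }
        if (x != y) dist++
    }
    printf "distance = %d, ignored = %d\n", dist, ignored
}'
```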

leoisl commented 2 years ago

I updated the previous comment: I checked the 7 alts in R21408 that became Ns. All were basically filtered out due to low FRS (the threshold is currently set to 90%):

NC_000962.3 853668  .   C   T   69.8715 frs DP=51;ADF=5,23;ADR=5,4;SCR=42;VDB=0.85833;SGB=-0.693021;RPBZ=-1.52241;BQBZ=1.63105;SCBZ=-0.275267;FS=0;MQ0F=0;AC=1;AN=1;DP4=5,5,23,4;MQ=60  GT:PL:SP    1:179,82:14
NC_000962.3 1195464 .   G   C   91.9508 frs DP=48;ADF=3,9;ADR=2,17;SCR=45;VDB=0.623966;SGB=-0.692976;RPBZ=-1.37012;BQBZ=-2.53249;SCBZ=-0.837343;FS=0;MQ0F=0;AC=1;AN=1;DP4=3,2,9,17;MQ=60    GT:PL:SP    1:156,37:5
NC_000962.3 1975798 .   C   A   24.6074 lq;frs  DP=53;ADF=5,7;ADR=6,15;SCR=41;VDB=0.709921;SGB=-0.692562;RPBZ=-0.97432;BQBZ=-0.957621;FS=0;MQ0F=0;AC=1;AN=1;DP4=5,6,7,15;MQ=60  GT:PL:SP    1:160,108:3
NC_000962.3 2206637 .   C   T   131.953 frs DP=62;ADF=4,22;ADR=8,20;SCR=54;VDB=0.537781;SGB=-0.693146;RPBZ=0.218528;BQBZ=1.48496;SCBZ=0.763027;FS=0;MQ0F=0;AC=1;AN=1;DP4=4,8,22,20;MQ=60    GT:PL:SP    1:255,96:5
NC_000962.3 2646501 .   A   G   82.893  frs DP=49;ADF=4,11;ADR=3,20;SCR=39;VDB=0.579948;SGB=-0.69311;RPBZ=0.18835;BQBZ=-2.60436;SCBZ=0.989511;FS=0;MQ0F=0;AC=1;AN=1;DP4=4,3,11,20;MQ=60 GT:PL:SP    1:173,63:4
NC_000962.3 2818484 .   A   C   163.061 frs DP=62;ADF=1,19;ADR=7,24;SCR=55;VDB=0.990162;SGB=-0.693146;RPBZ=-1.82649;BQBZ=-0.597542;SCBZ=0.388718;FS=0;MQ0F=0;AC=1;AN=1;DP4=1,7,19,24;MQ=60  GT:PL:SP    1:201,11:9
NC_000962.3 4354953 .   G   C   204.177 frs DP=74;ADF=4,24;ADR=3,27;SCR=62;VDB=0.821578;SGB=-0.693147;RPBZ=-0.155212;BQBZ=0.824997;SCBZ=-1.67087;FS=0;MQ0F=0;AC=1;AN=1;DP4=4,3,24,27;MQ=60  GT:PL:SP    1:231,0:2

The frs of these 7 calls are:

DP4=5,5,23,4 = 27/37 = 72.9%
DP4=3,2,9,17 = 26/31 = 83.8%
DP4=5,6,7,15 = 22/33 = 66.6%
DP4=4,8,22,20 = 42/54 = 77.7%
DP4=4,3,11,20 = 31/38 = 81.5%
DP4=1,7,19,24 = 43/51 = 84.3%
DP4=4,3,24,27 = 51/58 = 87.9%

I think we can get closer clusters by setting the FRS threshold to 80% (doing it now). This would allow us to fish back 4 out of the 7 missing alts in R21408, which would increase its distance to the two other samples and remove it from the cluster. Do you think this value would be ok in general, or is it too lenient?
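The FRS values above can be recomputed directly from the DP4 fields (my shorthand, not pipeline code; here FRS is the fraction of reads supporting the alt, with DP4 = ref-fwd,ref-rev,alt-fwd,alt-rev):

```shell
# Recompute FRS = (alt-fwd + alt-rev) / (all four DP4 counts) for each call
printf '%s\n' 5,5,23,4 3,2,9,17 5,6,7,15 4,8,22,20 4,3,11,20 1,7,19,24 4,3,24,27 \
  | awk -F, '{ alt = $3 + $4; tot = $1 + $2 + alt
               printf "DP4=%s -> FRS = %d/%d = %.1f%%\n", $0, alt, tot, 100 * alt / tot }'
```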

leoisl commented 2 years ago

Setting frs down to 80% did not work well:

image

leoisl commented 2 years ago

> i agree, let's look at precision/recall - great work!

Moving to this now

leoisl commented 2 years ago

Summarising the previous messages: this seems to be the best clustering we got by playing with filters. It can be obtained by using either 1) min_qual 15, or 2) min_depth 5, min_qual 25 and min_mq 30. I prefer option 2 as it is more strict:

image

leoisl commented 2 years ago

Outdated