Closed: leoisl closed this issue 2 years ago
We can't use the same threshold as compass; there's no reason to believe it will give the same behaviour, as the data is totally different.
Re how to evaluate: Leandro, we just treat the Illumina data as true, and want to reproduce their clusters.

There is some info here: https://github.com/mbhall88/tbpore/issues/16#issuecomment-1120057184 — Michael talks there about matching the genetic distances. I think we should just go straight to the clusters, and regenerate figure 4 here: https://www.medrxiv.org/content/10.1101/2022.03.04.22271870v1.full-text

Basically, I would just reproduce this: https://github.com/mbhall88/head_to_head_pipeline/blob/master/analysis/transmission_clustering/eda/clustering.ipynb

It compares the Illumina clusters with H2H; just replace H2H with our data.
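The clustering in that notebook is essentially single-linkage on the SNP distance matrix: any two samples whose distance is at or below the threshold get connected, and clusters are the connected components. A minimal sketch of that idea (sample names and distances below are hypothetical, not from the real matrix):

```python
def cluster(distances, threshold):
    """Single-linkage clustering: samples whose pairwise SNP distance is
    <= threshold end up in the same cluster (connected components)."""
    # collect every sample mentioned in the distance dict
    samples = sorted({s for pair in distances for s in pair})
    parent = {s: s for s in samples}  # union-find forest

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    for (a, b), d in distances.items():
        if d <= threshold:
            parent[find(a)] = find(b)  # merge the two components

    clusters = {}
    for s in samples:
        clusters.setdefault(find(s), []).append(s)
    return sorted(sorted(c) for c in clusters.values())

# hypothetical pairwise SNP distances
dists = {("A", "B"): 4, ("B", "C"): 20, ("C", "D"): 3, ("A", "C"): 25}
print(cluster(dists, threshold=6))  # [['A', 'B'], ['C', 'D']]
```

Changing the threshold (e.g. the 6 and 12 tried below) just adds or removes edges before taking components, which is why a small threshold change can split or merge whole clusters.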
I am back at this, and will start adding data here. Some of it might not be useful, but it is better to document it than lose it. I just counted how many `N`s we have in the compass consensus on the 151 samples, and on H2H (keeping in mind H2H == tbpore now).

For compass:
```
(base) codon-login-03:/hps/nobackup/iqbal/mbhall/tech_wars/analysis/baseline_variants/consensus
$ assembly-stats compass/compass.consensus.fa
N_count = 55922954
```
For H2H:

```
(base) codon-login-03:/hps/nobackup/iqbal/mbhall/tech_wars/analysis/baseline_variants/consensus
$ assembly-stats bcftools/bcftools.consensus.fa
N_count = 151393909
```
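The `N_count` numbers from assembly-stats can be cross-checked by scanning the FASTA directly. A minimal sketch (the toy input below is illustrative; the real check would iterate over the consensus files above):

```python
def count_ns(fasta_lines):
    """Count N bases across all sequences in a FASTA, skipping headers.
    Case-insensitive, since consensus callers may emit lowercase bases."""
    return sum(
        line.upper().count("N")
        for line in fasta_lines
        if not line.startswith(">")
    )

# toy example; for the real numbers, pass open("bcftools/bcftools.consensus.fa")
fasta = [">sample1\n", "ACGTNNACGT\n", ">sample2\n", "NNNACGTnn\n"]
print(count_ns(fasta))  # 7
```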
Sure enough, there are lots of calls here that are `N` because of the mask. The mask size is 324,971 bp × 151 samples = 49,070,621 bases masked, which means:

- compass discards 55,922,954 − 49,070,621 = 6,852,333 (~6.8M) bases/calls due to filters;
- H2H/tbpore discards 151,393,909 − 49,070,621 = 102,323,288 (~102.3M) bases/calls due to filters.

From the previous post, we know that ~116M calls are discarded due to being solely `lq` (obviously many of these overlap the masks, as 116M > 102M).

Back at this with a different approach. We will just change the filters and see the impact on the clusters. Reducing the `min_qual` filter from 85 to 15 made us undercluster:
Will try changing the thresholding params to see what we get.

It gets much better when changing the thresholds to 6 and 12, but still not identical to Illumina:

Not identical to Illumina, but better than H2H on all dimensions! SACR, SACP, XCR all better!
Cc @mbhall88
SACR for threshold 12 is worse than H2H (1.0). But otherwise fantastic!
What is the precision/recall change? i.e. Figure 2?
I am almost managing to reproduce the baseline_variants H2H pipeline in my codon dir/env; only a single job failed, which I will check. Then I will be able to get precision/recall, but this was just a first test... I think `min_qual 15` is way too low... Will rebuild the clusters with a less lenient threshold.

PS: in the paper we have 1.0 SACR; we can get this with `min_qual 15` by increasing the second ONT threshold from 12 to 15. The other metrics are also slightly better than the ones in the paper. Will try to get more improvements by playing with the filters:
I got identical clustering results by using `min_depth 5` (paper is `min_depth 0`), `min_qual 25` (paper is 85, previous run was 15), and `min_mq 30` (mapping quality, paper is 0). These are the filters I interpreted that compass uses (I know the data is different, etc., but these are very reasonable filters). I think I will dive into producing the other plots of the paper to better debug whether these filter changes are actually improving results on other metrics (e.g. precision/recall). I could also try to understand and debug the actual difference between the Illumina and nanopore clusters (i.e. why is this sample S added to the nanopore cluster, but not to the Illumina one):

This debugging would be to locate this sample in the Illumina data, see its distance to the cluster-2 samples, manually check the SNPs between each pair of samples to understand why nanopore overclusters, etc...

What I would need from you is which direction to follow, as I am quite new to this work.
i agree, let's look at precision/recall - great work!
Ok!
I got interested in analysing the previous case just after I posted that comment, so will just document it here; it might not lead anywhere. I wanted to know which sample this S is and why it got clustered in nanopore:
This is the Illumina graph (2 samples, 0 distance between them):
This is the ONT graph (3 samples):
The main difference is sample `R21408` being added to this cluster. I checked the distance between `R21408` and `R20574` in the compass matrix and it is 13, while we have 6 with ONT. I then compared `R21408`, `R20574` and `H37Rv` base by base:
For compass:

```
(alt, ref) = 6
(ref, alt) = 7
```

thus the distance of 13. Ignored bases (because one of them had an `N`):

```
(N, alt) = 1
(N, ref) = 1337
(alt, N) = 14
(ref, N) = 21155
```
For ONT:

```
(alt, ref) = 6
```

thus the distance of 6; the 6 `(alt, ref)` differences are identical to compass's. Ignored bases (because one of them had an `N`):

```
(N, alt) = 4
(N, ref) = 3577
(alt, N) = 134
(ref, N) = 4662
```
tbpore/H2H misses all 7 of the `(ref, alt)` differences between `R20574` and `R21408`: the 7 `alt`s in `R21408` became `N`s.
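The base-by-base comparison above can be sketched as follows: walk two consensus sequences along the reference, classify each position, and exclude from the distance any position where either sample has an `N` (which is exactly what lets the `N`s hide real differences). The toy sequences below are illustrative; the real input would be the per-sample consensus FASTAs:

```python
from collections import Counter

def classify(base, ref_base):
    """Label a consensus base relative to the reference."""
    if base == "N":
        return "N"
    return "alt" if base != ref_base else "ref"

def compare(ref, s1, s2):
    """Tally (s1, s2) base classes per position and compute the SNP
    distance, ignoring positions where either sample is N.
    Note: (alt, alt) pairs with differing alleles would also be SNPs;
    they are treated as agreement here for simplicity."""
    counts = Counter()
    for r, a, b in zip(ref, s1, s2):
        counts[(classify(a, r), classify(b, r))] += 1
    # distance counts only positions where exactly one sample differs
    # from the reference and neither is N
    dist = counts[("alt", "ref")] + counts[("ref", "alt")]
    return counts, dist

ref = "ACGTACGTAC"
s1  = "ACGAACGTNC"  # one alt (pos 3), one N (pos 8)
s2  = "ACGTACCTAC"  # one alt (pos 6)
counts, dist = compare(ref, s1, s2)
print(dist)  # 2
```

If `s1`'s alt at position 3 had instead been called `N` (as happened to the 7 alts in `R21408`), the distance would drop to 1, which is the overclustering mechanism described above.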
I updated the previous comment: I checked the 7 `alt`s in `R21408` that became `N`s. All were filtered out due to low `frs` (currently set to 90%):
```
NC_000962.3 853668 . C T 69.8715 frs DP=51;ADF=5,23;ADR=5,4;SCR=42;VDB=0.85833;SGB=-0.693021;RPBZ=-1.52241;BQBZ=1.63105;SCBZ=-0.275267;FS=0;MQ0F=0;AC=1;AN=1;DP4=5,5,23,4;MQ=60 GT:PL:SP 1:179,82:14
NC_000962.3 1195464 . G C 91.9508 frs DP=48;ADF=3,9;ADR=2,17;SCR=45;VDB=0.623966;SGB=-0.692976;RPBZ=-1.37012;BQBZ=-2.53249;SCBZ=-0.837343;FS=0;MQ0F=0;AC=1;AN=1;DP4=3,2,9,17;MQ=60 GT:PL:SP 1:156,37:5
NC_000962.3 1975798 . C A 24.6074 lq;frs DP=53;ADF=5,7;ADR=6,15;SCR=41;VDB=0.709921;SGB=-0.692562;RPBZ=-0.97432;BQBZ=-0.957621;FS=0;MQ0F=0;AC=1;AN=1;DP4=5,6,7,15;MQ=60 GT:PL:SP 1:160,108:3
NC_000962.3 2206637 . C T 131.953 frs DP=62;ADF=4,22;ADR=8,20;SCR=54;VDB=0.537781;SGB=-0.693146;RPBZ=0.218528;BQBZ=1.48496;SCBZ=0.763027;FS=0;MQ0F=0;AC=1;AN=1;DP4=4,8,22,20;MQ=60 GT:PL:SP 1:255,96:5
NC_000962.3 2646501 . A G 82.893 frs DP=49;ADF=4,11;ADR=3,20;SCR=39;VDB=0.579948;SGB=-0.69311;RPBZ=0.18835;BQBZ=-2.60436;SCBZ=0.989511;FS=0;MQ0F=0;AC=1;AN=1;DP4=4,3,11,20;MQ=60 GT:PL:SP 1:173,63:4
NC_000962.3 2818484 . A C 163.061 frs DP=62;ADF=1,19;ADR=7,24;SCR=55;VDB=0.990162;SGB=-0.693146;RPBZ=-1.82649;BQBZ=-0.597542;SCBZ=0.388718;FS=0;MQ0F=0;AC=1;AN=1;DP4=1,7,19,24;MQ=60 GT:PL:SP 1:201,11:9
NC_000962.3 4354953 . G C 204.177 frs DP=74;ADF=4,24;ADR=3,27;SCR=62;VDB=0.821578;SGB=-0.693147;RPBZ=-0.155212;BQBZ=0.824997;SCBZ=-1.67087;FS=0;MQ0F=0;AC=1;AN=1;DP4=4,3,24,27;MQ=60 GT:PL:SP 1:231,0:2
```
The `frs` of these 7 calls are:

```
DP4=5,5,23,4  -> 27/37 = 72.9%
DP4=3,2,9,17  -> 26/31 = 83.8%
DP4=5,6,7,15  -> 22/33 = 66.6%
DP4=4,8,22,20 -> 42/54 = 77.7%
DP4=4,3,11,20 -> 31/38 = 81.5%
DP4=1,7,19,24 -> 43/51 = 84.3%
DP4=4,3,24,27 -> 51/58 = 87.9%
```
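The arithmetic above is just alt-supporting reads over all DP4 reads. A sketch, assuming `DP4` follows the bcftools convention of (ref-fwd, ref-rev, alt-fwd, alt-rev):

```python
def frs(dp4):
    """Fraction of read support for the called allele:
    alt reads / all reads, with dp4 = (ref_fwd, ref_rev, alt_fwd, alt_rev)."""
    ref_fwd, ref_rev, alt_fwd, alt_rev = dp4
    return (alt_fwd + alt_rev) / sum(dp4)

# the seven filtered calls from above
calls = [(5, 5, 23, 4), (3, 2, 9, 17), (5, 6, 7, 15), (4, 8, 22, 20),
         (4, 3, 11, 20), (1, 7, 19, 24), (4, 3, 24, 27)]
for dp4 in calls:
    print(f"DP4={dp4} -> FRS={frs(dp4):.1%}")

# how many would pass a relaxed 80% threshold?
print(sum(frs(dp4) >= 0.80 for dp4 in calls))  # 4
```

This is where the "4 out of 7" figure in the next comment comes from: only four of these calls have FRS at or above 80%.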
I think we can get closer clusters by setting `frs` to 80% (doing it now). This would allow us to fish back 4 of the 7 missing alts in `R21408`, which would increase its distance to the two other samples and remove it from the cluster. Do you think this value would be OK in general, or is it too lenient?

Setting `frs` down to 80% did not work well:
> i agree, let's look at precision/recall - great work!
Moving to this now
Summarising previous messages: this seems to be the best clustering we got by playing with filters (obtainable either with 1) `min_qual 15`, or 2) `min_depth 5, min_qual 25, min_mq 30`; I prefer option 2 as it is stricter):
Outdated
Problem overview
In tbpore, we downsample the read depth to 150x and we have a lower bound of 85 on the phred-scaled VCF call quality (which accepts roughly 1 error in ~300M; this is an oversimplification of the quality score, as other data also affect it, but it is a good approximation). This causes us to flag, and thus discard, all non-perfect calls as low-quality calls. (We define a perfect call as one where all reads agree with the consensus, and non-perfect calls as all the others: even if a single read out of ~150 or even 1000 disagrees with the consensus, we will flag the call as low quality, because our quality threshold essentially accepts no disagreement.) In practice, we see perfect calls having phred-scaled quality between 250 and 350, and quasi-perfect calls (the ones with a single read disagreeing with the consensus) between 20 and 30; the exact values depend on how many reads overlap the locus and other info.

The issue is that hundreds of thousands of quasi-perfect calls get filtered out, and each becomes an `N` in the consensus sequence. When we cluster based on the consensus sequence, we might overcluster, as we consider that `N` equals any base: if we compare two samples, and one calls an `alt` and the other calls an `N`, but this `N` was due to a quasi-perfect call to `ref`, then we will not take this difference into consideration, whereas we should.

Data
I calculated the mean and max number of calls the H2H pipeline discards due to being tagged as low quality (the `lq` flag) while having an "acceptable" quality (phred-scaled VCF call quality >= 20, i.e. we accept at most 1 error in 100).

For nanopore, we have ~767k VCF calls (bases) on average per sample (and a max of 1158k calls) that are filtered out solely due to being `lq` while having quality >= 20. Note that these calls have no other flag, i.e. if we reduce the quality threshold from 85 to 20, we will fish back all these calls.

If we reduce the quality threshold down to 20, we would instead filter out just ~1.72 calls per sample solely due to being `lq` (157 calls in all the 91 samples). These would be the 3 `lq` calls with the highest quality that we would ignore; the second one has 102 reads calling `ref` and only 2 calling alt.

We can't fairly compare the stats above with the Illumina calls. IDK much about the H2H pipeline, but it seems that the Illumina VCFs were not generated by us, but by a third party using the COMPASS pipeline. Anyway, I did not find the same flags between the ONT and the Illumina VCFs, but I think the Illumina VCFs are filtered with a phred-scaled VCF call quality >= 25, due to this header. If we assume this, far fewer calls are filtered out (~255 per sample on average) solely due to being low quality (note that here we are not even requiring the quality to be >= 20, although we know it is < 25 already):
Conclusion
Just by looking at the nanopore stats, it seems we might be filtering way too much (an average of ~17.4% of the genome). If, when comparing two samples, an `alt` call matches an imperfect `ref` call of the other sample, or we have an imperfect `alt` call in one of the samples, this difference will not be counted, and thus we might overcluster our samples. A possible solution is to just decrease the lower bound for the phred-scaled VCF call quality from 85 to 20 and evaluate it. The issue is that I don't know how to evaluate whether results are better or not. 20 might also be too lenient a value, as we would discard just ~2 calls per sample. Maybe we should use the same value that COMPASS uses, 25? We need a way to evaluate results and do a parameter sweep to find the best threshold. I am wondering if rerunning the H2H pipeline would help us infer whether results are better.
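For reference, the thresholds discussed here (85, 25, 20) map to error probabilities via the Phred formula p = 10^(−Q/10). A quick check:

```python
def phred_to_error(q):
    """Phred-scaled quality Q -> probability the call is wrong."""
    return 10 ** (-q / 10)

for q in (20, 25, 85):
    print(f"Q={q}: 1 error in ~{1 / phred_to_error(q):,.0f}")
# Q=20 -> 1 in 100, Q=25 -> 1 in ~316, Q=85 -> 1 in ~316 million
```

Note how extreme the gap is: Q=85 is over three million times stricter than Q=25, which is why quasi-perfect calls (quality 20-30) fall so far below the current cutoff.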