mbhall88 / head_to_head_pipeline

Snakemake pipelines to run the analysis for the Illumina vs. Nanopore comparison.
GNU General Public License v3.0
5 stars 2 forks source link

Rebasecall nanopore data with guppy v5.0.16 #80

Closed mbhall88 closed 2 years ago

mbhall88 commented 2 years ago

This issue will summarise the changes in DST and clustering results after latest basecalling - with respect to the previous version, which was v3.4.5

mbhall88 commented 2 years ago

DST

Note: only mykrobe results are discussed here - drprg will be investigated further after this paper

Genotype concordance

Drugs with unchanged results: Ethambutol, Amikacin, Capreomycin, Ciprofloxacin, Kanamycin, Moxifloxacin, Ofloxacin, Pyrazinamide, Rifampicin

Isoniazid

Went from 9 FPs to 2 FPs (see #79)

One FP was due to Nanopore homopolymer deletion issues - mada_135 for isoniazid with mutations katG_GC1037C katG_CC1038C katG_CC1039C

One FP was in sample R20548 mutation katG_S315T and was filtered due to low Illumina depth (see below). This mutation had very high nanopore coverage (~110x). It should be noted that there were a number of low quality positions surrounding this site in both compass and bcftools VCFs.

Streptomycin

Went from 2 FNs to 1 FN. And from 2 FPs to 4 FPs (see below).

All 4 FPs were due to low Illumina depth (see below).

Summary

All mykrobe-Nanopore FNs are Illumina minor resistance (r) calls.

The remaining 11 FPs are all cases where mykrobe-Illumina called the alternate allele, but the call was filtered out due to LOW_TOTAL_DEPTH. This filter is controlled by the --min_proportion_expected_depth flag, which currently defaults to 30%. Do we want to play with this?

It should be noted that 6 out of the 11 FPs mentioned above relate to the same mutation in two samples and this mutation causes resistance to 3 drugs (amikacin, kanamycin, and capreomycin). So, really, these 11 FPs are best thought of as 7 FPs (2 x AMK/KAN/CAP, 4 x Strep., 1 x isoniazid)

Full data

Drug Tool Technology FN(R) FP(S) FNR(95% CI) FPR(95% CI) PPV(95% CI) NPV(95% CI)
Amikacin Mykrobe Nanopore 0(12) 2(139) 0.0% (0.0-24.2%) 1.4% (0.4-5.1%) 85.7% (60.1-96.0%) 100.0% (97.3-100.0%)
Capreomycin Mykrobe Nanopore 0(12) 2(139) 0.0% (0.0-24.2%) 1.4% (0.4-5.1%) 85.7% (60.1-96.0%) 100.0% (97.3-100.0%)
Ciprofloxacin Mykrobe Nanopore 1(17) 0(134) 5.9% (1.0-27.0%) 0.0% (0.0-2.8%) 100.0% (80.6-100.0%) 99.3% (95.9-99.9%)
Ethambutol Mykrobe Nanopore 2(56) 0(95) 3.6% (1.0-12.1%) 0.0% (0.0-3.9%) 100.0% (93.4-100.0%) 97.9% (92.8-99.4%)
Isoniazid Mykrobe Nanopore 0(80) 2(71) 0.0% (0.0-4.6%) 2.8% (0.8-9.7%) 97.6% (91.5-99.3%) 100.0% (94.7-100.0%)
Kanamycin Mykrobe Nanopore 0(13) 2(138) 0.0% (0.0-22.8%) 1.4% (0.4-5.1%) 86.7% (62.1-96.3%) 100.0% (97.3-100.0%)
Moxifloxacin Mykrobe Nanopore 1(17) 0(134) 5.9% (1.0-27.0%) 0.0% (0.0-2.8%) 100.0% (80.6-100.0%) 99.3% (95.9-99.9%)
Ofloxacin Mykrobe Nanopore 0(17) 0(134) 0.0% (0.0-18.4%) 0.0% (0.0-2.8%) 100.0% (81.6-100.0%) 100.0% (97.2-100.0%)
Pyrazinamide Mykrobe Nanopore 1(31) 0(120) 3.2% (0.6-16.2%) 0.0% (0.0-3.1%) 100.0% (88.6-100.0%) 99.2% (95.5-99.9%)
Rifampicin Mykrobe Nanopore 1(80) 0(71) 1.2% (0.2-6.7%) 0.0% (0.0-5.1%) 100.0% (95.4-100.0%) 98.6% (92.5-99.8%)
Streptomycin Mykrobe Nanopore 1(45) 4(106) 2.2% (0.4-11.6%) 3.8% (1.5-9.3%) 91.7% (80.4-96.7%) 99.0% (94.7-99.8%)
mbhall88 commented 2 years ago

DST

Phenotype concordance

Drugs with unchanged results: Amikacin, Ethambutol, Kanamycin, Ofloxacin, Rifampicin.

Capreomycin

1 extra FN for both Illumina and Nanopore. This was from sample mada_1-21, which previously failed QC due to nanopore coverage less than 30x, but which now passes as it has more than 30x coverage after rebasecalling.

Isoniazid

5 less FPs for Nanopore (see #79). Of the 4 FPs remaining, one is the indel mentioned in the above comment, the remaiing three are mutations called in both technologies and with strong support in both VCFs - fabG1_C-15T (x2) & katG_W191R

1 extra FN for Illumina and 2 Nanopore. This was from sample mada_1-21, which previously failed QC due to nanopore coverage less than 30x, but which now passes as it has more than 30x coverage after rebasecalling. The other nanopore FN (mada_1-17) is also a FN on Illumina (and has the novel mutation discussed next).

In the process of investigating this I have discovered something which I think is extremely important for us to address in this paper. 6/9 FNs have a mutation katG_R463L (found by drprg) which is not in the mykrobe catalogue. This mutation is listed in the recent WHO catalogue as not associated with resistance, yet I have found numerous examples of papers that say it is associated with INH resistance [1,2,3]. However, it should be noted that some have said the opposite [4 - cites other work showing same thing].

Streptomycin

One less FN. Of the remaining three FNs; one (mada_1-22)) I can't find any evidence for anything, one (mada_1-44) has strong VCF evidence (and unknown call in drprg) for rpsL_K88M, but this mutation is not in mykrobe catalogue - it is listed as associated with resistant in the new WHO catalogue, the final nanopore FN (mada_1-30) has a synonymous mutation for the mutation gid_A205A (gid_A205E causes resistance).

Two extra FPs. I haven't been through all of these FPs (but I can if needed). But refer to the above comment where many nanopore FPs had strong support in VCFs, but were filtered for low depth in Illumina.

Full data

Drug Technology Tool FN(R) FP(S) FNR(95% CI) FPR(95% CI) PPV(95% CI) NPV(95% CI)
Amikacin Illumina Mykrobe 2(11) 2(78) 18.2% (5.1-47.7%) 2.6% (0.7-8.9%) 81.8% (52.3-94.9%) 97.4% (91.1-99.3%)
Amikacin Nanopore Mykrobe 0(11) 2(78) 0.0% (0.0-25.9%) 2.6% (0.7-8.9%) 84.6% (57.8-95.7%) 100.0% (95.2-100.0%)
Capreomycin Illumina Mykrobe 1(1) 1(51) 100.0% (20.7-100.0%) 2.0% (0.3-10.3%) 0.0% (0.0-79.3%) 98.0% (89.7-99.7%)
Capreomycin Nanopore Mykrobe 1(1) 1(51) 100.0% (20.7-100.0%) 2.0% (0.3-10.3%) 0.0% (0.0-79.3%) 98.0% (89.7-99.7%)
Ethambutol Illumina Mykrobe 4(14) 16(77) 28.6% (11.7-54.6%) 20.8% (13.2-31.1%) 38.5% (22.4-57.5%) 93.8% (85.2-97.6%)
Ethambutol Nanopore Mykrobe 4(14) 14(77) 28.6% (11.7-54.6%) 18.2% (11.2-28.2%) 41.7% (24.5-61.2%) 94.0% (85.6-97.7%)
Isoniazid Illumina Mykrobe 9(51) 3(48) 17.6% (9.6-30.3%) 6.2% (2.1-16.8%) 93.3% (82.1-97.7%) 83.3% (71.3-91.0%)
Isoniazid Nanopore Mykrobe 9(51) 4(48) 17.6% (9.6-30.3%) 8.3% (3.3-19.6%) 91.3% (79.7-96.6%) 83.0% (70.8-90.8%)
Kanamycin Illumina Mykrobe 0(0) 1(52) - 1.9% (0.3-10.1%) 0.0% (0.0-79.3%) 100.0% (93.0-100.0%)
Kanamycin Nanopore Mykrobe 0(0) 1(52) - 1.9% (0.3-10.1%) 0.0% (0.0-79.3%) 100.0% (93.0-100.0%)
Ofloxacin Illumina Mykrobe 0(10) 4(77) 0.0% (-0.0-27.8%) 5.2% (2.0-12.6%) 71.4% (45.4-88.3%) 100.0% (95.0-100.0%)
Ofloxacin Nanopore Mykrobe 0(10) 4(77) 0.0% (-0.0-27.8%) 5.2% (2.0-12.6%) 71.4% (45.4-88.3%) 100.0% (95.0-100.0%)
Rifampicin Illumina Mykrobe 5(48) 1(44) 10.4% (4.5-22.2%) 2.3% (0.4-11.8%) 97.7% (88.2-99.6%) 89.6% (77.8-95.5%)
Rifampicin Nanopore Mykrobe 6(48) 1(44) 12.5% (5.9-24.7%) 2.3% (0.4-11.8%) 97.7% (87.9-99.6%) 87.8% (75.8-94.3%)
Streptomycin Illumina Mykrobe 5(8) 10(83) 62.5% (30.6-86.3%) 12.0% (6.7-20.8%) 23.1% (8.2-50.3%) 93.6% (85.9-97.2%)
Streptomycin Nanopore Mykrobe 3(8) 11(83) 37.5% (13.7-69.4%) 13.3% (7.6-22.2%) 31.2% (14.2-55.6%) 96.0% (88.9-98.6%)
iqbal-lab commented 2 years ago

This all looks great. I propose for the paper for concordance with illumina, we set truth to be mykrobe WITHOUT minor calls, as this is essentially the global standard. We advocate use of minors in the Mykrobe paper, but its not generally accepted practise. (Obviously we'll have to mention access to minors as an advantage of illumina).

iqbal-lab commented 2 years ago

Can discuss your issue about the mutations missing from the Mykrobe catalog. But I don't think this is important for this paper because the goal is to show, for a fixed catalogue, does nanopore allow same results as illumina. It's explicitly not about improving the catalogue.

mbhall88 commented 2 years ago

After the work in #81 and #82, here are the (tentative) final DST results

I have updated the table and figures in Authorea with these. Although the genotype concordance plot is not in there, just the table. Happy to add that figure too if needed? I thought the table was probably sufficient.

Phenotype concordance

image

Drug Technology Tool FN(R) FP(S) FNR(95% CI) FPR(95% CI) PPV(95% CI) NPV(95% CI)
Amikacin Illumina Mykrobe 1(11) 2(78) 9.1% (1.6-37.7%) 2.6% (0.7-8.9%) 83.3% (55.2-95.3%) 98.7% (93.0-99.8%)
Amikacin Nanopore Mykrobe 0(11) 2(78) 0.0% (0.0-25.9%) 2.6% (0.7-8.9%) 84.6% (57.8-95.7%) 100.0% (95.2-100.0%)
Capreomycin Illumina Mykrobe 1(1) 1(51) 100.0% (20.7-100.0%) 2.0% (0.3-10.3%) 0.0% (0.0-79.3%) 98.0% (89.7-99.7%)
Capreomycin Nanopore Mykrobe 1(1) 1(51) 100.0% (20.7-100.0%) 2.0% (0.3-10.3%) 0.0% (0.0-79.3%) 98.0% (89.7-99.7%)
Ethambutol Illumina Mykrobe 4(14) 14(77) 28.6% (11.7-54.6%) 18.2% (11.2-28.2%) 41.7% (24.5-61.2%) 94.0% (85.6-97.7%)
Ethambutol Nanopore Mykrobe 4(14) 14(77) 28.6% (11.7-54.6%) 18.2% (11.2-28.2%) 41.7% (24.5-61.2%) 94.0% (85.6-97.7%)
Isoniazid Illumina Mykrobe 9(51) 3(48) 17.6% (9.6-30.3%) 6.2% (2.1-16.8%) 93.3% (82.1-97.7%) 83.3% (71.3-91.0%)
Isoniazid Nanopore Mykrobe 9(51) 4(48) 17.6% (9.6-30.3%) 8.3% (3.3-19.6%) 91.3% (79.7-96.6%) 83.0% (70.8-90.8%)
Kanamycin Illumina Mykrobe 0(0) 1(52) - 1.9% (0.3-10.1%) 0.0% (0.0-79.3%) 100.0% (93.0-100.0%)
Kanamycin Nanopore Mykrobe 0(0) 1(52) - 1.9% (0.3-10.1%) 0.0% (0.0-79.3%) 100.0% (93.0-100.0%)
Ofloxacin Illumina Mykrobe 0(10) 4(77) 0.0% (-0.0-27.8%) 5.2% (2.0-12.6%) 71.4% (45.4-88.3%) 100.0% (95.0-100.0%)
Ofloxacin Nanopore Mykrobe 0(10) 4(77) 0.0% (-0.0-27.8%) 5.2% (2.0-12.6%) 71.4% (45.4-88.3%) 100.0% (95.0-100.0%)
Rifampicin Illumina Mykrobe 6(48) 1(44) 12.5% (5.9-24.7%) 2.3% (0.4-11.8%) 97.7% (87.9-99.6%) 87.8% (75.8-94.3%)
Rifampicin Nanopore Mykrobe 6(48) 1(44) 12.5% (5.9-24.7%) 2.3% (0.4-11.8%) 97.7% (87.9-99.6%) 87.8% (75.8-94.3%)
Streptomycin Illumina Mykrobe 4(8) 11(83) 50.0% (21.5-78.5%) 13.3% (7.6-22.2%) 26.7% (10.9-52.0%) 94.7% (87.2-97.9%)
Streptomycin Nanopore Mykrobe 3(8) 11(83) 37.5% (13.7-69.4%) 13.3% (7.6-22.2%) 31.2% (14.2-55.6%) 96.0% (88.9-98.6%)

Genotype concordance

image

Drug Tool Technology FN(R) FP(S) FNR(95% CI) FPR(95% CI) PPV(95% CI) NPV(95% CI)
Amikacin Mykrobe Nanopore 0(13) 1(138) 0.0% (0.0-22.8%) 0.7% (0.1-4.0%) 92.9% (68.5-98.7%) 100.0% (97.3-100.0%)
Capreomycin Mykrobe Nanopore 0(13) 1(138) 0.0% (0.0-22.8%) 0.7% (0.1-4.0%) 92.9% (68.5-98.7%) 100.0% (97.3-100.0%)
Ciprofloxacin Mykrobe Nanopore 0(16) 0(135) 0.0% (0.0-19.4%) 0.0% (0.0-2.8%) 100.0% (80.6-100.0%) 100.0% (97.2-100.0%)
Ethambutol Mykrobe Nanopore 0(54) 0(97) 0.0% (0.0-6.6%) 0.0% (0.0-3.8%) 100.0% (93.4-100.0%) 100.0% (96.2-100.0%)
Isoniazid Mykrobe Nanopore 0(81) 1(70) 0.0% (0.0-4.5%) 1.4% (0.3-7.7%) 98.8% (93.4-99.8%) 100.0% (94.7-100.0%)
Kanamycin Mykrobe Nanopore 0(14) 1(137) 0.0% (0.0-21.5%) 0.7% (0.1-4.0%) 93.3% (70.2-98.8%) 100.0% (97.3-100.0%)
Moxifloxacin Mykrobe Nanopore 0(16) 0(135) 0.0% (0.0-19.4%) 0.0% (0.0-2.8%) 100.0% (80.6-100.0%) 100.0% (97.2-100.0%)
Ofloxacin Mykrobe Nanopore 0(17) 0(134) 0.0% (0.0-18.4%) 0.0% (0.0-2.8%) 100.0% (81.6-100.0%) 100.0% (97.2-100.0%)
Pyrazinamide Mykrobe Nanopore 0(30) 0(121) 0.0% (0.0-11.4%) 0.0% (-0.0-3.1%) 100.0% (88.6-100.0%) 100.0% (96.9-100.0%)
Rifampicin Mykrobe Nanopore 0(79) 0(72) 0.0% (0.0-4.6%) 0.0% (0.0-5.1%) 100.0% (95.4-100.0%) 100.0% (94.9-100.0%)
Streptomycin Mykrobe Nanopore 0(47) 1(104) 0.0% (0.0-7.6%) 1.0% (0.2-5.2%) 97.9% (89.1-99.6%) 100.0% (96.4-100.0%)
mbhall88 commented 2 years ago

BCFtools filtering

These results have changed quite a bit since my thesis.

A recap - this is using bcftools v1.13 with guppy v5.0.16 data. Additionally, the recall and precision evaluation is now being done with hap.py.

Key for deciphering the x-axis labels in the figure below

Recall and precision are for SNPs only

image

And a plot showing the raw number of FNs and FPs

image

Subsequent clustering analysis is done using all filters (q85K90x20s1V1e-5).

mbhall88 commented 2 years ago

BCFtools dotplot

image

The inset shows all pairs of samples with an Illumina distance less than 20. The red points and area (without stripes) indicates pairs that have a nanopore distance greater than 12, but an Illumina distance <= 12 - i.e., a FN connection. The red area with the strips are pairs (depicted as squares) where the nanopore distance greater than 6 (but less than 12), but an (equivalent) Illumina distance <= 5. They are FNs at an Illumina threshold of 5, but not at 12.

The grey area is the inverse - i.e., Nanopore has made a connection when Illumina indicates there is none (FP).

mbhall88 commented 2 years ago

BCFtools clustering

This is the threshold sweep for each Illumina distance threshold. The x-axis shows the Nanopore threshold and each plot is a different Illumina threshold (distance).

image

Based on these sweeps, for Illumina thresholds 0, 2, 5, and 12; we use a Nanopore threshold of 3, 3, 6, and 12, respectively.

Threshold 0

image

Threshold 2

image

Threshold 5

image

Threshold 12

image


Given our focus in the paper is looking at how Nanopore compares to Illumina using PHE's process (i.e., clustering with a threshold of 12) these are great results.

Threshold 0 isn't great, but that seems pretty academic in the context of our aims I think.

Shall we remove thresholds 0 and 2, focus on 12, and show 5 for interests sake?

mbhall88 commented 2 years ago

Mixed dotplot

There are a few big outliers in the inset

mixed dotplot

and here is that plot with the outliers removed

image

So without those outliers, the results are very good.

My aim tomorrow is to investigate these outliers and see if there is anything obvious causing the disparity.

mbhall88 commented 2 years ago

Mixed simulations

I dropped thresholds 0 and 2 from these, but I can add them back in easily enough if we want them.

mixed_simulations

mbhall88 commented 2 years ago

Alright, so the big outliers in the mixed dotplot are the same from https://github.com/mbhall88/head_to_head_pipeline/issues/70#issuecomment-805361248

tl;dr: PPE54

There are some gaps in the mask in this gene, and nearly all of the discrepancies in the compass vs mixed distances lie within one of these gaps.

In general, what is happening is one sample's Illumina will call some decent quality SNPs (of REFs) in these regions, but the same sample's Nanopore will have a bunch of filtered calls for low quality and FRS. And then, the inverse happens in the other sample - i.e., the Illumina generally has no coverage in the gaps but the Nanopore has good quality calls. So when you do the same-tech distances, these sites are all ignored due to filtering on one of the techs, but when we do mixed, these sites pop up.

Who doesn't love PPE54?!

PS. the GC-content for PPE54 is only 61%...

In summary, I'm not sure there is much we can do to clean this up. But at least we have a pretty clear explanation.

mbhall88 commented 2 years ago

A comment from @iqbal-lab

i think a reviewer/reader will want to know that the extra clusters created by nanopore are genuinely reasonably close to being clusters, if you see what i mean

For threshold 0, the extra clusters have the following distances

sample1 sample2 COMPASS_dist ont_dist
17_616026 17_616156 1 1
R18040 R18043 3 2
R18040 R22601 1 0
R18043 R22601 2 1
R20983 R21770 2 1
R22601 R27937 3 3
R27657 R30078 3 1
mada_1-11 mada_115 6 3
mada_1-53 mada_2-53 1 0

For threshold 2, the one extra doubleton has the following distances

sample1 sample2 COMPASS_dist ont_dist
mada_1-11 mada_115 6 3

For threshold 5, the two extra clusters created have the following distances

sample1 sample2 COMPASS_dist ont_dist
R20260 R21408 13 6
R21408 R26791 21 5
mada_1-11 mada_1-28 9 6
mada_1-11 mada_1-8 12 5
mada_1-11 mada_115 6 3
mada_1-11 mada_125 8 4
mada_1-11 mada_142 8 6
mada_1-41 mada_2-31 6 4

The two samples on the bottom row are the doubleton cluster and the rest are in the bigger cluster. So all in all Nanopore is only off by a couple of SNPs

There were no extra clusters at threshold 12.