mbhall88 / head_to_head_pipeline

Snakemake pipelines to run the analysis for the Illumina vs. Nanopore comparison.
GNU General Public License v3.0

Nanopore homopolymer deletions in katG #79

Closed mbhall88 closed 2 years ago

mbhall88 commented 2 years ago

Homopolymer deletions in the Nanopore data are causing false-positive (FP) isoniazid resistance calls from mykrobe in katG.

An example from the mykrobe report for mada_124:

"Isoniazid": {
                "predict": "R",
                "called_by": {
                    "katG_GC1037G-GC2155074C": {
                        "variant": null,
                        "genotype": [
                            1,
                            1
                        ],
                        "genotype_likelihoods": [
                            -954.1221622343683,
                            -588.6343185592842
                        ],
                        "info": {
                            "coverage": {
                                "reference": {
                                    "percent_coverage": 100.0,
                                    "median_depth": 8,
                                    "min_non_zero_depth": 4,
                                    "kmer_count": 221,
                                    "klen": 21
                                },
                                "alternate": {
                                    "percent_coverage": 100.0,
                                    "median_depth": 19,
                                    "min_non_zero_depth": 16,
                                    "kmer_count": 318,
                                    "klen": 18
                                }
                            },
                            "expected_depths": [
                                42
                            ],
                            "contamination_depths": [],
                            "filter": [],
                            "conf": 365
                        },
                        "_cls": "Call.VariantCall"
                    },
                    "katG_CC1038C-GG2155073G": {
                        "variant": null,
                        "genotype": [
                            1,
                            1
                        ],
                        "genotype_likelihoods": [
                            -988.6347876632344,
                            -532.4747749838733
                        ],
                        "info": {
                            "coverage": {
                                "reference": {
                                    "percent_coverage": 100.0,
                                    "median_depth": 8,
                                    "min_non_zero_depth": 4,
                                    "kmer_count": 197,
                                    "klen": 21
                                },
                                "alternate": {
                                    "percent_coverage": 100.0,
                                    "median_depth": 19,
                                    "min_non_zero_depth": 16,
                                    "kmer_count": 318,
                                    "klen": 18
                                }
                            },
                            "expected_depths": [
                                42
                            ],
                            "contamination_depths": [],
                            "filter": [],
                            "conf": 456
                        },
                        "_cls": "Call.VariantCall"
                    },
                    "katG_CC1039C-GG2155072G": {
                        "variant": null,
                        "genotype": [
                            1,
                            1
                        ],
                        "info": {
                            "coverage": {
                                "reference": {
                                    "percent_coverage": 100.0,
                                    "median_depth": 8,
                                    "min_non_zero_depth": 4,
                                    "kmer_count": 198,
                                    "klen": 21
                                },
                                "alternate": {
                                    "percent_coverage": 100.0,
                                    "median_depth": 19,
                                    "min_non_zero_depth": 16,
                                    "kmer_count": 318,
                                    "klen": 18
                                }
                            },
                            "expected_depths": [
                                42
                            ],
                            "contamination_depths": [],
                            "filter": [],
                            "conf": 452
                        },
                        "_cls": "Call.VariantCall"
                    }
                }
            },

The corresponding Illumina report has no support for these indels.
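To make the pattern in the report above easier to see, here is a minimal Python sketch that pulls the reference and alternate median depths and the confidence out of each `called_by` entry. It only assumes the structure visible in the fragment above; the function name `summarise_calls` and the trimmed-down `isoniazid` dict are illustrative, and how the drug entry is nested inside a full mykrobe report is not shown here.

```python
def summarise_calls(drug_entry):
    """Return (variant, ref_median_depth, alt_median_depth, conf) for each call.

    `drug_entry` is the dict for a single drug, shaped like the
    "Isoniazid" fragment above (hypothetical helper, not part of mykrobe).
    """
    rows = []
    for name, call in drug_entry.get("called_by", {}).items():
        info = call["info"]
        cov = info["coverage"]
        rows.append(
            (
                name,
                cov["reference"]["median_depth"],
                cov["alternate"]["median_depth"],
                info["conf"],
            )
        )
    return rows


# Values copied from the mada_124 fragment above, trimmed to the fields used.
isoniazid = {
    "predict": "R",
    "called_by": {
        "katG_GC1037G-GC2155074C": {
            "info": {
                "coverage": {
                    "reference": {"median_depth": 8},
                    "alternate": {"median_depth": 19},
                },
                "conf": 365,
            }
        },
        "katG_CC1038C-GG2155073G": {
            "info": {
                "coverage": {
                    "reference": {"median_depth": 8},
                    "alternate": {"median_depth": 19},
                },
                "conf": 456,
            }
        },
        "katG_CC1039C-GG2155072G": {
            "info": {
                "coverage": {
                    "reference": {"median_depth": 8},
                    "alternate": {"median_depth": 19},
                },
                "conf": 452,
            }
        },
    },
}

calls = summarise_calls(isoniazid)
for variant, ref_depth, alt_depth, conf in calls:
    print(f"{variant}\tref={ref_depth}\talt={alt_depth}\tconf={conf}")
```

All three calls sit on the same homopolymer stretch and share the same alternate depth, which fits a single basecalling artefact being reported as several overlapping deletions.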

Currently, the underlying Nanopore data is basecalled with guppy v3.4.5. So, the first step is to test whether newer versions of guppy and tubby remove these errors or not.

Six samples have these indel issues in katG and can be used to test this.

Rather than re-basecall all of these samples, we will first test different versions on three samples from the same Nanopore run to reduce the amount of basecalling we need to do: mada_117, mada_118, and mada_124.

Checklist

Basecall and run mykrobe with model versions:

mbhall88 commented 2 years ago

Here are the results.

sample    guppy v3.4.5  guppy v3.6.0  tubby v3.6.0  guppy v5.0.16
mada_117  R             S             S             S
mada_118  R             S             S             S
mada_124  R             S             S             S

R and S refer to isoniazid predictions. These samples have a phenotype of S and Illumina genotype of S.
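As a small sanity check on the table above, the following Python sketch counts false-positive isoniazid calls per model, taking the phenotype (S for all three samples) as truth. The data are copied from the table; the helper name `false_positives` is illustrative.

```python
samples = ["mada_117", "mada_118", "mada_124"]

# Isoniazid predictions per basecalling model, in the same order as `samples`.
predictions = {
    "guppy v3.4.5": ["R", "R", "R"],
    "guppy v3.6.0": ["S", "S", "S"],
    "tubby v3.6.0": ["S", "S", "S"],
    "guppy v5.0.16": ["S", "S", "S"],
}

# All three samples are phenotypically susceptible.
truth = {sample: "S" for sample in samples}


def false_positives(calls, samples, truth):
    """Samples predicted resistant (R) whose phenotype is susceptible (S)."""
    return [s for s, call in zip(samples, calls) if call == "R" and truth[s] == "S"]


fp_by_model = {
    model: false_positives(calls, samples, truth)
    for model, calls in predictions.items()
}

for model, fps in fp_by_model.items():
    print(f"{model}: {len(fps)} false-positive call(s) {fps}")
```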

Whilst all of the more recent guppy/tubby models get the right call, the coverage on the reference and alternate alleles differs between them.

For example, taking the mada_124 indel above, katG_CC1039C has the following median coverage under the different models.

model          Ref  Alt
guppy v3.4.5     8   19
guppy v3.6.0    27   16
tubby v3.6.0    41   18
guppy v5.0.16   38   31
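One rough way to compare these rows is the share of median depth that falls on the alternate allele, alt / (ref + alt). The Python sketch below computes it from the numbers above. This ratio is only an illustrative summary and is not how mykrobe genotypes; it also ignores that the reference and alternate probes in the report have different lengths (klen 21 vs 18).

```python
# (ref, alt) median depths for katG_CC1039C in mada_124, copied from the table above.
median_depths = {
    "guppy v3.4.5": (8, 19),
    "guppy v3.6.0": (27, 16),
    "tubby v3.6.0": (41, 18),
    "guppy v5.0.16": (38, 31),
}


def alt_fraction(ref_depth, alt_depth):
    """Fraction of the combined median depth supporting the alternate allele."""
    total = ref_depth + alt_depth
    if total == 0:
        raise ValueError("no coverage on either allele")
    return alt_depth / total


alt_fractions = {
    model: alt_fraction(ref, alt) for model, (ref, alt) in median_depths.items()
}

for model, frac in alt_fractions.items():
    print(f"{model}: alt fraction = {frac:.2f}")
```

On this measure only guppy v3.4.5 has more depth on the alternate than on the reference, and tubby v3.6.0 has the lowest alternate share of the four.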

Sadly I do not have a guppy v5.0.16 tubby model yet to compare how that performs.

@iqbal-lab So the question is: which model do we proceed with?

On the one hand, tubby gives slightly better coverage in this example. On the other hand, using guppy makes the paper methods etc. much simpler.

iqbal-lab commented 2 years ago

I am pretty amazed the problem just gets fixed by updating guppy! I would vote for keeping this simple and using guppy in this paper. (Still quite a lot of coverage on the Alt in all cases, hard to avoid I guess.)