iqbal-lab / Mykrobe-predictor

Antibiotic resistance predictions in minutes on a laptop
Other
50 stars 19 forks source link

Force resistance predictions if species is not target species. #90

Closed Phelimb closed 8 years ago

iqbal-lab commented 8 years ago

What are the results on our existing staph/TB datasets with this commit/code?

iqbal-lab commented 8 years ago

I'm sure I saw a comment somewhere from you on the results with this code but can't find it

Phelimb commented 8 years ago

Sorry, haven't sent yet. Will do shortly.

I think we'll need to pull #89 first too.

iqbal-lab commented 8 years ago

Ah yes and #89 is the comment I'd seen before. OK, I guess we need data at both commits

Phelimb commented 8 years ago

Staph results for this PR:

Der: https://www.dropbox.com/s/1kdvcytyf6dus0f/der.xlsx?dl=0 Val : https://www.dropbox.com/s/rh8k8557e21p27p/val.xlsx?dl=0 Phe : https://www.dropbox.com/s/23n7j23uyho5exn/phe.xlsx?dl=0

Master:

Der : https://www.dropbox.com/s/vi2qt09r3cm4uhw/der_813.xlsx?dl=0 Val: https://www.dropbox.com/s/xi54wy6lzjbi3ye/val_813.xlsx?dl=0 Phe: TODO

Der summary: 7 fewer FP. No change in FN. Val summary: 1 fewer FP. No change in FN. Phe Summary: TODO

TB results to follow.

iqbal-lab commented 8 years ago

So these are figures for the code after both #89 and #90 are merged?

iqbal-lab commented 8 years ago

Sorry for accidental closing

Phelimb commented 8 years ago

These are actually results for #89. #90 is a trivial change that allows you to add --force to make resistance predictions even if species is wrong. See https://github.com/iqbal-lab/Mykrobe-predictor/issues/87 and https://github.com/iqbal-lab/Mykrobe-predictor/commit/0229013ea250916da579e701ea35b2e69c689cf7

iqbal-lab commented 8 years ago

Results good for DER and VAL for staph, but PHE is a problem. Across the board worse FN. Methicillin goes up from 4 to 11 (taking VME from 2 to 6). On the other hand FPs drop from awful to good.

Phelimb commented 8 years ago

Sorry, that was wrong commit for PHE (that was results with extended panel). Will rerun on Master to make sure and update the results.

iqbal-lab commented 8 years ago

While we're at it, can we add to this commit, or create now even if not in this repo, the means to auto compare a commit with previous or master or whatever. You are 90% there but still reporting overall confusion matrices only. I'd like input = two commits/branches and output=

  1. Are all results identical?
  2. Headlines on changes of FP, FN, VME, ME, PPV, NPV - overall
  3. as 2 but broken down by drugs
  4. List of samples where results change, with description of change

This is basically just a wrapper on what you have I think

Phelimb commented 8 years ago

OK, so below are the additional 11 FN in the PHE set in this PR. Top line is latest below is master. I haven't included the 15 additional TN (where master has a FP).

In summary, I think the code is doing the 'correct' thing. I.e. the genes that are no longer being called are statistically more likely to belong to the contaminate than S. aureus. This is probably not what we want in cases like H121340220-509-1 or H121800040-510-1 where the proportions are basically identical and the gene is on a plasmid.

I have a few proposed fixes:

1) Have an additional R model with is expected_covg + contamin_covg. This will fix multi copy plasmids. 2) If the coverages are very close together don't try to link gene to contamination as risk of FN worse than FP.

I'm also slightly worried that we're overestimating the abundance of CoGNs which is affecting calls. E.g. the additional FN call in H121320081-505-1 is caused by associating mecC with the S. epidermidis because they have identical coverages but I'd like to confirm that our estimate of S. epidermidis is not too high.

H121320081-505-1 Methicillin R S FN

H121320081-505-1 Methicillin R R TP

H121320081-505-1 Penicillin R S FN

H121320081-505-1 Penicillin R r TP

    "S. aureus": "71",
    "S. epidermidis": "39"

            "blaZ" :{
                    "per_cov": "72",
                    "median_cov": "6",
                    "conf": "19",
            "induced_resistance": "Penicillin"
            },

            "mecC" :{
                    "per_cov": "100",
                    "median_cov": "39",
                    "conf": "55",
            "induced_resistance": "Methicillin"
            },

            "dfrA" :{
                    "per_cov": "100",
                    "median_cov": "69",
                    "conf": "137",
            "induced_resistance": "Trimethoprim"
            },         

H121320082-506-1 Methicillin R S FN

H121320082-506-1 Methicillin R R TP

            "phylo_group": {
                    "Staphylococcus aureus": "90",
                    "Coagulase-Negative Staphylococcus": "53"
            },
            "species": {
                    "S. aureus": "90",
                    "S. epidermidis": "53"
            },
    "called_genes" :{
            "mecC" :{
                    "per_cov": "99",
                    "median_cov": "46",
                    "conf": "61",
            "induced_resistance": "Methicillin"
            }
    },

H121320099-507-1 Erythromycin R S FN H121320099-507-1 Penicillin R S FN

H121320099-507-1 Erythromycin R R TP H121320099-507-1 Penicillin R R TP

            "species": {
                    "S. aureus": "41",
                    "S. epidermidis": "61"
            },
            "ermC" :{
                    "per_cov": "99",
                    "median_cov": "369",
                    "conf": "970",
            "induced_resistance": "Erythromycin"
            },
            "blaZ" :{
                    "per_cov": "95",
                    "median_cov": "81",
                    "conf": "190",
            "induced_resistance": "Penicillin"
            },

H121340220-509-1 Methicillin R S FN H121340220-509-1 Erythromycin R S FN

H121340220-509-1 Methicillin R R TP H121340220-509-1 Erythromycin R R TP

            "species": {
                    "S. aureus": "57",
                    "S. epidermidis": "58"
            },
            "ermC" :{
                    "per_cov": "99",
                    "median_cov": "626",
                    "conf": "1655",
            "induced_resistance": "Erythromycin"
            },
            "mecA" :{
                    "per_cov": "99",
                    "median_cov": "75",
                    "conf": "163",
            "induced_resistance": "Methicillin"
            },

H121800040-510-1 Erythromycin R S FN

H121800040-510-1 Erythromycin R R TP

            "species": {
                    "S. aureus": "54",
                    "Unidentified Staphylococcus": "58"
            },
            "ermC" :{
                    "per_cov": "99",
                    "median_cov": "638",
                    "conf": "1689",
            "induced_resistance": "Erythromycin"
            },

RH11000301-155-1 Gentamicin R S FN

RH11000301-155-1 Gentamicin R r TP

            "species": {
                    "S. aureus": "169",
                    "S. epidermidis": "12"
            },
            "aacAaphD" :{
                    "per_cov": "98",
                    "median_cov": "9",
                    "conf": "26",
            "induced_resistance": "Gentamicin"
            }                

4009c4009 RH11000301-155-1 Penicillin R S FN

RH11000301-155-1 Penicillin R r TP "species": { "S. aureus": "169", "S. epidermidis": "12" }, "blaZ" :{ "per_cov": "100", "median_cov": "14", "conf": "46", "induced_resistance": "Penicillin" },
4669c4669 RH12000540-156-1 Penicillin R S FN

RH12000540-156-1 Penicillin R r TP

            "species": {
                    "S. aureus": "159",
                    "S. epidermidis": "5"
            },
            "blaZ" :{
                    "per_cov": "100",
                    "median_cov": "10",
                    "conf": "31",
            "induced_resistance": "Penicillin"
            },
iqbal-lab commented 8 years ago

OK, so taking a step back, we need some clarity here.

  1. If there is that much epidermidis in a PHE sample, the phenotype is suspect. Or there is contamination in the sequencing. Your model is calling resistance-due-to-aureus. Either the phenotype is resistant-due-to-epidermidis, or the sequencing experiment has epi contamination that was not in the pheno, or the pheno sample prep removed the epi by picking a colony. So what is the right thing to say?

a) for our measurements of how well Mykrobe does on a super PHE truth set, IF there really is that much epi, it needs to be excluded from that set as a truth set, and excluded from measures of VME/ME

b) for our measurements of how well Mykrobe does on real data, it is a valuable test. But here I think we should have a contamination/species tab, and say there it looks like meth resistant epi

ACTIONS:

  1. I agree about the estimate of epi. I would just map the reads to the epi genome and see what you see.

For your proposed fixes:

1) Have an additional R model with is expected_covg + contamin_covg. This will fix multi copy plasmids.

I don't understand this?

2) If the coverages are very close together don't try to link gene to contamination as risk of FN worse than FP.

I think when we see this, we need to be specifying clearly on the contamination section that we have a lot of contam, hard to infer where the gene/SNP is.

Phelimb commented 8 years ago

1) Have an additional R model with is expected_covg + contamin_covg. This will fix multi copy plasmids.

What I mean here is:

if we see for example:

aureus : 50 epi: 51

mecA 101

It's more likely that both the aureus and epi have mecA therefore meth R. Otherwise we would say more likely the mecA is from epi and therefore S.

iqbal-lab commented 8 years ago

OK, I understand. But in the case when we decided the mecA was from epi, we would mention the epi and the mec-from-epi on the contam tab?

Phelimb commented 8 years ago

Yes, that would be ideal.

iqbal-lab commented 8 years ago

OK, so

  1. We check how much epi there is in these samples and exclude some maybe from confusion matrices/VME estimation, and mention to Amy
  2. Test out your fix on the 3 datasets. Also, your fix - underlying you have an assumption that the same gene causes resistance in different organsisms (is reasonable), but also for SNPs. Prob true-ish, but don't know for sure. For MTBC/NTM, I don't know.
Phelimb commented 8 years ago

Test out your fix on the 3 datasets. Also, your fix - underlying you have an assumption that the same gene causes resistance in different organsisms (is reasonable), but also for SNPs. Prob true-ish, but don't know for sure. For MTBC/NTM, I don't know.

I'm not really assuming that the same gene causes resistance in different organisms. I'm just saying that it's more likely than not that S. aureus has the gene. The fact that Epi probably also has is only an aside.

iqbal-lab commented 8 years ago

Agreed.

Phelimb commented 8 years ago

OK, my fixes have reduced the FN down to 6 FN in 4 samples. All in samples with strong evidence for contamination.

I think that the resistotyping is doing the correct thing given the data in all case of FN calls below. Possibly we're overestimating the abundance of epi in H121320081-505-1 and H121320082-506-1 though which is causing this.

Barring any changes in val (running now) there are 16 fewer FP in this commit with 6 additional FN.

H121320081-505-1 Methicillin R S FN

H121320081-505-1 Methicillin R R TP

H121320081-505-1 Penicillin R S FN

H121320081-505-1 Penicillin R r TP

    "S. aureus": "71",
    "S. epidermidis": "39"

            "blaZ" :{
                    "per_cov": "72",
                    "median_cov": "6",
                    "conf": "19",
            "induced_resistance": "Penicillin"
            },

            "mecC" :{
                    "per_cov": "100",
                    "median_cov": "39",
                    "conf": "55",
            "induced_resistance": "Methicillin"
            },

            "dfrA" :{
                    "per_cov": "100",
                    "median_cov": "69",
                    "conf": "137",
            "induced_resistance": "Trimethoprim"
            },         

H121320082-506-1 Methicillin R S FN

H121320082-506-1 Methicillin R R TP

            "phylo_group": {
                    "Staphylococcus aureus": "90",
                    "Coagulase-Negative Staphylococcus": "53"
            },
            "species": {
                    "S. aureus": "90",
                    "S. epidermidis": "53"
            },
    "called_genes" :{
            "mecC" :{
                    "per_cov": "99",
                    "median_cov": "46",
                    "conf": "61",
            "induced_resistance": "Methicillin"
            }
    },                

RH11000301-155-1 Gentamicin R S FN

RH11000301-155-1 Gentamicin R r TP RH11000301-155-1 Penicillin R S FN

RH11000301-155-1 Penicillin R r TP

            "species": {
                    "S. aureus": "169",
                    "S. epidermidis": "12"
            },
            "aacAaphD" :{
                    "per_cov": "98",
                    "median_cov": "9",
                    "conf": "26",
            "induced_resistance": "Gentamicin"
            }                

            "blaZ" :{
                    "per_cov": "100",
                    "median_cov": "14",
                    "conf": "46",
            "induced_resistance": "Penicillin"
            },   

RH12000540-156-1 Penicillin R S FN

RH12000540-156-1 Penicillin R r TP

            "species": {
                    "S. aureus": "159",
                    "S. epidermidis": "5"
            },
            "blaZ" :{
                    "per_cov": "100",
                    "median_cov": "10",
                    "conf": "31",
            "induced_resistance": "Penicillin"
            },                        
iqbal-lab commented 8 years ago

OK, so can you resummarise after splitting samples as follows.

  1. Do not use FN/TN/TP/FP for samples with contam. Ie remove from der/val/PHE samples with contam. Then report on VME/FP/etc
  2. Are we doing right thing for contam samples
iqbal-lab commented 8 years ago

Also where did you get these phages, and why have you added them?

Phelimb commented 8 years ago

OK, by " with contam" do you mean at any level or ones with high-contamination that we don't claim we can handle. A few samples have low level contamination that we should be able to handle.

I got them from metaphlan they're being called in our FN PHE set. Wanted to see if they're in our epi panel by mistake. Probably not necessary to commit them but could be interesting to type regardless?

iqbal-lab commented 8 years ago

OK, by " with contam" do you mean at any level or ones with high-contamination that we don't claim we can handle. A few samples have low level contamination that we should be able to handle.

I think we should handle everything. But I don't think we should compare with phenotype when a) phenotype not well defined if there is contam b) we don't know if the sequenced sample=phenotyped sample I am trying to draw a distinction between (i) comparing with phenotype and measuring our error rate and (ii) making sure we do the right thing in any sample even with a lot of contam. Thus if there is a lot of contam, we definitely exclude from comparing with pheno. If there is a small amount of contam, well, open to debate.

For phages, when you say

I got them from metaphlan they're being called in our FN PHE set. What does that mean. The samples which we are calling FN in PHE all have these phages in them according to metaphlan? Interesting maybe yes, but remember phages can move between species potentially

iqbal-lab commented 8 years ago

Sorry, just reread your "Wanted to see if they're in our epi panel by mistake" - makes sense

Phelimb commented 8 years ago

Marginal change in MTBC val set :

1 additional FP in AMI and 1 additional TN in AMI.

There's also a fairly major change in ETH due to the fact that I accidentally included some of the panel for ETH that we excluded earlier in order to be more specific. This effectively adds the same number of TN as FP, so increases sensitivity for the sake of specificity or in other words reduced VME and increases ME. I actually think there's probably an argument for keeping it in there (or at least revisiting whether we want to favour sensitivity or specificity in TB)

iqbal-lab commented 8 years ago

OK, well I'm happy to revisit that question, but right now I'd like to stick to the existing panel, could you revert the ETH to the panel we had before?

iqbal-lab commented 8 years ago

Also, I think we have outstanding questions on staph?

OK, so can you resummarise after splitting samples as follows. Do not use FN/TN/TP/FP for samples with contam. Ie remove from der/val/PHE samples with contam. Then report on VME/FP/etc