Closed mbhall88 closed 1 year ago
Yes, and just to keep our minds clean, we detect a minor allele (the genotype is major-REF plus minor-ALT), but this leads to a phenotype of R
First eyeball of the results of this feature are that it is way too permissive - there's a lot of minor alleles in each VCF. I'm working on adding in a second layer whereby the the minor allele must also have a GAPS value less than some threshold. Eyeballing some of the "true" minor alleles it seems they all have GAPS < 0.4 (most are 0.0)
Are you looking at illumina or nanopore data?
Illumina. I'm not going to both trying to do minor alleles on nanopore (I don't think there's any minor alleles in the H2H dataset anyway).
I'm surprised you're getting too many calls with illumina and a 10% threshold
Here are the Illumina results after running with the new implementation of minor allele calling (I chaged the markers so that you can see the error bars better). This is detecting a fraction of 0.2 with a maximum GAPS value of 0.4
The first result that sticks out here is the ETO specificity. We basically call all but 4 samples as resistant (out of ~7000 isolates with phenotype for ETO!!)
I obviously didn't look at all 7000 samples, but all of the samples I did look at have a minor resistance call for mutation ethA ACGGTTTAGCGC1479A. The interesting thing is that the proportion of depth on the ref and alt allele is almost exactly the same for all the samples I looked at (e.g, 0.59/0.41 ref/alt). However, the GAPS for all of the samples I looked at was 0.0/0.333 - we have the max. GAPS threshold set at 0.4. I will try lowering it to 0.3 and see what that looks like.
Tb-profiler does manage to have slightly higher sensitivity for nearly all drugs also
Oh that might be because they use threshold 0.1 and we use 0.2
Yeah possibly. I started with 0.2 and was thinking to finetune and then test out 0.1
Here are the results of changing the max. GAPS to 0.3
It fixes the ETO specificity issue above.
I did some digging into the fluoroquinolone (FQ) FNs as minor alleles are super important for these drugs.
There are two issues that have popped up in most of the cases.
When the minor allele's coverage is close to the major allele - i.e. fractions >0.4 for both alleles - pandora calls a null genotype. I saw this quite a bit (and also saw it a little for INH). When I switch to using local genotyping in pandora though it fixes these null calls. Given that, what is happening here is the maximum likelihood path is choosing on allele but genotyping is picking another, so we nullify the call. Does anyone forsee issues with switching to local genotyping?
There is a silent mutation gyrA S95T (which is susceptible) that kills an assumption I make when doing minor allele calling. My assumption is that if we have an ALT call, then skip checking for a minor allele. Which is fine in nearly all cases as ALTs generally mean resistance; but this allele doesn't mean resistance. So we call an alt, which is just the reference allele with this silent mutation and therefore don't look for minor alleles (which are there and in the same VCF position/site). Here is an example of this site
gyrA 380 a68d4d07 GACAG AACAC,CACAC,GACAC,GCCAC,GGCAC,TACAC,TACAG . PASS VC=PH_SNPs;GRAPHTYPE=NESTED;PDP=0,0,0,0.82392,0.17608,0,0,0;VARID=gyrA_D94G,gyrA_D94C,gyrA_D94A,gyrA_D94N,gyrA_D94H,gyrA_D94Y,gyrA_S95T;PREDICT=.,.,.,.,.,.,U GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 3:0,0,0,124,27,0,0,0:0,0,0,124,26,0,0,0:0,0,0,124,27,0,0,0:0,0,0,124,26,0,0,0:0,0,0,248,55,0,0,0:0,0,0,249,53,0,0,0:1,1,1,0,0,1,1,1:-1862.16,-1862.16,-1862.16,-247.957,-1250.38,-1862.16,-1862.16,-1862.16:1002.43
The silent mutation is allele 3.
So I think I will need to change some of my assumptions as this pops up a fair bit
Find the R allele with highest coverage Find the S with highest coverage If the R allele has >0.1 * the sum of these, you have an R call
- When Pandora makes a null call, does it still report depths? If yes, you can just look at the depth and make your own decisioj/call? Local genotyping....I'm not sure.
Yes, it does.
When I switch to using local genotyping, here is a table fo the changes in FNs and FPs
Tool | Drug | ΔFN | ΔFP |
---|---|---|---|
drprg | Amikacin | 0 | 0 |
drprg | Capreomycin | 0 | 0 |
drprg | Delamanid | 0 | 0 |
drprg | Ethambutol | 0 | -6 |
drprg | Ethionamide | -5 | 2 |
drprg | Isoniazid | -4 | 0 |
drprg | Kanamycin | 0 | 0 |
drprg | Levofloxacin | -3 | 1 |
drprg | Linezolid | 0 | 0 |
drprg | Moxifloxacin | -5 | 0 |
drprg | Ofloxacin | -1 | 0 |
drprg | Pyrazinamide | -2 | 0 |
drprg | Rifampicin | -2 | 0 |
drprg | Streptomycin | 1 | 0 |
e.g. A value of -2 means 2 less FN/FP
This seems like a strong case for local genotyping
- DrPrg could just ignore Pandora genotyping entirely, and do it's own using depth on alleles.
This seems unnecessary given the above I think?
Find the R allele with highest coverage Find the S with highest coverage If the R allele has >0.1 * the sum of these, you have an R call
This is almost what I do currently, but I just make some naive assumptions about REF and ALT alleles which I will fix and that wiill make it effectively the same as you suggest
Wow those are great results!
- When the minor allele's coverage is close to the major allele - i.e. fractions >0.4 for both alleles - pandora calls a null genotype. I saw this quite a bit (and also saw it a little for INH). When I switch to using local genotyping in pandora though it fixes these null calls. Given that, what is happening here is the maximum likelihood path is choosing on allele but genotyping is picking another, so we nullify the call. Does anyone forsee issues with switching to local genotyping?
Not really, we just switched away from local genotyping because we got worst results in the 20-way dataset. I remember checking a few calls manually, but not as much and not as deep as you are doing here, we were mostly driven by the final results. Would it be possible to have both VCFs from a sample (one with global and the other with local genotyping) and some calls where the local genotyping is better? I'd need that to be able to understand better what is happening. And if it is not an issue, the pandora compare output dir and command would be great too
my guess is michael is just going to go with local genotyping. I think there are two kinds of thing that can happen with local genotyping
Would it be possible to have both VCFs from a sample (one with global and the other with local genotyping) and some calls where the local genotyping is better? I'd need that to be able to understand better what is happening.
I don't have both on hand as they have been overridden with the local genotyping one. But could easily regenerate a pre-local one if you want. But what happened was when I look at the pandora consensus VCF the genotype was 0 and then genotyping VCF was trying to call GT 1, so pandora nullifies the call as a result.
And if it is not an issue, the pandora compare output dir and command would be great too
I don't use compare, just discover and map.
I suspect Zam's first point above is the likely situation
I don't use compare, just discover and map.
Fine, can be discover/map output dirs
This example means it was not a correct decision to deprecate the local genotyping for users, and it has its uses.
But what happened was when I look at the pandora consensus VCF the genotype was 0 and then genotyping VCF was trying to call GT 1, so pandora nullifies the call as a result.
This is one example I want to understand better why we have such a divergence, and if we have a better solution than simply nullifying the call. We need to revisit pandora genotyping and improve...
Could you rerun drprg using the most recent make_prg (0.4.0) and pandora (0.10.0-alpha.0) versions to check if results improve in any way?
Could you rerun drprg using the most recent make_prg (0.4.0) and pandora (0.10.0-alpha.0) versions to check if results improve in any way?
Yep, I will do this one at a time and check the change once this PR is complete
Alright, so after implementing the fix for point 2 here, the diff between the results before minor allele feature is
Tool | Drug | ΔFN | ΔFP |
---|---|---|---|
drprg | Amikacin | -8 | 7 |
drprg | Capreomycin | -5 | 3 |
drprg | Delamanid | 0 | 0 |
drprg | Ethambutol | -9 | 6 |
drprg | Ethionamide | -12 | 22 |
drprg | Isoniazid | -35 | 6 |
drprg | Kanamycin | -13 | 6 |
drprg | Levofloxacin | -21 | 4 |
drprg | Linezolid | 0 | 0 |
drprg | Moxifloxacin | -16 | 13 |
drprg | Ofloxacin | -5 | 1 |
drprg | Pyrazinamide | -2 | 3 |
drprg | Rifampicin | -4 | 0 |
drprg | Streptomycin | -9 | 2 |
The next two things to check before finalising this feature are using a minor allele fraction of 0.1 (results above were for 0.2) and removing the FRS filter
But wait , why suddenly so many more FPs?? Will only get worse if you lower to 0.1 and remove FRS?
More FPs because we are doing minor allele calling now.... This diff is with the results before I implemented minor allele calling. The diff before this was an earlier minor allele implementation before/after using local genotyping
Right but tb-profiler is also doing this without those FPs ? My expectation was that it would work well for fluoroquinolones at least (some precedent in mykrone paper and the new biorxiv from Alice brankin and Phil Fowler). Obviously those two aren't on this dataset tho
Right but tb-profiler is also doing this without those FPs ?
Not necessarily... These results are purely drprg FNs and FPs. We actually have less FPs than tb-profiler for all FQs at the moment
Drug | Tool | FN(R) | FP(S) | Sensitivity (95% CI) | Specificity (95% CI) |
---|---|---|---|---|---|
Amikacin | drprg | 69(485) | 57(6958) | 85.8% (82.4-88.6%) | 99.2% (98.9-99.4%) |
Amikacin | mykrobe | 93(485) | 51(6958) | 80.8% (77.1-84.1%) | 99.3% (99.0-99.4%) |
Amikacin | tbprofiler | 62(485) | 59(6958) | 87.2% (83.9-89.9%) | 99.2% (98.9-99.3%) |
Capreomycin | drprg | 57(235) | 95(2449) | 75.7% (69.9-80.8%) | 96.1% (95.3-96.8%) |
Capreomycin | mykrobe | 72(235) | 88(2449) | 69.4% (63.2-74.9%) | 96.4% (95.6-97.1%) |
Capreomycin | tbprofiler | 54(235) | 96(2449) | 77.0% (71.2-81.9%) | 96.1% (95.2-96.8%) |
Delamanid | drprg | 111(116) | 1(8152) | 4.3% (1.9-9.7%) | 100.0% (99.9-100.0%) |
Delamanid | mykrobe | 111(116) | 2(8152) | 4.3% (1.9-9.7%) | 100.0% (99.9-100.0%) |
Delamanid | tbprofiler | 111(116) | 2(8152) | 4.3% (1.9-9.7%) | 100.0% (99.9-100.0%) |
Ethambutol | drprg | 137(1538) | 742(4936) | 91.1% (89.6-92.4%) | 85.0% (83.9-85.9%) |
Ethambutol | mykrobe | 133(1538) | 747(4936) | 91.4% (89.8-92.7%) | 84.9% (83.8-85.8%) |
Ethambutol | tbprofiler | 118(1538) | 765(4936) | 92.3% (90.9-93.6%) | 84.5% (83.5-85.5%) |
Ethionamide | drprg | 329(1104) | 394(6105) | 70.2% (67.4-72.8%) | 93.5% (92.9-94.1%) |
Ethionamide | mykrobe | 265(1104) | 413(6105) | 76.0% (73.4-78.4%) | 93.2% (92.6-93.8%) |
Ethionamide | tbprofiler | 272(1104) | 414(6105) | 75.4% (72.7-77.8%) | 93.2% (92.6-93.8%) |
Isoniazid | drprg | 327(3900) | 170(4194) | 91.6% (90.7-92.4%) | 95.9% (95.3-96.5%) |
Isoniazid | mykrobe | 333(3900) | 170(4194) | 91.5% (90.5-92.3%) | 95.9% (95.3-96.5%) |
Isoniazid | tbprofiler | 297(3900) | 181(4194) | 92.4% (91.5-93.2%) | 95.7% (95.0-96.3%) |
Kanamycin | drprg | 129(670) | 107(6975) | 80.7% (77.6-83.6%) | 98.5% (98.1-98.7%) |
Kanamycin | mykrobe | 152(670) | 98(6975) | 77.3% (74.0-80.3%) | 98.6% (98.3-98.8%) |
Kanamycin | tbprofiler | 122(670) | 107(6975) | 81.8% (78.7-84.5%) | 98.5% (98.1-98.7%) |
Levofloxacin | drprg | 84(1040) | 101(5454) | 91.9% (90.1-93.4%) | 98.1% (97.8-98.5%) |
Levofloxacin | mykrobe | 88(1040) | 102(5454) | 91.5% (89.7-93.1%) | 98.1% (97.7-98.5%) |
Levofloxacin | tbprofiler | 85(1040) | 109(5454) | 91.8% (90.0-93.3%) | 98.0% (97.6-98.3%) |
Linezolid | drprg | 49(65) | 4(6110) | 24.6% (15.8-36.3%) | 99.9% (99.8-100.0%) |
Linezolid | mykrobe | 48(65) | 4(6110) | 26.2% (17.0-38.0%) | 99.9% (99.8-100.0%) |
Linezolid | tbprofiler | 48(65) | 5(6110) | 26.2% (17.0-38.0%) | 99.9% (99.8-100.0%) |
Moxifloxacin | drprg | 44(603) | 477(5431) | 92.7% (90.3-94.5%) | 91.2% (90.4-91.9%) |
Moxifloxacin | mykrobe | 44(603) | 473(5431) | 92.7% (90.3-94.5%) | 91.3% (90.5-92.0%) |
Moxifloxacin | tbprofiler | 42(603) | 482(5431) | 93.0% (90.7-94.8%) | 91.1% (90.3-91.9%) |
Ofloxacin | drprg | 26(105) | 5(424) | 75.2% (66.2-82.5%) | 98.8% (97.3-99.5%) |
Ofloxacin | mykrobe | 26(105) | 5(424) | 75.2% (66.2-82.5%) | 98.8% (97.3-99.5%) |
Ofloxacin | tbprofiler | 26(105) | 6(424) | 75.2% (66.2-82.5%) | 98.6% (96.9-99.3%) |
Pyrazinamide | drprg | 73(341) | 50(822) | 78.6% (73.9-82.6%) | 93.9% (92.1-95.4%) |
Pyrazinamide | mykrobe | 55(341) | 56(822) | 83.9% (79.6-87.4%) | 93.2% (91.3-94.7%) |
Pyrazinamide | tbprofiler | 45(341) | 62(822) | 86.8% (82.8-90.0%) | 92.5% (90.4-94.1%) |
Rifampicin | drprg | 138(3222) | 166(4586) | 95.7% (95.0-96.4%) | 96.4% (95.8-96.9%) |
Rifampicin | mykrobe | 164(3222) | 169(4586) | 94.9% (94.1-95.6%) | 96.3% (95.7-96.8%) |
Rifampicin | tbprofiler | 102(3222) | 177(4586) | 96.8% (96.2-97.4%) | 96.1% (95.5-96.7%) |
Streptomycin | drprg | 269(1042) | 132(1205) | 74.2% (71.4-76.7%) | 89.0% (87.2-90.7%) |
Streptomycin | mykrobe | 282(1042) | 135(1205) | 72.9% (70.2-75.5%) | 88.8% (86.9-90.5%) |
Streptomycin | tbprofiler | 257(1042) | 136(1205) | 75.3% (72.6-77.9%) | 88.7% (86.8-90.4%) |
These are the results from the most recent run.
Wow that comparison looks great. Moxi and ethionamide stand out for everyone tho, weird
No I did mean moxi , standing out for many FPs, so specificity is low
Frameshit eh
No I did mean moxi , standing out for many FPs, so specificity is low
Ah, right. Well the minor allele calling hasn't really increased the FPs much for it (see table here), but the sensitivity has come up a lot. I guess this is to be expected given Phil's recent paper
I'll take a look at those FPs once I've finalised the minor allele stuff
You are right, and anyway same for all tools
Alright, here is the diff when reducing minor allele calling from depth fraction 0.2 to 0.1
Tool | Drug | ΔFN | ΔFP |
---|---|---|---|
drprg | Amikacin | -1 | 0 |
drprg | Capreomycin | 0 | 0 |
drprg | Delamanid | 0 | 0 |
drprg | Ethambutol | -15 | 8 |
drprg | Ethionamide | 0 | 1 |
drprg | Isoniazid | -10 | 3 |
drprg | Kanamycin | -1 | 0 |
drprg | Levofloxacin | -3 | 2 |
drprg | Linezolid | 0 | 0 |
drprg | Moxifloxacin | -3 | 3 |
drprg | Ofloxacin | -2 | 0 |
drprg | Pyrazinamide | -1 | 1 |
drprg | Rifampicin | 0 | 1 |
drprg | Streptomycin | -7 | 2 |
I'm quite impressed with this. I was expecting a lot more FPs. So I think 0.1 is the winner!
Here is the latest Illumina results with the 0.1 fraction
Drug | Tool | FN(R) | FP(S) | Sensitivity (95% CI) | Specificity (95% CI) | MCC |
---|---|---|---|---|---|---|
Amikacin | drprg | 68(485) | 57(6958) | 86.0% (82.6-88.8%) | 99.2% (98.9-99.4%) | 0.861 |
Amikacin | mykrobe | 93(485) | 51(6958) | 80.8% (77.1-84.1%) | 99.3% (99.0-99.4%) | 0.836 |
Amikacin | tbprofiler | 62(485) | 59(6958) | 87.2% (83.9-89.9%) | 99.2% (98.9-99.3%) | 0.866 |
Capreomycin | drprg | 57(235) | 95(2449) | 75.7% (69.9-80.8%) | 96.1% (95.3-96.8%) | 0.672 |
Capreomycin | mykrobe | 72(235) | 88(2449) | 69.4% (63.2-74.9%) | 96.4% (95.6-97.1%) | 0.638 |
Capreomycin | tbprofiler | 54(235) | 96(2449) | 77.0% (71.2-81.9%) | 96.1% (95.2-96.8%) | 0.679 |
Delamanid | drprg | 111(116) | 1(8152) | 4.3% (1.9-9.7%) | 100.0% (99.9-100.0%) | 0.188 |
Delamanid | mykrobe | 111(116) | 2(8152) | 4.3% (1.9-9.7%) | 100.0% (99.9-100.0%) | 0.173 |
Delamanid | tbprofiler | 111(116) | 2(8152) | 4.3% (1.9-9.7%) | 100.0% (99.9-100.0%) | 0.173 |
Ethambutol | drprg | 122(1538) | 750(4936) | 92.1% (90.6-93.3%) | 84.8% (83.8-85.8%) | 0.693 |
Ethambutol | mykrobe | 133(1538) | 747(4936) | 91.4% (89.8-92.7%) | 84.9% (83.8-85.8%) | 0.689 |
Ethambutol | tbprofiler | 118(1538) | 765(4936) | 92.3% (90.9-93.6%) | 84.5% (83.5-85.5%) | 0.691 |
Ethionamide | drprg | 329(1104) | 395(6105) | 70.2% (67.4-72.8%) | 93.5% (92.9-94.1%) | 0.622 |
Ethionamide | mykrobe | 265(1104) | 413(6105) | 76.0% (73.4-78.4%) | 93.2% (92.6-93.8%) | 0.658 |
Ethionamide | tbprofiler | 272(1104) | 414(6105) | 75.4% (72.7-77.8%) | 93.2% (92.6-93.8%) | 0.653 |
Isoniazid | drprg | 317(3900) | 173(4194) | 91.9% (91.0-92.7%) | 95.9% (95.2-96.4%) | 0.879 |
Isoniazid | mykrobe | 333(3900) | 170(4194) | 91.5% (90.5-92.3%) | 95.9% (95.3-96.5%) | 0.876 |
Isoniazid | tbprofiler | 297(3900) | 181(4194) | 92.4% (91.5-93.2%) | 95.7% (95.0-96.3%) | 0.882 |
Kanamycin | drprg | 128(670) | 107(6975) | 80.9% (77.7-83.7%) | 98.5% (98.1-98.7%) | 0.805 |
Kanamycin | mykrobe | 152(670) | 98(6975) | 77.3% (74.0-80.3%) | 98.6% (98.3-98.8%) | 0.789 |
Kanamycin | tbprofiler | 122(670) | 107(6975) | 81.8% (78.7-84.5%) | 98.5% (98.1-98.7%) | 0.811 |
Levofloxacin | drprg | 81(1040) | 103(5454) | 92.2% (90.4-93.7%) | 98.1% (97.7-98.4%) | 0.896 |
Levofloxacin | mykrobe | 88(1040) | 102(5454) | 91.5% (89.7-93.1%) | 98.1% (97.7-98.5%) | 0.892 |
Levofloxacin | tbprofiler | 85(1040) | 109(5454) | 91.8% (90.0-93.3%) | 98.0% (97.6-98.3%) | 0.89 |
Linezolid | drprg | 49(65) | 4(6110) | 24.6% (15.8-36.3%) | 99.9% (99.8-100.0%) | 0.441 |
Linezolid | mykrobe | 48(65) | 4(6110) | 26.2% (17.0-38.0%) | 99.9% (99.8-100.0%) | 0.457 |
Linezolid | tbprofiler | 48(65) | 5(6110) | 26.2% (17.0-38.0%) | 99.9% (99.8-100.0%) | 0.447 |
Moxifloxacin | drprg | 41(603) | 480(5431) | 93.2% (90.9-94.9%) | 91.2% (90.4-91.9%) | 0.669 |
Moxifloxacin | mykrobe | 44(603) | 473(5431) | 92.7% (90.3-94.5%) | 91.3% (90.5-92.0%) | 0.669 |
Moxifloxacin | tbprofiler | 42(603) | 482(5431) | 93.0% (90.7-94.8%) | 91.1% (90.3-91.9%) | 0.668 |
Ofloxacin | drprg | 24(105) | 5(424) | 77.1% (68.2-84.1%) | 98.8% (97.3-99.5%) | 0.821 |
Ofloxacin | mykrobe | 26(105) | 5(424) | 75.2% (66.2-82.5%) | 98.8% (97.3-99.5%) | 0.808 |
Ofloxacin | tbprofiler | 26(105) | 6(424) | 75.2% (66.2-82.5%) | 98.6% (96.9-99.3%) | 0.802 |
Pyrazinamide | drprg | 72(341) | 51(822) | 78.9% (74.2-82.9%) | 93.8% (91.9-95.2%) | 0.741 |
Pyrazinamide | mykrobe | 55(341) | 56(822) | 83.9% (79.6-87.4%) | 93.2% (91.3-94.7%) | 0.77 |
Pyrazinamide | tbprofiler | 45(341) | 62(822) | 86.8% (82.8-90.0%) | 92.5% (90.4-94.1%) | 0.782 |
Rifampicin | drprg | 138(3222) | 167(4586) | 95.7% (95.0-96.4%) | 96.4% (95.8-96.9%) | 0.92 |
Rifampicin | mykrobe | 164(3222) | 169(4586) | 94.9% (94.1-95.6%) | 96.3% (95.7-96.8%) | 0.912 |
Rifampicin | tbprofiler | 102(3222) | 177(4586) | 96.8% (96.2-97.4%) | 96.1% (95.5-96.7%) | 0.927 |
Streptomycin | drprg | 262(1042) | 134(1205) | 74.9% (72.1-77.4%) | 88.9% (87.0-90.5%) | 0.647 |
Streptomycin | mykrobe | 282(1042) | 135(1205) | 72.9% (70.2-75.5%) | 88.8% (86.9-90.5%) | 0.629 |
Streptomycin | tbprofiler | 257(1042) | 136(1205) | 75.3% (72.6-77.9%) | 88.7% (86.8-90.4%) | 0.649 |
Thus is much better isn't it! Questions
just checking, is Mykrobe pinned to haploid here, or can it call minors?
it can call minor alleles
So, double checking: this is the same catalogue right? However tbprofiler takes vcf. Maybe some small issue in probe generation, especially where we have dense variation? But racon ought to plug those gaps. Weird. Are they all in the same samples?
100% the same catalogue for all tools. I'm going to dig into the discrepancies this week
It seems there are a decent number of cases where we don't call a minor allele because we fail the GAPS filer (min 0.3). However, in many of these cases, even the ref fails the GAPS filter. A lot of examples I am seeing have a small difference in GAPS between the ref and alt. (rpoB S450L has a lot of minor allele calls where both the ref and alt have GAPS 0.333). I'll try increasing the GAPS filter to 0.5, but also adding in a GAPS diff filter that sets an allowed max. difference between the ref and alt GAPS value.
The GAPS filter threshold is set in the expectation were looking at a major/haploid allele. I'd ignore it for minor allele detection, or as you're doing, change the threshold
Although ref failing gaps also is interesting
I added in a GAPS diff parameter. So we have a max GAPS value and a max GAPS diff, which is the difference between the major and minor allele's GAPS value (this can actually be negative - I found a case of the minor allele GAPS being lower than the called allele!).
I ran with max GAPS 0.5 and max diff 0.2 and I get the following diff
Tool | Drug | ΔFN | ΔFP |
---|---|---|---|
drprg | Amikacin | 0 | 0 |
drprg | Capreomycin | 0 | 0 |
drprg | Delamanid | 0 | 0 |
drprg | Ethambutol | -101 | 4135 |
drprg | Ethionamide | -1 | 3 |
drprg | Isoniazid | 0 | 0 |
drprg | Kanamycin | 0 | 0 |
drprg | Levofloxacin | -2 | 2 |
drprg | Linezolid | 0 | 1 |
drprg | Moxifloxacin | -1 | 0 |
drprg | Ofloxacin | 0 | 0 |
drprg | Pyrazinamide | -1 | 1 |
drprg | Rifampicin | -21 | 2 |
drprg | Streptomycin | 1 | 1 |
This is nearly great. We save a lot of RIF FNs. BUT, we end up calling everything EMB resistant by the looks of it. I'll dig into this today and see if I can get to the bottom of that
All of the EMB calls are due to the same position
embA 85 4b6d566c CCTACC CCTACA,CCTATC,TCTACC . PASS VC=PH_SNPs;GRAPHTYPE=SIMPLE;PDP=0.423729,0.169492,0.0847458,0.322034;OGT=0;VARID=embA_C-12T,embA_C-16G,embA_C-16T;PREDICT=.,.,r GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 3:18,8,4,14:7,2,1,5:25,0,0,12:7,0,0,3:90,25,25,59:37,7,7,22:0.4,0.666667,0.833333,0.5:-207.412,-316.464,-362.32,-249.056:41.6434
The minor allele has GAPS 0.5 and the ref has 0.4. This position seems to always have depth across the various alleles. My assumption is there is a shared minimizer between these alleles. For instance, the ref is CCTACC and the alt that always gets called is TCTACC. The last 5 bases of these two alleles are the same, so it is conceiveable that they share a minimizer that covers some/all of those 5 bases and not the first, given the small difference in GAPS.
I'll have a think about the best way to address this without losing the RIF TPs we gained.
If I break this variant into two sites (there is 3bp that match in between the first base and last two bases) this GAPS issue goes away.
So I see two options here
I'm kind of tempted to try both and see what happens separately for each...
Reducing the lin. match length to 3 gives the following diff (replaces the last diff)
Tool | Drug | ΔFN | ΔFP |
---|---|---|---|
drprg | Amikacin | 0 | 0 |
drprg | Capreomycin | 0 | 0 |
drprg | Delamanid | 0 | 0 |
drprg | Ethambutol | -8 | 24 |
drprg | Ethionamide | 3 | 2 |
drprg | Isoniazid | 12 | -1 |
drprg | Kanamycin | 0 | 0 |
drprg | Levofloxacin | -1 | 2 |
drprg | Linezolid | 0 | 1 |
drprg | Moxifloxacin | -1 | 0 |
drprg | Ofloxacin | 0 | 0 |
drprg | Pyrazinamide | 0 | 3 |
drprg | Rifampicin | -19 | 2 |
drprg | Streptomycin | 1 | 2 |
Not super happy with this one. Currently trying option 2 above
"This position seems to always have depth across the various alleles."
Could also be a paralogous kmer from elsewhere?
Bit weird, just this one position always having coverage on both alleles. Your idea of a shared minimizer would effectively mean those 2 alleles collapse to one path in the kmer graph instead of a bubble, or potentially the bubble becomes a bit shorter I suppose. Is there always the same coverage on both?
No, this is only an issue when we pay attention to minor alleles? So we always get a tiny bit of coverage on one allele? So this can't be a shared minimizer can it?
Maybe this is a mad idea, but what happens if you change your hash function?
No, this is only an issue when we pay attention to minor alleles? So we always get a tiny bit of coverage on one allele? So this can't be a shared minimizer can it?
All alleles get some coverage, but not the same amount on each. You're right, it can't be shared minimizers...
When I split the allele into two, I still got some coverage on the minor allele, but the GAPS increased on the minor and decreased on the called allele.
Maybe this is a mad idea, but what happens if you change your hash function?
As in the pandora hash function? I think this is what we currently use https://github.com/rmcolq/pandora/blob/945c0255e99ab32fd5b056dc1c16ffdf611ef478/thirdparty/src/inthash.cpp#L106-L117. Do you think there might be a hash collision?
This is the diff for using the max called GAPS filter (option 2 from above)
Tool | Drug | ΔFN | ΔFP |
---|---|---|---|
drprg | Amikacin | 0 | 0 |
drprg | Capreomycin | 0 | 0 |
drprg | Delamanid | 0 | 0 |
drprg | Ethambutol | 0 | 0 |
drprg | Ethionamide | -1 | 3 |
drprg | Isoniazid | 0 | 0 |
drprg | Kanamycin | 0 | 0 |
drprg | Levofloxacin | -2 | 2 |
drprg | Linezolid | 0 | 0 |
drprg | Moxifloxacin | -1 | 0 |
drprg | Ofloxacin | 0 | 0 |
drprg | Pyrazinamide | -1 | 1 |
drprg | Rifampicin | -19 | 2 |
drprg | Streptomycin | 1 | 1 |
This looks pretty good.
_Originally posted by @iqbal-lab in https://github.com/mbhall88/drprg-paper/issues/2
From https://github.com/mbhall88/drprg-paper/issues/2 we see that the bulk of the gap in sensitivity between drprg and tbprofiler for INH is to do with minor allele calls.
We can implement a variant of minor allele detectiong in drprg by looking at the coverage reported in the pandora VCF. If a variant is called reference allele, but we detect more than f (fraction) of the reads are from an allele that corresponds to a resistance mutation, then we call the resistance allele.