Code script to go from pandora VCF to DST result

mbhall88 commented 3 years ago

The first main step in producing resistance predictions is running pandora map on the AMR PRG (#66) with the reads.

The first attempt at this, I (randomly) picked sample R28581. The mykrobe call for this sample on both Illumina and Nanopore was resistant to Isoniazid, with variant katG_W191R and has a lot of supporting coverage.

The pandora genotyped VCF, selecting only alt calls

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sample
katG    523     .       GGAC    CGAC,GGGA,GGGC,GGGG,GGGT,TGAC   .       .       SVTYPE=PH_SNPs;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      6:4,1,1,1,1,1,2:9,5,4,4,5,5,13:4,0,0,0,0,0,2:6,0,0,0,0,0,13:13,4,4,4,4,4,4:28,17,17,17,17,17,27:0,0.666667,0.75,0.75,0.666667,0.666667,0:-205.8,-264.967,-273.565,-273.565,-264.967,-264.967,-195.202:10.5978
katG    671     .       TGG     AGA,AGG,CGA,CGC,CGG,CGT,GGA,GGC,GGG,GGT .       .       SVTYPE=PH_SNPs;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      5:0,0,0,0,0,11,0,0,0,0,0:1,0,1,0,0,13,0,0,0,1,0:0,0,0,0,0,12,0,0,0,0,0:1,0,0,0,0,13,0,0,0,0,0:0,0,1,0,0,46,0,0,0,1,0:2,0,5,0,0,54,0,0,0,5,0:1,1,0.75,1,1,0,1,1,1,0.666667,1:-174.367,-182.34,-167.117,-182.34,-182.34,-16.7851,-182.34,-182.34,-182.34,-164.7,-182.34:147.915

and trimming the unused alt alleles

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sample
katG    523     .       GGAC    TGAC    .       .       SVTYPE=PH_SNPs;GRAPHTYPE=SIMPLE;AC=1;AN=1       GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      1:4,2:9,13:4,2:6,13:13,4:28,27:0,0:-205.8,-195.202:10.5978
katG    671     .       TGG     CGG     .       .       SVTYPE=PH_SNPs;GRAPHTYPE=SIMPLE;AC=1;AN=1       GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      1:0,11:1,13:0,12:1,13:0,46:2,54:1,0:-174.367,-16.7851:147.915

using the current filtering strategy, the first variant, at position 523, would be filtered due to low FRS. The second variant would be kept.

The encouraging thing is if we look at that position in the panel VCF, the entry is

katG    671     katG_W191R      TGG     CGT,CGC,CGA,CGG,AGA,AGG 0       .       PAD=100;GENE=katG;VAR=W191R;RES=PROT;DRUGS=Isoniazid;ST=-

which is the variant we expect to find

mbhall88 commented 3 years ago

After thinking through a number of options for going from a Pandora VCF to a prediction, the method I have decided to pursue (at least initially - it may turn out to not work) is outlined below.

Filter the Pandora VCF: I don't think this requires much explanation and I will (initially) go with the filtering used in #48
Load Pandora VCF entries with an ALT genotype called into a HashMap where the key is a tuple of (gene, interval) where interval is the start/end range of the variant. This key will map to the VCF entry. A side note: we will also load filtered variants and make predictions on them, but keep those predictions will only be for debugging/investigative purposes. This will likely be useful for a multitude of reasons.
Loop over each variant in the panel VCF. Collect all Pandora variants for the same gene, whose interval has an intersection with the panel variant's interval. The advantage here is that we "seamlessly" handle when Pandora has merged panel variants in the PRG. For those intersecting Pandora variants, we look at the intersection interval and see whether the Pandora variant matches any of the panel variant ALTs. If it does match a panel variant ALT, we add an annotation to the VCF INFO field (say a tag called prediction) and write out an annotated version of the Pandora VCF, along with a mykrobe-style JSON.

One advantage of this approach is it removes the need to add another external dependency (bcftools) and also removes the need to deal with ORFs when trying to translate the Pandora consensus.

An example of how the comparison of the intersecting intervals will work is

We have panel variants

#CHROM    POS    ID    REF    ALT    INFO
gid    4    gid_M2I    ATG    ATA,ATC,ATT    DRUGS=drug1,drug2
gid    7    gid_L3F    TTA    TTC,TTT    DRUGS=drug3
gid    20    gid_A20C    A    C    drugs=drug4

and a Pandora variant

#CHROM    POS    ID    REF    ALT    ...    FORMAT    sample
gid    3    .    CATGTTAT    CATCTTAT,CATGTTCT    .    GT    1

The interval for the Pandora variant is [3, 11). So, for the first panel variant, we get an intersection interval of [4, 7). This interval on the called Pandora allele is ATC, which matches one of the panel variant ALT alleles, so we say this panel variant, gid_M2I, is present. For the next panel variant, the intersection interval is [7, 10). This interval on the called Pandora allele is TTA, which matches the panel variant REF, so we say it is not present. The third panel variant's interval does not intersect with the pandora variant, so we skip it. In the end, we say this sample is resistant to drug1 and drug2, due to variant gid_M2I

iqbal-lab commented 3 years ago

This is v slick. One minor issue (no pun intended), is AMR sites are enriched for having minor alleles present, which Pandora won't genotype as being present, as it does haploid calling. Conceivably you could in future keep a secondary hashmap for minor alleles?

Not sure if we can ever reliably detect minors with nanopore unless with huge depth. Mykrobe specifically does not call minors with nanopore

mbhall88 commented 3 years ago

Slick...that's a very nice compliment :blush:

Interestingly, if you look at the Pandora VCF above for R28581, the variant I said that would be filtered out, actually makes a call for panel variant katG_L141F. looking at the mykrobe info for this variant there is no evidence for coverage on the alt for either nanopore and Illumina, but pandora clearly thinks there's support for coverage on both ref and alt.
This doesn't directly answer your minor issue, but I guess it shows we should have enough info to try and do minors. I guess it is just a matter of having good data to test with

iqbal-lab commented 3 years ago

Well, we're could test by in silico mixing fastq from two isolates at different ratios I guess, choosing the isolates based on Compass vcfs so we get a mix of R and S alleles at a site. This definitely feels like a secondary thing though

mbhall88 commented 3 years ago

Very true. Maybe something for me to pursue in my postdoc :laughing:

mbhall88 commented 3 years ago

The vast majority of this is now done and implemented in drprg. There will be some debugging no doubt, but I think it is ok to close this.

mbhall88 / head_to_head_pipeline

Code script to go from pandora VCF to DST result #67