mbhall88 / drprg

Drug Resistance Prediction with Reference Graphs
https://mbh.sh/drprg/
MIT License
19 stars 1 forks source link

False positive argmatch results #11

Closed mbhall88 closed 1 year ago

mbhall88 commented 2 years ago

The argmatch function currentlty does what it is supposed to; it returns whether a panel variant matches with a variant in the pandora VCF. But there can be false positives which lead to FP resistance calls. For example

gid     715     cd8b83c0        ACGACG  ACGACA,GCGACG   .       PASS    SVTYPE=PH_SNPs;GRAPHTYPE=SIMPLE;VARID=gid_A205E,gid_A205*;PREDICT=R,R        GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF    2:13,10,25:14,9,27:12,0,24:14,0,28:55,30,102:57,28,110:0.5,0.666667,0:-360.9,-412.153,-214.773:146.127

(I've removed some VARIDs for brevity)

This variant simplifies to

gid     715     cd8b83c0        A  G   .       PASS    SVTYPE=PH_SNPs;GRAPHTYPE=SIMPLE;VARID=gid_A205E,gid_A205*;PREDICT=R,R        GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF    2:13,10,25:14,9,27:12,0,24:14,0,28:55,30,102:57,28,110:0.5,0.666667,0:-360.9,-412.153,-214.773:146.127

The two variants gid_A205E and gid_A205* do overlap this VCF record, but they're not technically a match.

This position, 715, maps to codon 205, which in the reference is GCA. So this VCF position is the last base in the codon.

Changing the last base to G, as this record calls, would make the codon GCA->GCG which in amino space is A->A - i.e., synonymous. However, what argmatch does (by design) is look in the panel VCF and see that a G at this position matches with those other two variants, which it does, but it's complicated....

Here are the panel VCF records for those two variants

gid     713     gid_A205E       GCA     GAA,GAG 0       .       PAD=100;GENE=gid;VAR=A205E;RES=PROT;DRUGS=Streptomycin;ST=-
gid     713     gid_A205*       GCA     TGA,TAA,TAG     0       .       PAD=100;GENE=gid;VAR=A205*;RES=PROT;DRUGS=Streptomycin;ST=-

Extracting just position 715 from these variants does indeed provide A->G as an option. But it ignores the fact that you would also need to change positions 713 and/or 714, which our running example at the top does not.

I need to think about the cleanest way to handle this... Any thoughts are most welcome though.

iqbal-lab commented 2 years ago

Suggest you turn a catalogue into three catalogues

Part A: nucleotide matches Use VCF definition of catalogue, and match Drprg vcf with catalogue vcf. Some trickiness with indel normalisation.

Part B: amino variants First translate the pandora-inferred sequence into amino sequence Then,

Finally notice if there is

Part C