iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License
107 stars 14 forks source link

Strange genotype selection #322

Closed mbhall88 closed 1 year ago

mbhall88 commented 1 year ago

I have a weird situation where pandora is selecting ref genotype when it should clearly be the alt - the alt has the highest likelihood.

gid     330     .       TGCCATTGGCGATAGCGCGGCCGGA       GGCTACGTCACGCACATTTGCCGGA       .       .       VC=PH_SNPs;GRAPHTYPE=NESTED     GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      0:7,0:5,0:10,0:9,0:57,0:47,0:0.375,1:-7.56853,-83.262:75.6935
gid     351     .       CGGA    CGGC,TGGA       .       .       VC=PH_SNPs;GRAPHTYPE=NESTED     GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      0:2,9,0:2,8,0:0,10,0:0,9,0:10,39,0:9,35,0:0.75,0,1:-95.4097,-21.0618,-124.709:74.3479

The position in question is 351. However, for some reason make_prg has not merged it with the variant above it - which happens to also span position 351. BUT, I am using --local genotyping, so pandora shouldn't be trying to make genotypes compatible in this scenario?

Here are the same positions in the consensus VCF

gid     330     .       TGCCATTGGCGATAGCGCGGCCGGA       GGCTACGTCACGCACATTTGCCGGA       .       .       VC=PH_SNPs;GRAPHTYPE=NESTED     GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS .:7,0:5,0:10,0:9,0:57,0:47,0:0.375,1
gid     351     .       CGGA    CGGC,TGGA       .       .       VC=PH_SNPs;GRAPHTYPE=NESTED     GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS 1:2,9,0:2,8,0:0,10,0:0,9,0:10,39,0:9,35,0:0.75,0,1

Any thoughts on what's happened here?

iqbal-lab commented 1 year ago
  1. This is weird, it has the highest likelihood
  2. Pandora does try to make things compatible with local genotyping as much as possible - the reason we abandoned it was that it became too hacky/messy to try and properly bolt on compatibility to local genotyping, and yet if you don't have compatibility, you can have conflicting genotypes.

I don't have a better suggestion than putting a breakpoint in and seeing what happens.

If you have time for a call next week, i'd be up for a chat as I don't remember/understand why you are using local genotyping, i forget the reason. I think it is/was because of minor alleles. Isn't your drprg minor analysis basically driven by looking at the coverage on both alleles, and could work the same on local and global genotyping? sorry for this basic question, have lost track

iqbal-lab commented 1 year ago

Sorry, was going to discuss with leandro, but he was teaching and then I was distracted. Will talk to him tomorrow!

mbhall88 commented 1 year ago

I don't remember/understand why you are using local genotyping, i forget the reason. I think it is/was because of minor alleles.

Motivation and justification.

Isn't your drprg minor analysis basically driven by looking at the coverage on both alleles, and could work the same on local and global genotyping?

Yes, but I still rely on the genotype. I only dig into null genotypes if they cross a start codon, and then, I'm just looking to see if we lose the gene (as a patch until #316 is sorted). I'm not really keen to ignore genotype and implement my own genotyping model when pandora has one already...

I dug into this a little more and the only reason I can see if genotype compatibility. I've removed the allele as POS 330 from the graph too as that is what was causing the problem and it only occurs in one sample that we build the PRG from (out of ~150 samples) and it doesn't contain any common resistance variants.

iqbal-lab commented 1 year ago

Fair enough, I'll chew on this. ,

leoisl commented 1 year ago

Sorry for the delay! I think what is happening in the genotyped VCF:

gid     330     .       TGCCATTGGCGATAGCGCGGCCGGA       GGCTACGTCACGCACATTTGCCGGA       .       .       VC=PH_SNPs;GRAPHTYPE=NESTED     GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      0:7,0:5,0:10,0:9,0:57,0:47,0:0.375,1:-7.56853,-83.262:75.6935
gid     351     .       CGGA    CGGC,TGGA       .       .       VC=PH_SNPs;GRAPHTYPE=NESTED     GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      0:2,9,0:2,8,0:0,10,0:0,9,0:10,39,0:9,35,0:0.75,0,1:-95.4097,-21.0618,-124.709:74.3479

with local genotyping is the following:

  1. pandora looks for the highest confidence call among all variants, and select that. That would be 330 calling ref (GT CONF 75.6935). This allele is TGCCATTGGCGATAGCGCGGCCGGA, spanning until pos 355;
  2. The next variant overlaps the 4 last bases of this first variant (these 4 last bases being CGGA), and has lower confidence than the first variant, so the choice on the first variant takes priority. Thus pandora makes the call on this second variant compatible with the first choice, forcing the call to be CGGA, i.e. also the ref.

IIRC, it is sort of a greedy algorithm, working from the highest to the lowest confidence calls, and making lower-confidence calls compatible with higher-confidence calls... I guess @rmcolq could correct us in case of any (or many) mistakes!

mbhall88 commented 1 year ago

That makes sense. I just thought that because I'm using local genotyping it would override that process?

iqbal-lab commented 1 year ago

That is how local genotyping works, or rather that's how Pandora tries to make local genotypes not self inconsistent.

leoisl commented 1 year ago

I think local genotyping is this process! Global genotyping will get the ML path and, for each site, will look at the most likely allele. If it agrees with the ML path (i.e. at site 1, ML path calls alt 1 and alt 1 is also the most likely) then it calls alt 1. Otherwise, it calls .. Local genotyping will instead work from the highest-confidence call towards the lowest one, possibly leaving the ML path, the goal is to make the lower-confidence calls compatible with higher-confidence calls if they are incompatible (either calling one of the alleles or nullifying the call)

leoisl commented 1 year ago

Sorry, did not see Zam's message before commenting, which is the same as mine, but much better summarised :joy: I think this is what happens, but also to be honest, last time I actually checked the genotyping code was some 2 years ago...

mbhall88 commented 1 year ago

Ohhh true. Yes, sorry I forgot global would just make it null. Thanks for clearing this up!