Strange difference in overlapping positions between make_prg versions

I have recently tested drprg out after upgrading make_prg from https://github.com/leoisl/make_prg/releases/tag/v0.3.0 to the latest v0.4.0

The results were almost the same across ~8500 samples. However, there were 3 that have gone from being TPs to FNs.

It has to do with a region in the gene gid where the positions have weird overlaps that differ between make_prg versions. This is all a bit easier if I put in some examples

Here is the region from https://github.com/leoisl/make_prg/releases/tag/v0.3.0 (this leads to a correct ALT call in POS 513)

gid     505     f3d1ef78        G       T       .       PASS    VC=SNP;GRAPHTYPE=NESTED;PDP=1,0 GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF    0:12,0:7,0:12,0:7,0:25,0:15,0:0.5,1:-12.421,-127.498:115.077
gid     505     2de45418        GTCACGGGC       TTGGGCGGCAGCGACGCTGC    .       PASS    VC=COMPLEX;GRAPHTYPE=NESTED;PDP=0.722222,0.277778;VARID=gid_A138V,gid_R137W,gid_R137P,gid_A138T;PREDICT=F,F,F,F       GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      .:8,3:5,2:0,0:0,0:25,23:15,13:0.666667,0.833333:-39.9668,-86.3427:46.3759
gid     513     aac746d5        C       T       .       PASS    VC=SNP;GRAPHTYPE=NESTED;PDP=0,1;VARID=gid_A138T,gid_A138V;PREDICT=.,R   GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF    1:0,23:0,13:0,23:0,13:0,23:0,13:1,0:-205.786,-7.87333:197.913

However, this is the same region with the latest v0.4.0 make_prg

gid     505     8b2d50ba        G       T       .       PASS    VC=SNP;GRAPHTYPE=NESTED;PDP=1,0 GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF    0:12,0:7,0:12,0:7,0:25,0:15,0:0.5,1:-12.421,-127.498:115.077
gid     505     ec5fd1a2        GTCACGGGC       TTGGGCGGCAGCGACGCTGC    .       PASS    VC=COMPLEX;GRAPHTYPE=NESTED;PDP=0.722222,0.277778;VARID=gid_A138V,gid_A138T,gid_R137W,gid_R137P;PREDICT=.,.,.,.       GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      0:8,3:5,2:0,0:0,0:25,23:15,13:0.666667,0.833333:-39.9668,-86.3427:46.3759
gid     511     bad07f5f        GGC     GGT     .       frs     VC=PH_SNPs;GRAPHTYPE=NESTED;PDP=0.342105,0.657895;VARID=gid_R137W,gid_A138V,gid_R137P,gid_A138T;PREDICT=.,.,.,. GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF    0:8,16:5,9:0,23:0,13:25,48:15,28:0.666667,0.333333:-132.07,-69.6442:62.4261

you can see the POS 511 variant is filtered due to FRS - there is coverage on the ref and alt, where the same variant (POS 513) in the previous version has no coverage on the ref.

The second variant in those examples overlaps both the first and third variants.

This 513C>T variant is discovered by denovo in both cases - i.e. the reference PRG is effectively just the first two variants

In both cases I am using pandora v0.10.0-alpha.0

The make_prg command is:

For the v0.3.0 version (you can find all the drprg/pandora/make_prg files at /hps/nobackup/iqbal/mbhall/drprg/tmp/predict-ERR2510499)

make_prg from_msa -t 4 -L 5 -N 5 -v -o updated -i /hps/nobackup/iqbal/mbhall/drprg/tmp/predict-ERR2510499/update_msas

and for the v0.4.0 version (all files are at /hps/nobackup/iqbal/mbhall/drprg/paper/results/amr_predictions/drprg/illumina/PRJEB25972/SAMEA1101807/ERR2510499/)

make_prg from_msa -t 2 -L 5 -N 5 -v --force -o updated -i /hps/nobackup/iqbal/mbhall/drprg/paper/results/amr_predictions/drprg/illumina/PRJEB25972/SAMEA1101807/ERR2510499/update_msas

Let me know if you need any other info @leoisl

Thanks for the bug description, it is complete, I don't think I need any more info.

Retelling the story in other words

This is just for a better understanding. I can see local genotyping was used here, so in the first case:

Here is the region from https://github.com/leoisl/make_prg/releases/tag/v0.3.0 (this leads to a correct ALT call in POS 513)

gid     505     f3d1ef78        G       T       .       PASS    VC=SNP;GRAPHTYPE=NESTED;PDP=1,0 GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF    0:12,0:7,0:12,0:7,0:25,0:15,0:0.5,1:-12.421,-127.498:115.077
gid     505     2de45418        GTCACGGGC       TTGGGCGGCAGCGACGCTGC    .       PASS    VC=COMPLEX;GRAPHTYPE=NESTED;PDP=0.722222,0.277778;VARID=gid_A138V,gid_R137W,gid_R137P,gid_A138T;PREDICT=F,F,F,F       GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      .:8,3:5,2:0,0:0,0:25,23:15,13:0.666667,0.833333:-39.9668,-86.3427:46.3759
gid     513     aac746d5        C       T       .       PASS    VC=SNP;GRAPHTYPE=NESTED;PDP=0,1;VARID=gid_A138T,gid_A138V;PREDICT=.,R   GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF    1:0,23:0,13:0,23:0,13:0,23:0,13:1,0:-205.786,-7.87333:197.913

What I think happened is:

Let's genotype the highest-likely variant (the 3rd one), 513T;
This genotyping makes any genotyping of 505 GTCACGGGC TTGGGCGGCAGCGACGCTGC incompatible, so lets genotype . for it;
Let's genotype the first variant to 505G;

For the 0.4.0 version:

gid     505     8b2d50ba        G       T       .       PASS    VC=SNP;GRAPHTYPE=NESTED;PDP=1,0 GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF    0:12,0:7,0:12,0:7,0:25,0:15,0:0.5,1:-12.421,-127.498:115.077
gid     505     ec5fd1a2        GTCACGGGC       TTGGGCGGCAGCGACGCTGC    .       PASS    VC=COMPLEX;GRAPHTYPE=NESTED;PDP=0.722222,0.277778;VARID=gid_A138V,gid_A138T,gid_R137W,gid_R137P;PREDICT=.,.,.,.       GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      0:8,3:5,2:0,0:0,0:25,23:15,13:0.666667,0.833333:-39.9668,-86.3427:46.3759
gid     511     bad07f5f        GGC     GGT     .       frs     VC=PH_SNPs;GRAPHTYPE=NESTED;PDP=0.342105,0.657895;VARID=gid_R137W,gid_A138V,gid_R137P,gid_A138T;PREDICT=.,.,.,. GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF    0:8,16:5,9:0,23:0,13:25,48:15,28:0.666667,0.333333:-132.07,-69.6442:62.4261

Let's genotype the highest-likely variant (the 1st) to 505G;
This makes it so that the only compatible genotyping for the 2nd variant is 505GTCACGGGC;
And the genotyping on the 2nd variant means that the only compatible genotyping for the 3rd variant is 511GGC, although 511GGT has higher likelihood.

Is the issue on make_prg or pandora?

I actually think the issue is on pandora. The main difference is that in v0.0.3 we have 513C>T and in v0.0.4 we have 511GGC>GGT. We have gaps in the bases 511GG, which makes the variant 513C>T not the highest-likely one anymore and does not drive the genotyping. I think the objective of make_prg is to describe the variants, and from a descriptive POV, 513C>T is equivalent to 511GGC>GGT, i.e. the only difference is 513C>T. I think it is actually pandora responsibility to just genotype the differences between alleles. Common bases should not interfere on the likelihood calculation.

Does genotyping only the differences solves this and similar cases?

For this particular case, yes, I think genotyping only the differences would solve it. But with local genotyping, we could easily make a similar example where we would still have an issue. Let's say that 505G has the highest likelihood, forcing 505GTCACGGGC and then 513C, whereas it should be 513T. I think one common issue in all these genotyping is that a choice on 505G>T is forcing a choice on 505GTCACGGGC>TTGGGCGGCAGCGACGCTGC. 505GTCACGGGC>TTGGGCGGCAGCGACGCTGC is a totally unreliable call: for the first allele, we have 66.6% of gaps (just 3 out of 9 bases have coverage), and for the second allele we have 83.3% of gaps (just 3 out of 20 bases have coverage). I think we should nullify any call on this variant. We should have a parameter --min-gaps, defaulting to 0.33 or 0.5, such that we should filter any calls with gaps higher than this. This would mean that in both cases we would filter out the 505GTCACGGGC>TTGGGCGGCAGCGACGCTGC variant, and then we would just genotype 505G>T and 513C>T (this last one with the feature of genotyping the differences only between alleles)

There is a bug on the gaps computation

If we look at this record:

gid     505     f3d1ef78        G       T       .       PASS    VC=SNP;GRAPHTYPE=NESTED;PDP=1,0 GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF    0:12,0:7,0:12,0:7,0:25,0:15,0:0.5,1:-12.421,-127.498:115.077

The gaps for the first allele is 0.5, which I think is impossible given that we just have 1 base here. We should fix this.

Thoughts on local genotyping

I am not sure about the state of local genotyping. When we saw that global genotyping was better for E. coli, we deprecated local genotyping. We might have bugs we need to solve, or not... we need to revisit.

Discussion

Genotyping is a crucial step in pandora we haven't revisited. We are discovering some bugs, so we need to revisit it, and add more strict testing. But I'll let @iqbal-lab decide the priority of tasks

The whole idea of the genotyping model in gramtools was to fix the issue of "Common bases should not interfere on the likelihood calculation", so at some point i thought we could import that model into pandora.
i need to study the above a bit

I would argue this is also a make_prg issue. Shouldn't all three positions be a bubble as they have overlapping positions?

Common bases should not interfere on the likelihood calculation.

But they do because the GAPS influence the likelihood, hence why 513G>T (v0.3.0) seems a better choice than 511GGC>GGT (v0.4.0).

The gaps for the first allele is 0.5, which I think is impossible given that we just have 1 base here.

But the coverage on a base is made up for the mean of minimizers no? As such, it makes total sense as with a kmer size of 15, the coverage on this position is likely influenced by multiple minimzers, with some of those likely coming from the SNP upstream at position 513, thus creating a hole in coverage.

I really don't think pandora's genotyping is too bad here, I just think the PRG is confusing.

I'm seeing these annoying duplications of positions a bit lately. And they're causing me to miss calls in drprg.

A recent example I have. Here is an example of the graph (in VCF format)

katG    1044    ab6329a3        GC      AC,CA,CC

This position covers the S315T mutation (allele 3 is most common, but allele 2 is also S315T) in katG, which is SUPER important and common.

I have an example where accession ERR2516571 has this S315T mutation, plus a novel mutation I317V. As this is close, when the new mutation is added, we would expect it to be included in the position above.

This does happen, but something weird also happens

katG    1044    0ffa19b2        G       A       .       ld      VC=SNP;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF    .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-92,-92:0
katG    1044    e785a86a        GCGGCA  CAGGCA,CCGGCA,CCGGCG    .       ld      VC=PH_SNPs;GRAPHTYPE=NESTED;PDP=0,0,0.247059,0.752941   GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF    0:0,0,12,35:0,0,9,29:0,0,0,36:0,0,0,29:0,0,72,179:0,0,58,149:1,1,0.666667,0:-483.439,-483.439,-336.376,-102.844:233.533

There is a duplication of the position 1044 with a SNP for some weird reason. That position has no depth, and gets genotyped as null. But the "correct" 1044 position gets genotyped as ref (assumably to make genotypes compatible?). We can see that pandora gets depth on the correct allele (allele 3), but doesn't call that genotype because of the confusing duplicate position.

Here is what de novo finds

1044    G       C
1049    A       G

@leoisl do you know why make_prg is making these weird duplicated positions?

I don't think this is a make_prg issue. Will provide data backing this up. So I basically got this site, and put the PRG string, PRG GFA and the MSA together. For the MSA, I did a deduplication, I got some flanks around this region and deduplicated the sequences. This is what I got:

First thing to note is that the PRG string corresponds to the PRG GFA, so no bugs there. I also excluded the flanking nodes of this site in the GFA, but they are basically flanked by two long nodes. If we take a look at the MSAs, we have 5 different strings (sorry that gcggca is duplicated - their flanking regions have a difference). Each of these 5 different strings correspond to one of the 5 different paths in the GFA. So, all alleles in the MSA are spelled in the GFA, and the GFA does not spell anything else that is not in the MSA.

So my conclusion is that the PRG string corresponds to the PRG GFA which corresponds to the MSA, i.e. the PRG is correctly representing the information we have in the MSA. Thus I don't think this is a make_prg issue, but rather a pandora issue. However, before proceeding the investigation on why this issue is potentially happening in pandora, I will wait for you to read this and agree or disagree that make_prg is building the correct PRG for this case.

The PRG string is technically representing this correctly, but my issue is I think the representation is somewhat flawed. With a minimum match length of 5, we should basically have a single site with 5 different alleles for this no? I don't get why the 1044 G A site is separate, it just complicates a relatively simple scenario.

The PRG string is technically representing this correctly, but my issue is I think the representation is somewhat flawed.

I agree, although the graph is correct, I think I found a bug in make_prg code that makes it produce a flawed representation.

With a minimum match length of 5, we should basically have a single site with 5 different alleles for this no?

From my debugging, this should not be the reason. When we reach this point in the recursion, with a minimum match length of 5 or even 6, we will want to cluster these sequences and recursively make prg on the subclusters because: 1) all sequences are large enough to cluster: https://github.com/iqbal-lab-org/make_prg/blob/c2c7ff0f40dabc5034e2bfad7d6c2229e5649b09/make_prg/from_msa/cluster_sequences.py#L229-L231 2) we have 5 different sequences to cluster, not two or less that we can skip clustering: https://github.com/iqbal-lab-org/make_prg/blob/c2c7ff0f40dabc5034e2bfad7d6c2229e5649b09/make_prg/from_msa/cluster_sequences.py#L238

In order to cluster then we get all distinct kmers, describe each sequence as a count matrix of these distinct kmers and call kmeans clustering, etc. With minimum match length of 6, we will have 5 different 6-mers, each sequence being one different 6-mer, and they will all be clustered into singleton clusters, which will make us list all 5 alleles directly, as you'd want the description to be.

With minimum match length of 5, however, the situation is a bit more complicated, as the alleles share kmers and then we might get two or more clusters. However, there is a small bug with the function that decides if we should cluster or not: https://github.com/iqbal-lab-org/make_prg/blob/c2c7ff0f40dabc5034e2bfad7d6c2229e5649b09/make_prg/from_msa/cluster_sequences.py#L107-L111

Basically, this function check if all clusters are one-ref like, i.e. if we can get a consensus sequence c for each cluster such that every sequence from that cluster has at most 20% hamming distance from c. If all clusters are one-ref like, then we can stop clustering, we found a good-enough clustering. For the example in this issue, if we take ccggca as the consensus sequence, then all alleles of this subMSA have a hamming distance of 1 to the consensus sequence, making the cluster a one-ref like, and thus not needing to cluster further, and we would then simply list all alleles. Unfortunately ccggca is not the chosen consensus sequence. We instead choose gcggca simply because we choose the consensus sequences on a non-deduplicated MSA, and this sequence is present much more frequently than the other:

Counter({'GCGGCA': 116, 'CCGGCA': 32, 'ACGGCA': 1, 'CAGGCA': 1, 'CCGGCG': 1})

With gcggca as consensus sequence, then ccggcg and caggca have a hamming distance of 2 to the consensus, making us wanting to cluster further. However, clearly ccggca is a better consensus sequence in this case, as it has hamming distance of 1 to all sequences in the MSA. So an easy fix would be to simply deduplicate the sequences when finding this consensus sequence: https://github.com/iqbal-lab-org/make_prg/blob/c2c7ff0f40dabc5034e2bfad7d6c2229e5649b09/make_prg/from_msa/cluster_sequences.py#L101

However, I think that, in general, whenever we are working with a (sub-)MSA, we can safely deduplicate the alleles. This would not only solve this issue, but also any other future issue with duplicated alleles, and will also result in a much better performance, as the gene MSAs in practice have very similar regions. I have just implemented this in this branch (see https://github.com/iqbal-lab-org/make_prg/commit/71b90f39e6e42e51e08e7d853c4ac11faa743336), and tested on this katG example and it worked. However, this minimal fix is all I could do for now, I will be able to go through this code in more details and add tests just later this week (34 out of 481 tests are failing in this branch, but I think is because we are not creating the exact same PRGs are previously - tests need update, not because this is bugged)... But it would be nice if you could test this fix, and if it would solve your missing-calls issue in drprg

Thanks for the detailed response. All of this makes sense. Deduplicating the alleles does seem like the better way to approach things I think.

If you tested it out on that katG example then I think that's fine. I'll have a dig through my results and see if I can find any other cases of this and test out your branch. Thanks a lot Leandro!

Hmmm, so this has popped it's head up a number of times now in pncA and rpoB and gid. Any idea when there might be a binary available with your fix @leoisl so I can test it out?

Hey @mbhall88 , you can download the binary here: https://www.dropbox.com/s/nl5q3zvmoaj86is/make_prg_0.4.1?dl=1 . Note that this is only the binary with my changes in this branch to try to fix this issue. I haven't had time to make unit tests and make this a stable version and properly release it, so be aware these changes are still untested... but I guess you will get a feeling if this is working given that you have examples where this issue happens and you can do a sort of regression test looking at the precision-recall of drprg using this new make-prg version.

Sorry for not being able to release this, all my time is being devoted to roundhound as we have some 3 deadlines coming up, with the earliest one being in 7 days, and there are still many things to do... OFC I can reprioritise things WRT what @iqbal-lab thinks should be the priority

I've tested that out on two samples and I still get the same problem.

Here is one of the examples that has a POS duplication at rpoB POS 1389 in /hps/nobackup/iqbal/mbhall/drprg/tmp/predict-ERR551578/ERR551578.drprg.bcf (the second of those positions should call the alternate allele but it can't because the first position is ref, even though the alternate call on the second position would still be compatible with the ref call on the first position)

make_prg from_msa -t 2 -L 5 -N 5 -v --force -o /hps/nobackup/iqbal/mbhall/drprg/tmp/predict-ERR551578/updated -i /hps/nobackup/iqbal/mbhall/drprg/tmp/predict-ERR551578/update_msas

iqbal-lab-org / make_prg