Open mbhall88 opened 1 year ago
Thanks for the bug description, it is complete, I don't think I need any more info.
This is just for a better understanding. I can see local genotyping was used here, so in the first case:
Here is the region from https://github.com/leoisl/make_prg/releases/tag/v0.3.0 (this leads to a correct ALT call in POS 513)
gid 505 f3d1ef78 G T . PASS VC=SNP;GRAPHTYPE=NESTED;PDP=1,0 GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 0:12,0:7,0:12,0:7,0:25,0:15,0:0.5,1:-12.421,-127.498:115.077 gid 505 2de45418 GTCACGGGC TTGGGCGGCAGCGACGCTGC . PASS VC=COMPLEX;GRAPHTYPE=NESTED;PDP=0.722222,0.277778;VARID=gid_A138V,gid_R137W,gid_R137P,gid_A138T;PREDICT=F,F,F,F GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF .:8,3:5,2:0,0:0,0:25,23:15,13:0.666667,0.833333:-39.9668,-86.3427:46.3759 gid 513 aac746d5 C T . PASS VC=SNP;GRAPHTYPE=NESTED;PDP=0,1;VARID=gid_A138T,gid_A138V;PREDICT=.,R GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 1:0,23:0,13:0,23:0,13:0,23:0,13:1,0:-205.786,-7.87333:197.913
What I think happened is:
513T
;505 GTCACGGGC TTGGGCGGCAGCGACGCTGC
incompatible, so lets genotype .
for it;505G
;For the 0.4.0 version:
gid 505 8b2d50ba G T . PASS VC=SNP;GRAPHTYPE=NESTED;PDP=1,0 GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 0:12,0:7,0:12,0:7,0:25,0:15,0:0.5,1:-12.421,-127.498:115.077 gid 505 ec5fd1a2 GTCACGGGC TTGGGCGGCAGCGACGCTGC . PASS VC=COMPLEX;GRAPHTYPE=NESTED;PDP=0.722222,0.277778;VARID=gid_A138V,gid_A138T,gid_R137W,gid_R137P;PREDICT=.,.,.,. GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 0:8,3:5,2:0,0:0,0:25,23:15,13:0.666667,0.833333:-39.9668,-86.3427:46.3759 gid 511 bad07f5f GGC GGT . frs VC=PH_SNPs;GRAPHTYPE=NESTED;PDP=0.342105,0.657895;VARID=gid_R137W,gid_A138V,gid_R137P,gid_A138T;PREDICT=.,.,.,. GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 0:8,16:5,9:0,23:0,13:25,48:15,28:0.666667,0.333333:-132.07,-69.6442:62.4261
505G
;505GTCACGGGC
;511GGC
, although 511GGT
has higher likelihood.I actually think the issue is on pandora
. The main difference is that in v0.0.3
we have 513C>T
and in v0.0.4
we have 511GGC>GGT
. We have gaps in the bases 511GG
, which makes the variant 513C>T
not the highest-likely one anymore and does not drive the genotyping. I think the objective of make_prg
is to describe the variants, and from a descriptive POV, 513C>T
is equivalent to 511GGC>GGT
, i.e. the only difference is 513C>T
. I think it is actually pandora
responsibility to just genotype the differences between alleles. Common bases should not interfere on the likelihood calculation.
For this particular case, yes, I think genotyping only the differences would solve it. But with local genotyping, we could easily make a similar example where we would still have an issue. Let's say that 505G
has the highest likelihood, forcing 505GTCACGGGC
and then 513C
, whereas it should be 513T
. I think one common issue in all these genotyping is that a choice on 505G>T
is forcing a choice on 505GTCACGGGC>TTGGGCGGCAGCGACGCTGC
. 505GTCACGGGC>TTGGGCGGCAGCGACGCTGC
is a totally unreliable call: for the first allele, we have 66.6% of gaps (just 3 out of 9 bases have coverage), and for the second allele we have 83.3% of gaps (just 3 out of 20 bases have coverage). I think we should nullify any call on this variant. We should have a parameter --min-gaps
, defaulting to 0.33 or 0.5, such that we should filter any calls with gaps higher than this. This would mean that in both cases we would filter out the 505GTCACGGGC>TTGGGCGGCAGCGACGCTGC
variant, and then we would just genotype 505G>T
and 513C>T
(this last one with the feature of genotyping the differences only between alleles)
If we look at this record:
gid 505 f3d1ef78 G T . PASS VC=SNP;GRAPHTYPE=NESTED;PDP=1,0 GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 0:12,0:7,0:12,0:7,0:25,0:15,0:0.5,1:-12.421,-127.498:115.077
The gaps for the first allele is 0.5, which I think is impossible given that we just have 1 base here. We should fix this.
I am not sure about the state of local genotyping. When we saw that global genotyping was better for E. coli, we deprecated local genotyping. We might have bugs we need to solve, or not... we need to revisit.
Genotyping is a crucial step in pandora
we haven't revisited. We are discovering some bugs, so we need to revisit it, and add more strict testing. But I'll let @iqbal-lab decide the priority of tasks
I would argue this is also a make_prg issue. Shouldn't all three positions be a bubble as they have overlapping positions?
Common bases should not interfere on the likelihood calculation.
But they do because the GAPS influence the likelihood, hence why 513G>T (v0.3.0) seems a better choice than 511GGC>GGT (v0.4.0).
The gaps for the first allele is 0.5, which I think is impossible given that we just have 1 base here.
But the coverage on a base is made up for the mean of minimizers no? As such, it makes total sense as with a kmer size of 15, the coverage on this position is likely influenced by multiple minimzers, with some of those likely coming from the SNP upstream at position 513, thus creating a hole in coverage.
I really don't think pandora's genotyping is too bad here, I just think the PRG is confusing.
I'm seeing these annoying duplications of positions a bit lately. And they're causing me to miss calls in drprg.
A recent example I have. Here is an example of the graph (in VCF format)
katG 1044 ab6329a3 GC AC,CA,CC
This position covers the S315T mutation (allele 3 is most common, but allele 2 is also S315T) in katG, which is SUPER important and common.
I have an example where accession ERR2516571 has this S315T mutation, plus a novel mutation I317V. As this is close, when the new mutation is added, we would expect it to be included in the position above.
This does happen, but something weird also happens
katG 1044 0ffa19b2 G A . ld VC=SNP;GRAPHTYPE=NESTED GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF .:0,0:0,0:0,0:0,0:0,0:0,0:1,1:-92,-92:0
katG 1044 e785a86a GCGGCA CAGGCA,CCGGCA,CCGGCG . ld VC=PH_SNPs;GRAPHTYPE=NESTED;PDP=0,0,0.247059,0.752941 GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 0:0,0,12,35:0,0,9,29:0,0,0,36:0,0,0,29:0,0,72,179:0,0,58,149:1,1,0.666667,0:-483.439,-483.439,-336.376,-102.844:233.533
There is a duplication of the position 1044 with a SNP for some weird reason. That position has no depth, and gets genotyped as null. But the "correct" 1044 position gets genotyped as ref (assumably to make genotypes compatible?). We can see that pandora gets depth on the correct allele (allele 3), but doesn't call that genotype because of the confusing duplicate position.
Here is what de novo finds
1044 G C
1049 A G
@leoisl do you know why make_prg is making these weird duplicated positions?
I don't think this is a make_prg
issue. Will provide data backing this up. So I basically got this site, and put the PRG string, PRG GFA and the MSA together. For the MSA, I did a deduplication, I got some flanks around this region and deduplicated the sequences. This is what I got:
First thing to note is that the PRG string corresponds to the PRG GFA, so no bugs there. I also excluded the flanking nodes of this site in the GFA, but they are basically flanked by two long nodes. If we take a look at the MSAs, we have 5 different strings (sorry that gcggca
is duplicated - their flanking regions have a difference). Each of these 5 different strings correspond to one of the 5 different paths in the GFA. So, all alleles in the MSA are spelled in the GFA, and the GFA does not spell anything else that is not in the MSA.
So my conclusion is that the PRG string corresponds to the PRG GFA which corresponds to the MSA, i.e. the PRG is correctly representing the information we have in the MSA. Thus I don't think this is a make_prg
issue, but rather a pandora
issue. However, before proceeding the investigation on why this issue is potentially happening in pandora
, I will wait for you to read this and agree or disagree that make_prg
is building the correct PRG for this case.
The PRG string is technically representing this correctly, but my issue is I think the representation is somewhat flawed. With a minimum match length of 5, we should basically have a single site with 5 different alleles for this no? I don't get why the 1044 G A site is separate, it just complicates a relatively simple scenario.
The PRG string is technically representing this correctly, but my issue is I think the representation is somewhat flawed.
I agree, although the graph is correct, I think I found a bug in make_prg
code that makes it produce a flawed representation.
With a minimum match length of 5, we should basically have a single site with 5 different alleles for this no?
From my debugging, this should not be the reason. When we reach this point in the recursion, with a minimum match length of 5 or even 6, we will want to cluster these sequences and recursively make prg on the subclusters because: 1) all sequences are large enough to cluster: https://github.com/iqbal-lab-org/make_prg/blob/c2c7ff0f40dabc5034e2bfad7d6c2229e5649b09/make_prg/from_msa/cluster_sequences.py#L229-L231 2) we have 5 different sequences to cluster, not two or less that we can skip clustering: https://github.com/iqbal-lab-org/make_prg/blob/c2c7ff0f40dabc5034e2bfad7d6c2229e5649b09/make_prg/from_msa/cluster_sequences.py#L238
In order to cluster then we get all distinct kmers, describe each sequence as a count matrix of these distinct kmers and call kmeans clustering, etc. With minimum match length of 6, we will have 5 different 6-mers, each sequence being one different 6-mer, and they will all be clustered into singleton clusters, which will make us list all 5 alleles directly, as you'd want the description to be.
With minimum match length of 5, however, the situation is a bit more complicated, as the alleles share kmers and then we might get two or more clusters. However, there is a small bug with the function that decides if we should cluster or not: https://github.com/iqbal-lab-org/make_prg/blob/c2c7ff0f40dabc5034e2bfad7d6c2229e5649b09/make_prg/from_msa/cluster_sequences.py#L107-L111
Basically, this function check if all clusters are one-ref like, i.e. if we can get a consensus sequence c
for each cluster such that every sequence from that cluster has at most 20% hamming distance from c
. If all clusters are one-ref like, then we can stop clustering, we found a good-enough clustering. For the example in this issue, if we take ccggca
as the consensus sequence, then all alleles of this subMSA have a hamming distance of 1 to the consensus sequence, making the cluster a one-ref like, and thus not needing to cluster further, and we would then simply list all alleles. Unfortunately ccggca
is not the chosen consensus sequence. We instead choose gcggca
simply because we choose the consensus sequences on a non-deduplicated MSA, and this sequence is present much more frequently than the other:
Counter({'GCGGCA': 116, 'CCGGCA': 32, 'ACGGCA': 1, 'CAGGCA': 1, 'CCGGCG': 1})
With gcggca
as consensus sequence, then ccggcg
and caggca
have a hamming distance of 2 to the consensus, making us wanting to cluster further. However, clearly ccggca
is a better consensus sequence in this case, as it has hamming distance of 1 to all sequences in the MSA. So an easy fix would be to simply deduplicate the sequences when finding this consensus sequence: https://github.com/iqbal-lab-org/make_prg/blob/c2c7ff0f40dabc5034e2bfad7d6c2229e5649b09/make_prg/from_msa/cluster_sequences.py#L101
However, I think that, in general, whenever we are working with a (sub-)MSA, we can safely deduplicate the alleles. This would not only solve this issue, but also any other future issue with duplicated alleles, and will also result in a much better performance, as the gene MSAs in practice have very similar regions. I have just implemented this in this branch (see https://github.com/iqbal-lab-org/make_prg/commit/71b90f39e6e42e51e08e7d853c4ac11faa743336), and tested on this katG
example and it worked. However, this minimal fix is all I could do for now, I will be able to go through this code in more details and add tests just later this week (34 out of 481 tests are failing in this branch, but I think is because we are not creating the exact same PRGs are previously - tests need update, not because this is bugged)... But it would be nice if you could test this fix, and if it would solve your missing-calls issue in drprg
Thanks for the detailed response. All of this makes sense. Deduplicating the alleles does seem like the better way to approach things I think.
If you tested it out on that katG example then I think that's fine. I'll have a dig through my results and see if I can find any other cases of this and test out your branch. Thanks a lot Leandro!
Hmmm, so this has popped it's head up a number of times now in pncA and rpoB and gid. Any idea when there might be a binary available with your fix @leoisl so I can test it out?
Hey @mbhall88 , you can download the binary here: https://www.dropbox.com/s/nl5q3zvmoaj86is/make_prg_0.4.1?dl=1 . Note that this is only the binary with my changes in this branch to try to fix this issue. I haven't had time to make unit tests and make this a stable version and properly release it, so be aware these changes are still untested... but I guess you will get a feeling if this is working given that you have examples where this issue happens and you can do a sort of regression test looking at the precision-recall of drprg using this new make-prg version.
Sorry for not being able to release this, all my time is being devoted to roundhound as we have some 3 deadlines coming up, with the earliest one being in 7 days, and there are still many things to do... OFC I can reprioritise things WRT what @iqbal-lab thinks should be the priority
I've tested that out on two samples and I still get the same problem.
Here is one of the examples that has a POS duplication at rpoB POS 1389 in /hps/nobackup/iqbal/mbhall/drprg/tmp/predict-ERR551578/ERR551578.drprg.bcf
(the second of those positions should call the alternate allele but it can't because the first position is ref, even though the alternate call on the second position would still be compatible with the ref call on the first position)
make_prg from_msa -t 2 -L 5 -N 5 -v --force -o /hps/nobackup/iqbal/mbhall/drprg/tmp/predict-ERR551578/updated -i /hps/nobackup/iqbal/mbhall/drprg/tmp/predict-ERR551578/update_msas
I have recently tested drprg out after upgrading make_prg from https://github.com/leoisl/make_prg/releases/tag/v0.3.0 to the latest v0.4.0
The results were almost the same across ~8500 samples. However, there were 3 that have gone from being TPs to FNs.
It has to do with a region in the gene gid where the positions have weird overlaps that differ between make_prg versions. This is all a bit easier if I put in some examples
Here is the region from https://github.com/leoisl/make_prg/releases/tag/v0.3.0 (this leads to a correct ALT call in POS 513)
However, this is the same region with the latest v0.4.0 make_prg
you can see the POS 511 variant is filtered due to FRS - there is coverage on the ref and alt, where the same variant (POS 513) in the previous version has no coverage on the ref.
The second variant in those examples overlaps both the first and third variants.
This 513C>T variant is discovered by denovo in both cases - i.e. the reference PRG is effectively just the first two variants
In both cases I am using pandora v0.10.0-alpha.0
The make_prg command is:
For the v0.3.0 version (you can find all the drprg/pandora/make_prg files at
/hps/nobackup/iqbal/mbhall/drprg/tmp/predict-ERR2510499
)and for the v0.4.0 version (all files are at
/hps/nobackup/iqbal/mbhall/drprg/paper/results/amr_predictions/drprg/illumina/PRJEB25972/SAMEA1101807/ERR2510499/
)Let me know if you need any other info @leoisl