iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License
107 stars 14 forks source link

Make release for racon denovo branch #310

Closed leoisl closed 1 year ago

leoisl commented 1 year ago

Merged into main, but still need to do some small fixes and make the release

mbhall88 commented 1 year ago

What small fixes need to be done?

iqbal-lab commented 1 year ago

Wasn't it the changelog and examples

leoisl commented 1 year ago

yes, + precompiled binaries, conda recipe update, and the release itself. Conda might auto update seeing we have a new release, but I am not sure the auto update will work. I would actually prefer not to have this version on conda, as it is not stable enough yet I guess

mbhall88 commented 1 year ago

Conda won't trigger on pre-releases.

leoisl commented 1 year ago

Cool!

leoisl commented 1 year ago

I've found an issue when running with the tiny example we have for this release. Not an issue with the changes in this release I believe, so I think we can proceed with the release. Description of the issue follows:

The tiny example describes new alleles that pandora denovo should find, and one sample should genotype to one allele, and the other sample to the new allele. For the example gene GC00010897, the new allele is a T in position 44, and this is what we should find.

GC00010897  44  .   C   T   .   .   SVTYPE=SNP;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF  1:0,11:0,16:0,11:0,16:0,23:0,32:1,0:-220.34,-8.03511:212.304    0:22,0:18,0:22,0:18,0:44,0:37,0:0,1:-2.87264,-270.207:267.334

New pandora version finds this:

GC00010897  1   .   ATGCAGATACGTGAACAGGGCCGCAAAATTCAGTGCATCCGCACCGTGTACGACAAGGCCATTGGCCGGGGTCGGCAGACGGTCATTGCCACACTGGCCCGCTATACGAC  ATGCAGATACGTGAACAGGGCCGCAAAATTCAGTGCATCCGCATCGTGTACGACAAGGCCATTGGCCGGGGTCGGCAGACGGTCATTGCCACACTGGCCCGCTATACGAC  .   .   VC=PH_SNPs;GRAPHTYPE=SIMPLE GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF  1:19,20:21,23:26,26:28,27:331,369:372,422:0.235294,0.111111:-221.046,-198.019:23.0274   0:30,26:27,23:32,32:30,29:515,471:465,428:0,0.166667:-228.672,-276.438:47.766

Notes:

  1. Both records describe the same variant, C -> T at pos 44, but the new version outputs this as a 110-bp PH_SNP, where in truth is actually just a single SNP. So the new version is not being as compact as it should;
  2. Although both samples in both versions agree on their genotyping, in the new version we lowered the confidence by an order of magnitude (212 and 267 GT conf for the previous version, 23 and 47 GT conf for the new version). This will probably change genotyping if downstream tools filter on GT Conf.

For some reason, on the previous version, we just have a 1-bp bubble describing the SNP to be genotyped in the PRG string, flanked by long linear sequences:

ATGCAGATACGTGAACAGGGCCGCAAAATTCAGTGCATCCGCA 5 C 6 T 5 CGTGTACGACAAGGCCATTGGCCGGGGTCGGCAGACGGTCATTGCCACACTGGCCCGCTATACGAC

In the new version, we have a long 110-bp bubble:

 5 TTeGAGTAAAACAATCCCCCGCGCTTATATAAGCGCGTTGATATTTTTAATTATTAACAAGCAACATCATGCTAATACAGACATACAAGGAGATCATCTCTCTTTGCCTGTTTTTTATTATTTCAGGAGTGTAAACACATTTTCCG 6 TTGAGTAAAACAATCCCCCGCGCTTATATAAGCGCGTTGATATTTTTAGTTATTAACAAGCAACATCATGCTAATACAGACATACAAGGAGATCATCTCTCTTTGCCTGTTTTTTATTATTTCAGGAGTGTAAACACATTTTCCG 5 

make_prg versions haven't changed (AFAIK), and the discovered pandora denovo alleles for this gene are exactly the same. Although the graphs are conceptually the same, i.e. represent the same information, one allows for a very compact description of the variants, while the other not. However, coupled with an issue on calculating likelihoods of PH_SNPs in pandora, we could actually get different results. From this execution, there are two tasks we should carry out:

  1. Fix the issue of GT conf decreasing the longer alleles are. What happens in this example is that, when genotyping, the coverage that matters, which is the coverage on position 44 where we actually have the SNP, gets averaged together with the coverage of the 109 other identical bps, i.e. this SNP signal gets diluted. One fix for this would be to post-process the variant and break it into smaller, more compact ones. In this example, we should trim off the 109 identical bps on both alleles, and keep just the SNP itself. This would allow us to genotype the SNP based just on the coverage of the bases that differ, so the coverage of the 109 identical bps on both alleles would not get on the way. This also happens with indels and complex variants, although these (especially the latter) might be harder to break;
  2. Debug make_prg and understand why it created different graphs from presumably same input. From make_prg perspective, however, this is not a bug - both graphs represent the same information. However, we can miss variants due to this different representation of the PRG and the issue described in (1).

I believe this is not an issue to drprg though, as it does not use make_prg update.

Did I miss some task that we should do?

Do you agree we can release even with this known issue?

Which would be the path you'd recommend to solve or investigate this?

mbhall88 commented 1 year ago

This has to be a make prg issue right? Given the discovered allele was a single base? Might be a bug/regression in update? I don't really know where you v1.0.0 of make_prg is coming from though. I did see some strange allele additions back when I was using update in drprg remember - but I'm using this version of make prg https://github.com/mbhall88/drprg/blob/533f7b36084dbb269d511e9146f5cadae021a7b1/justfile#L7. It surely can't be a pandora issue as pandora doesn't build PRGs...

leoisl commented 1 year ago

This has to be a make prg issue right?

I think so...

I don't really know where you v1.0.0 of make_prg is coming from though.

Neither I at this point. I think from https://github.com/leoisl/make_prg/tree/update_1_0_0_pre_release , but I am not sure. Anyway both mine and your version are outdated. make_prg is partially merged, we are working with 2 different versions that are not the most updated one.

This is too messy to manage and really an issue. As I de-prioritised make_prg due to new projects with higher priorities (tbpore, roundhound, mof and now pandora), I am re-prioritising make_prg now and just resuming other tasks once we have make_prg merged and released. Otherwise this will become even messier and the cost to fix will be higher.

leoisl commented 1 year ago

FYI my container refer to this commit: https://github.com/leoisl/make_prg/commit/b064dcdf4da72df3228cbc993018ac50da6105f0 . But this does not matter much, as we will be releasing new make_prg version in the next couple of days, so we will all sync to use the new version

iqbal-lab commented 1 year ago

BTW for this https://github.com/rmcolq/pandora/issues/310#issuecomment-1322417386 In the second enumeration, item 1, where you talk about the SNP signal being diluted , that's why Brice has his fancy model for gramtools which handles this. Porting it to Pandora is on the road map somewhere.

I'd ignore the gcp issue , the real question is why a single snp is suddenly becoming a long thing

leoisl commented 1 year ago

Done