mbhall88 / drprg

Drug Resistance Prediction with Reference Graphs
https://mbh.sh/drprg/
MIT License

Switch to pandora racon denovo #17

Closed mbhall88 closed 1 year ago

mbhall88 commented 1 year ago

The pandora discover process will now use the pandora version from https://github.com/rmcolq/pandora/pull/299

Before merging I will add updated results to see the impact.

Before this update, the latest sensitivity/specificity results can be seen below

Illumina

[sensitivity/specificity plot]

Nanopore

[sensitivity/specificity plot]

codecov-commenter commented 1 year ago

Codecov Report

Merging #17 (22808aa) into main (a51125d) will decrease coverage by 2.45%. The diff coverage is 51.45%.

@@            Coverage Diff             @@
##             main      #17      +/-   ##
==========================================
- Coverage   90.50%   88.04%   -2.46%     
==========================================
  Files          11       12       +1     
  Lines        4897     5196     +299     
==========================================
+ Hits         4432     4575     +143     
- Misses        465      621     +156     
Impacted Files    Coverage Δ
src/config.rs       0.00% <0.00%>   (ø)
src/main.rs         0.00% <ø>       (ø)
src/lib.rs         76.70% <37.02%>  (-6.02%) ↓
src/predict.rs     85.13% <68.00%>  (-0.07%) ↓
src/builder.rs     83.64% <75.00%>  (-0.82%) ↓
src/panel.rs       98.60% <100.00%> (ø)
tests/main.rs     100.00% <100.00%> (ø)


mbhall88 commented 1 year ago

I have regenerated the sensitivity and specificity plots with this branch. See below for the updated plots.


Illumina

[updated sensitivity/specificity plot]

We have the following changes in results:

Nanopore

[updated sensitivity/specificity plot]


Illumina results are great, but I need to figure out what the source of the FP problem is in the nanopore data before merging this.

leoisl commented 1 year ago

Illumina results are great, but I need to figure out what the source of the FP problem is in the nanopore data before merging this.

I am not even sure the Illumina results are great: looking at the plots, with 5 fewer FNs but 1 extra FP for INH, it seems to me I'd prefer having the 1 fewer FP.

Anyway, the nanopore results do look bad. And looking at this PR's changes, it seems the only change is pandora denovo -> pandora racon, i.e. pandora racon is causing this effect. Although the results are bad, it is good that we have a very concrete example of where we do badly, so we can debug and improve. The fact that we do better (or at least similarly) on Illumina might hint that we are being too lenient when accepting a difference as truth when running pandora racon on ONT data. We might need stricter filters. The only thing we actually do differently is run minimap2 with the short-read (sr) or map-ont preset depending on whether Illumina or ONT reads are given. We don't change any Racon parameters, as their CLI help doesn't hint that anything actually has to be changed. It might also be the case that we interrupt Racon too early: we only run it for a maximum of 10 rounds, and it might need 20 or 30 rounds to correct this ONT data...
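For intuition, here is a rough Python sketch of the iteration being discussed (pandora's real implementation is C++, in src/denovo_discovery/racon.cpp; the paths, round limit, and convergence check below are illustrative only):

import subprocess
from pathlib import Path

def polish_with_racon(target_fa: str, reads_fq: str, is_ont: bool, max_rounds: int = 10) -> str:
    # choose the minimap2 preset based on the read technology, as described above
    preset = "map-ont" if is_ont else "sr"
    current = target_fa
    for i in range(max_rounds):
        paf, polished = f"round{i}.paf", f"round{i}.polished.fa"
        with open(paf, "w") as out:  # map the reads to the current consensus
            subprocess.run(["minimap2", "-x", preset, current, reads_fq], stdout=out, check=True)
        with open(polished, "w") as out:  # racon usage: <sequences> <overlaps> <target>
            subprocess.run(["racon", reads_fq, paf, current], stdout=out, check=True)
        if Path(polished).read_text() == Path(current).read_text():
            break  # stop early if racon made no further changes
        current = polished
    return current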

I think one way to go about this debugging is to take one or two genes that account for most of the FPs, e.g. INH (3 fewer FNs and 57 extra FPs for INH), and I am hoping you have truth sequences for these genes. We run pandora discover with the -K flag so we know exactly the reads and the ML sequence we give to Racon, as well as the number of rounds and the final corrected sequence we got. Then we can figure out how to change the minimap2/racon parameters to remove these FPs. I am happy to do that if you want; I just need the genes, their truth sequences, the reads and the PRGs.

iqbal-lab commented 1 year ago

@leoisl - the Illumina results are definitely acceptable. For nanopore, the obvious questions are:

With racon on nanopore for covid, demanding >=50% of reads agree with the racon consensus gave excellent results, but we still needed to basically ignore homopolymer calls.

leoisl commented 1 year ago
  • has the racon-window-size been changed accidentally? We know it must be == the locus length

Window length is locus length + 100 (see https://github.com/leoisl/pandora/blob/e6fe86c975c2a813129e9edcab39a5320ae6de9a/src/denovo_discovery/racon.cpp#L17, Racon::racon_window_padding == 100). I think it has to be at least the locus length; I add some padding just to be safe. I am almost sure I copied this, including the padding, from one of @martinghunt's repos. Could you please confirm?

With racon on nanopore for covid, demanding >=50% of reads agree with the racon consensus gave excellent results, but we still needed to basically ignore homopolymer calls.

I am ok with doing a pileup to filter racon corrections. However, we should not need this. We are free to add potentially bad new alleles to the PRG; pandora should then just not genotype towards them. Systematic (e.g. homopolymer) errors might be the exception to this... I also think we need to filter those out. Should I start something in this direction, or wait to confirm in the data that homopolymer errors are a significant issue?

iqbal-lab commented 1 year ago

No, I am not asking you to do this now. I agree we're adding candidates to the graph for later genotyping; I am giving information to @mbhall88, who will be filtering somewhere in drprg.

martinghunt commented 1 year ago

I am almost sure I copied this from @martinghunt from one of his repos, including the padding, could you please confirm?

Yes. I added some extra in case the sequence got corrected by racon and made longer, so that subsequent iterations would have a long enough window size.

mbhall88 commented 1 year ago

Okay, first, some background. A FP here is a resistance call in a sample where the phenotype is susceptible. We have 4 genes where any frameshift is considered to confer resistance. Those gene/drug pairs are katG/INH (isoniazid), gid/STM (streptomycin), ethA/ETO (ethionamide), and pncA/PZA (pyrazinamide). (Interestingly, we only have a FP issue with 3/4 of those genes/drugs.)

I did not check all FPs, but I checked 10-15 FPs for each drug and there was a consistent theme. All the newly introduced FPs I saw were 1bp deletions in either gid, ethA, or katG.

I use a fraction of read support (FRS) filter in drprg, and this is currently set to 0.51. However, these FPs did not seem to be low-FRS; they were a mixed bag, with some even up around 0.88. Additionally, there does not seem to be a trend with strand bias in these FPs either.
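For context, a minimal sketch of how a fraction-of-read-support value can be derived from pandora's SUM_FWD_COVG/SUM_REV_COVG FORMAT fields (see the VCF records further down this thread); drprg's actual FRS definition may differ in detail:

def frs(gt: int, sum_fwd: list[int], sum_rev: list[int]) -> float:
    """Fraction of the total coverage at a site that supports the called allele."""
    per_allele = [f + r for f, r in zip(sum_fwd, sum_rev)]
    total = sum(per_allele)
    return per_allele[gt] / total if total > 0 else 0.0

# katG 466 CG>C record quoted later in this thread: GT=1, SUM_FWD_COVG=0,2, SUM_REV_COVG=4,25
print(frs(1, [0, 2], [4, 25]))  # ~0.87, which passes the 0.51 threshold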

The nanopore samples I am running on have also been basecalled with a mixture of guppy versions, so they are of varying degrees of accuracy. Interestingly, nearly all of the FPs are from samples in the New York nanopore dataset (as opposed to our head-to-head dataset), and this was basecalled with an older version of guppy (I can't figure out which version). I am almost certain it was before v3.6.0, which is when we saw a big improvement in homopolymer errors.

I have an example of a sample which has a FP in each of those 3 genes, so that is probably a good place to start debugging.

I have created a temp folder with the results and debugging files in /hps/nobackup/iqbal/mbhall/drprg/paper/tmp/SRR12395058

To generate the files in there, I ran

uri=/hps/nobackup/iqbal/mbhall/drprg/paper/.snakemake/singularity/eb17a1832f81b36c31e88e77ed74458c.simg
BINDS="/tmp,$HOME"
BINDS+=",/hps/scratch,/hps/nobackup/iqbal,/nfs/research/zi,$FASTSW_DIR"
singularity exec --contain -B "$BINDS" $uri drprg predict --sample SRR12395058 --verbose --ignore-synonymous  -d 3 -b 0.01 -g 5 -L 20 -K 0.51 -i /hps/nobackup/iqbal/mbhall/drprg/paper/results/filtered/nanopore/PRJNA650381/SAMN15703082/SRR12395058/SRR12395058.filtered.fq.gz -o /hps/nobackup/iqbal/mbhall/drprg/paper/tmp/SRR12395058 -x /hps/nobackup/iqbal/mbhall/drprg/paper/results/drprg/index/w14/k15 -t 4 --debug 

which runs pandora discover with

pandora discover -g 4411532 -v -o /hps/nobackup/iqbal/mbhall/drprg/paper/tmp/SRR12395058/discover -t 4 -w 14 -k 15 -c 10 -K /hps/nobackup/iqbal/mbhall/drprg/paper/results/drprg/index/w14/k15/dr.prg /hps/nobackup/iqbal/mbhall/drprg/paper/tmp/SRR12395058/query.tsv

where the query TSV just points at the input to drprg /hps/nobackup/iqbal/mbhall/drprg/paper/results/filtered/nanopore/PRJNA650381/SAMN15703082/SRR12395058/SRR12395058.filtered.fq.gz

You can find the output from discover in /hps/nobackup/iqbal/mbhall/drprg/paper/tmp/SRR12395058/discover

Two other important files in this output are /hps/nobackup/iqbal/mbhall/drprg/paper/tmp/SRR12395058/SRR12395058.drprg.bcf, which is the filtered VCF drprg makes its predictions from, and /hps/nobackup/iqbal/mbhall/drprg/paper/tmp/SRR12395058/SRR12395058.drprg.json, which is the predictions themselves.

The recurring indels in the problem genes that this sample has are (with reference sequence context in parentheses - 5bp either side of the position):

gid 479 CAATTG>CACTG (AGGACCAATT)
gid 446 CG>C (TCGTGCGGGG)
katG 466 CG>C (GGCGCCGGGG)
ethA 1543 CA>C (TCGCCAAAAG)
ethA 1560 TG>T (CCGGTGGGGG)

You can find the reference sequences in /hps/nobackup/iqbal/mbhall/drprg/paper/results/drprg/index/w14/k15/genes.fa. FYI I have added 100bp flanking to each sequence. The VCF position is w.r.t the padded sequence. So POS 101 is POS 1 in the non-padded version of the gene.
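In code form, the coordinate convention is just (a trivial helper, for illustration only):

PADDING = 100  # each gene in genes.fa carries 100bp of flanking sequence on each side

def padded_to_gene_pos(padded_pos: int) -> int:
    """Convert a VCF POS on the padded sequence to a position on the unpadded gene."""
    return padded_pos - PADDING

assert padded_to_gene_pos(101) == 1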

With racon on nanopore for covid, demanding >=50% of reads agree with the racon consensus gave excellent results, but we still needed to basically ignore homopolymer calls.

@iqbal-lab I'm interested in exactly how this is done. Can you explain in more detail or point me at some code?

leoisl commented 1 year ago

Hey @mbhall88, I got into debugging your run because it gives us exactly what we need for debugging, so it is a great example. After looking at some IGV plots for katG (I chose this as the first gene to be studied), I thought there was a big bug in pandora. In the end, I copied the files from codon to my local machine and ran pandora locally, and I got different results from your run. For example, locally, these are the variants I find for katG:

1 denovo variants for this locus
467 G   

And these are the variants your run found for katG (from /hps/nobackup/iqbal/mbhall/drprg/paper/tmp/SRR12395058/discover/denovo_paths.txt):

9 denovo variants for this locus
184     C       
467     G       
959     GA      AG
963     CCG     
994     G       A
1699    G       
1882    C       
2126    G       A
2410    G       A

i.e. your run found 8 more variants than my run, including the single variant of my run. So, I looked at the SAM file of your run, which includes in the header the command line used to produce it (in /hps/nobackup/iqbal/mbhall/drprg/paper/tmp/SRR12395058/discover/SRR12395058/SRR12395058.filtered.sam):

@PG     ID:pandora      PN:pandora      VN:0.9.2        CL: /hps/nobackup/iqbal/mbhall/drprg/src/ext/pandora discover -g 4411532 -v -o /hps/nobackup/iqbal/mbhall/drprg/paper/tmp/SRR12395058/discover -t 4 -w 14 -k 15 -c 10 -K /hps/nobackup/iqbal/mbhall/drprg/paper/results/drprg/index/w14/k15/dr.prg /hps/nobackup/iqbal/mbhall/drprg/paper/tmp/SRR12395058/query.tsv 

Matching md5sums of the binaries we produced and linked on https://github.com/rmcolq/pandora/pull/299, I found that the pandora executable drprg is using (/hps/nobackup/iqbal/mbhall/drprg/src/ext/pandora) has the same md5sum as this binary: https://github.com/rmcolq/pandora/pull/299#issuecomment-1286309891, from 19 days ago. Unfortunately this binary is missing a bunch of fixes, especially these two big bugfixes: https://github.com/rmcolq/pandora/pull/299#issuecomment-1296898504 . I think we might still have some FPs due to the frameshift the latest pandora version finds for katG, but I think there will be far fewer. Could you please retry drprg with the latest pandora binary in that PR (https://github.com/rmcolq/pandora/pull/299#issuecomment-1298300884) and see how the results change?

mbhall88 commented 1 year ago

Sorry, I had run the local version of drprg to test out a small change in the logging. I've rerun with the container (as in the example in my previous comment). There is indeed only the one variant in katG now (still the same FPs).

leoisl commented 1 year ago

So just to confirm what I understood: these results are still the same even with the latest pandora binary:

6 less FNs and 26 extra FPs for ETO
3 less FNs and 57 extra FPs for INH
1 extra FP OFX
1 less FN and 1 extra FP PZA
3 less FNs and 1 less FP for RIF
3 less FNs and 21 extra FPs for STM

and you have rerun drprg in such a way that the paths in this comment (https://github.com/mbhall88/drprg/pull/17#issuecomment-1306452216) now refer to this new run with the latest pandora?

iqbal-lab commented 1 year ago

One extra datum. Nearly all FPs are in samples which were basecalled with an old version of Guppy, before v3 (or some specific version, I forget), and we know (from other work of Michael's) that prior to that version the data was a lot worse.

mbhall88 commented 1 year ago

@leoisl yes, the plots are from the version of pandora you linked in https://github.com/rmcolq/pandora/pull/299#issuecomment-1298300884. They were all run in the container docker://quay.io/mbhall88/drprg:1b091eb, which is locally at /hps/nobackup/iqbal/mbhall/drprg/paper/.snakemake/singularity/eb17a1832f81b36c31e88e77ed74458c.simg (it is the one in the comment where I give the example command).

Yesterday I made a small change to drprg to add logging of the exact command run for each external tool, and I had just tested it out on that one sample locally; that's why those files were different. I have regenerated all the files with the version in the container and confirmed that the md5sum of the pandora binary in the container matches the glibc one you linked.

mbhall88 commented 1 year ago

One extra datum. Nearly all FPs are in samples which were basecalled with an old version of Guppy, before v3 (or some specific version, I forget), and we know (from other work of Michael's) that prior to that version the data was a lot worse.

Correct, we know that from guppy v3.6.0 onwards the homopolymer deletions improved a lot. However, pandora/drprg does a lot worse than tbprofiler and mykrobe here; it would be nice if we could do at least as well as them.

iqbal-lab commented 1 year ago

Hmm. Mykrobe uses a higher k, which might mean fewer repeated wrong kmers?

mbhall88 commented 1 year ago

Sorry, baby brain. Mykrobe does also call most of these FPs, but tbprofiler doesn't.

leoisl commented 1 year ago

I think these are not a pandora racon denovo issue, but a pandora genotyping issue. What I mean is that, for these 1-bp deletions to cause FPs in the evaluation:

gid 479 CAATTG>CACTG (AGGACCAATT)
gid 446 CG>C (TCGTGCGGGG)
katG 466 CG>C (GGCGCCGGGG)
ethA 1543 CA>C (TCGCCAAAAG)
ethA 1560 TG>T (CCGGTGGGGG)

not only were they called by racon, but they were also called by pandora later when genotyping. pandora racon denovo should act just like pandora dbg denovo: it should find as many new variants as possible, not necessarily correct variants. I am now even thinking of making the racon parameters less strict. Both denovo methods are expected to find wrong variants, and that is ok, but pandora should not genotype towards a wrong one. Put another way, if for some reason these 1-bp deletions were in the original MSA/PRG, we would call them with pandora, even without running discovery.

As such, when I go to debug this tomorrow, I will treat it as a pandora genotyping issue.

mbhall88 commented 1 year ago

You are indeed correct. It might just be that this is how it is and we simply can't deal with nanopore homopolymer indels. Don't spend all day on it.

leoisl commented 1 year ago

A quick look at the VCF record for the katG variant call shows that the data says this variant should indeed be called:

katG    466     .       CG      C       .       .       VC=INDEL;GRAPHTYPE=SIMPLE       GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      1:0,1:2,12:0,1:2,12:0,2:4,25:1,0:-133.183,-23.8071:109.376

But I also note that the calls come predominantly from the reverse-complement reads: for the called allele, we have a mean of 1 read on the fwd strand and 12 on the rev strand. Should this be flagged as a strand bias error in drprg and be filtered out? I looked at all the other variants in katG; they have no strand bias, just this one. I didn't look at whether the calls for the other 1-bp deletion errors are also biased towards one strand; I will look at them tomorrow.
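As a minimal sketch (the exact strand-bias definition drprg and tbpore use may differ), using the SUM_FWD_COVG/SUM_REV_COVG values of the called allele in the record above:

def strand_bias_ratio(fwd: int, rev: int) -> float:
    """Coverage on the weaker strand as a fraction of the allele's total coverage."""
    total = fwd + rev
    return min(fwd, rev) / total if total > 0 else 0.0

# called allele (GT=1) of katG 466 CG>C: SUM_FWD_COVG=2, SUM_REV_COVG=25
print(strand_bias_ratio(2, 25))  # ~0.07, consistent with the ratios discussed in the next comment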

mbhall88 commented 1 year ago

Yeah, I looked at the strand bias ratio across about 40 FPs and there wasn't a clear bias. Some had a ratio of ~0.07, but a lot were also up near 0.4. I use a strand bias ratio threshold of 0.01 (as in tbpore).

leoisl commented 1 year ago

You are right, the 4 other calls have no strand bias issue. I looked at them in more detail:

gid     446     .       CG      C,CGG   .       .       VC=INDEL;GRAPHTYPE=SIMPLE       GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      1:21,17,14:23,28,17:21,17,14:23,28,17:42,35,29:46,57,34:0,0,0.5:-353,-348.512,-433.597:4.48739
gid     479     .       CAATTG  CAATT,CACTG     .       .       VC=INDEL;GRAPHTYPE=NESTED       GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF        2:22,13,8:8,0,36:22,13,8:1,0,36:110,26,17:43,0,72:0,0.5,0:-266.487,-375.379,-201.029:65.4577
ethA    1543    .       CA      C       .       .       VC=INDEL;GRAPHTYPE=SIMPLE       GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      1:7,17:7,8:6,17:7,8:37,17:37,8:0,0:-128.676,-70.254:58.4222
ethA    1560    .       TG      T       .       .       VC=INDEL;GRAPHTYPE=SIMPLE       GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      1:8,13:5,7:4,13:5,7:25,26:15,14:0.333333,0:-120.033,-68.4252:51.6082

The two gid variants are the two with the lowest GT_CONF (4 and 65) among the 24 gid calls; only 4 of those calls have GT_CONF < 100. The two ethA variants are among the 3 with the lowest GT_CONF (58 and 51) among the 27 ethA calls; only 4 of those calls have GT_CONF < 100. I think you likely did a parameter exploration that included some GT_CONF thresholding, so I am not sure simply filtering for GT_CONF >= 100 can give good results; it might filter out lots of other calls. Might be worth a try though, if you haven't tried that. Another approach is to treat indels differently: as we know our data struggles with indels, you might want one GT_CONF threshold for indels and another for the other types of variants.

The main issue with GT_CONF thresholding is that I think it is not normalised, so if your samples have different coverages, a flat value won't fit all of them. This is where our GT_CONF percentile work would be great, but for some reason it worked well for gramtools and was terrible for pandora... Maybe it is time to revisit that, or maybe we can find a way of post-filtering the pandora VCF based on GT_CONF as it is now... or maybe you already tried a bunch of things and filterings, and nothing really improved...
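To make the idea concrete, a sketch only (assuming GT_CONF values are collected per sample from the FORMAT field of the pandora VCF; this is not the existing gramtools/pandora percentile code):

import statistics

def gt_conf_percentile_threshold(gt_confs: list[float], percentile: int) -> float:
    """GT_CONF value below which roughly `percentile` percent of this sample's calls fall."""
    cuts = statistics.quantiles(gt_confs, n=100, method="inclusive")
    return cuts[min(98, max(0, percentile - 1))]

def keep_call(gt_conf: float, is_indel: bool, thresholds: dict) -> bool:
    # separate (stricter) threshold for indels, as suggested above
    return gt_conf >= thresholds["indel" if is_indel else "other"]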

mbhall88 commented 1 year ago

I have a GT_CONF filter of 5 in place currently. That was set back when I first developed drprg. As pandora has changed a little since then, I can try increasing it and see what happens. Will run it now and let you know.

mbhall88 commented 1 year ago

Here are the nanopore results with a min. GT_CONF of 100

[nanopore sensitivity/specificity plot with min. GT_CONF 100]

Definitely too harsh

iqbal-lab commented 1 year ago

When I read this

https://github.com/mbhall88/drprg/pull/17#issuecomment-1307975532

my first thought was "why is he calling me 'baby brain'???"

leoisl commented 1 year ago

Conclusion on katG variant

I debugged the katG variant:

katG 466 CG>C (GGCGCCGGGG)

and my conclusion is that the data supports this 1-bp deletion, although the confidence of this call should be low. pandora calls this variant with a relatively high confidence (GT_CONF=109):

katG    466     .       CG      C       .       .       VC=INDEL;GRAPHTYPE=SIMPLE       GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      1:0,1:2,12:0,1:2,12:0,2:4,25:1,0:-133.183,-23.8071:109.376

So I think this is a sort of genotyping issue from pandora. We should still call this 1-bp deletion, but with low confidence.

Explanation

1. Viewing what pandora sees in this region

I got the pandora SAM file, extracted the reads that mapped to the katG PRG, remapped them to the katG sequence with minimap2 (in order to obtain base-level mapping), and viewed the mapping with IGV. The red rectangle denotes the region of the variant, and the blue rectangle the counts:

[IGV screenshot of the pandora-mapped reads remapped to katG]

So we can see that we have 49 reads calling a deletion at this locus, while 41 reads call G. So the data pandora sees backs up the call for the deletion. However, the IGV counts and the pandora counts in the VCF diverge significantly; I would like to explore this further...
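For reference, the extraction step above could be reproduced roughly like this (a sketch; the file paths are hypothetical placeholders, and it assumes the locus name appears as the RNAME in pandora's SAM output):

def extract_locus_reads(sam_path: str, locus: str, out_fasta: str) -> int:
    """Write the reads mapped to `locus` in a SAM file to FASTA, for remapping with minimap2."""
    n = 0
    with open(sam_path) as sam, open(out_fasta, "w") as out:
        for line in sam:
            if line.startswith("@"):  # skip SAM header lines
                continue
            fields = line.rstrip("\n").split("\t")
            name, rname, seq = fields[0], fields[2], fields[9]
            if rname == locus and seq != "*":  # keep records mapped to the locus that carry a sequence
                out.write(f">{name}\n{seq}\n")
                n += 1
    return n

extract_locus_reads("SRR12395058.filtered.sam", "katG", "katG_reads.fa")
# then e.g.: minimap2 -ax map-ont katG.fa katG_reads.fa | samtools sort -o katG.bam (and index for IGV)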

2. Viewing what minimap2 sees in this region

It could be that pandora is not mapping reads well to katG, so the previous exploration could be biased. So I took all the reads from this sample, mapped them to the katG sequence with minimap2, and viewed the mapping with IGV. The red rectangle denotes the region of the variant, and the blue rectangle the counts:

[IGV screenshot of all the sample's reads mapped to katG with minimap2]

minimap2 says 57 reads call the deletion, while 46 reads call G. This is similar to what pandora says (49 deletion, 41 G), but pandora maps slightly fewer reads. These are excellent results to me: pandora solves a harder problem than minimap2, and also implements a quasi-mapping, which might lose some reads and not recover them later. So the data minimap2 sees also backs up the call for the deletion.

3. Conclusion and further debugging

  1. We should call the deletion here; it is what the data backs up. However, this call should have low confidence, given that the difference between the number of reads calling the deletion and the number calling G is very small;
  2. I am happy with the IGV plots from both pandora and minimap2 mappings - they agree and it seems we don't have a bug with pandora mapping in this example;
  3. I am worried that the pandora counts in the VCF for this variant are very different from what we see in the IGV plots. This could be a bug and has to be explored further. This actually seems to me to be a serious issue, and fixing it could improve our genotyping a lot. I will leave it to @iqbal-lab to decide the priority of this, given all the things we currently have to address in pandora;
  4. This deletion happens in a homopolymer region:
    GCGCCGGGGG
     ^ this first G is what we call for deletion

    which is likely due to the data quality not being great. So I think we have two other solutions for this that lie outside of pandora:

    • Remove this lower quality data; or
    • Add a flag in drprg to better deal with this type of data. We would then require a very high GT_CONF to call either homopolymer indels or frameshifting indels (not sure what would work better).
  5. I think the proper solution to this would still be to debug why the counts in the IGV plots differ so much from the pandora VCF counts, and to implement a GT_CONF percentile, so we can make GT_CONF filtering possible regardless of the sample's coverage.
iqbal-lab commented 1 year ago

cc @martinghunt as this kind of thing is his jam at the moment

mbhall88 commented 1 year ago

Wow. Super interesting debugging @leoisl. Awesome job!

Could the difference in the VCF be due to the fact that pandora sees "depth" in terms of median/mean minimizer counts, while the IGV counts are a per-base count? i.e. there might be 41 G's in IGV, but not 41 exact matches of the minimizer the G occurs in?

iqbal-lab commented 1 year ago

Makes sense to me

leoisl commented 1 year ago

Could the difference in the VCF be due to the fact that pandora sees "depth" in terms of median/mean minimizer counts, while the IGV counts are a per-base count? i.e. there might be 41 G's in IGV, but not 41 exact matches of the minimizer the G occurs in?

Yeah, very nice observation, that is the most important question at this stage, I guess! As with Zam, what you said makes sense to me. I will continue the debugging tomorrow and try to figure that out. I am hoping it is another issue. If it is indeed the case that the minimiser-level count is much lower than the base-level count, this means that we could get much better genotyping with base-level mapping. Another question is whether this issue remains on the more recent, more accurate ONT data, or is something mostly prevalent on the less accurate data...

mbhall88 commented 1 year ago

Another question is whether this issue remains on the more recent, more accurate ONT data, or is something mostly prevalent on the less accurate data...

It's almost exclusively the old nanopore data. I had the same problem with mykrobe in the head-to-head paper; then I rebasecalled all the data with guppy v5.0.16 and the problem disappeared.

iqbal-lab commented 1 year ago

Yeah, I really think we should drop it and accept it won't work with that old data.

mbhall88 commented 1 year ago

Yeah, I really think we should drop it and accept it won't work with that old data.

Agreed, merging this PR and will exclude old nanopore data from the paper.

iqbal-lab commented 1 year ago

@leoisl, re "If it is indeed the case that the minimiser-level count is much lower than the base-level count, this means that we could get much better genotyping with base-level mapping"

leoisl commented 1 year ago

Alright! We can keep this in mind, @iqbal-lab, if we face another genotyping issue in the future. Unless you want to discuss or work on this issue further, I will focus on the pandora/RH large index issue, which is more pressing.