Closed mbhall88 closed 3 years ago
Sample mada_104 has nanopore coverage of 36.48x
$ fastaq count_sequences /hps/nobackup/research/zi/projects/tech_wars/analysis/pandora_variants/nanopore/genotype/dense/madagascar/mada_104/pandora.consensus.fq.gz
2558
$ fastaq count_sequences /hps/nobackup/research/zi/projects/tech_wars/data/H37Rv_PRG/prgs/reference_loci/loci_reference.fa
2582
pandora missed 24 (0.93%) loci
Note: I looked at a couple of other samples and the finding is much the same
If we take a look at a small cluster of FNs (called FPs in the varifier VCF) we can see that pandora has clearly missed something here. The top two tracks (gray and aqua) are the FP variants. The middle track is the pandora consensus sequences mapped to H37Rv (showing the genes were called by pandora), and the bottom track is the nanopore reads mapped to H37Rv.
I have figured out why we don't find these variants (sort of).
Warning: this is me dumping all of my notes and thoughts from the day so I don't forget them. I've tried to make it clear but apologies if it is confusing.
In the loci IGR:466668-468334 there are 8 variants we do not discover. These variants have the following coordinates and coverage along the max. likelihood path
id | ref_pos | loci_pos | covg |
---|---|---|---|
1 | 467516 | 848 | 0 |
2 | 467526 | 857 | 0 |
3 | 467546 | 877 | 0 |
4 | 467564 | 895 | 0 |
5 | 467585 | 916 | 0 |
6 | 467590 | 921 | 0 |
7 | 467621 | 952 | 0 |
8 | 467638 | 969 | 1 |
Five candidate region intervals are triggered by discover
[0, 31)
- no read overlaps are detected for this interval[893, 950)
- 22 read overlaps, but no de novo path is produced as it can't find a valid start/end anchor kmer combination in the de Bruijn graph (more on this later)[1345, 1390)
- 1 read overlap, but GATB could not create a dBG as there were "no solid kmers"[1412, 1464)
- 19 read overlaps, but no de novo path is produced as it can't find a valid start/end anchor kmer combination in the de Bruijn graph [1636, 1688)
- 11 read overlaps, found 1 de novo pathOnly the candidate region [893, 950)
covers any of our 8 variants (4, 5, and 6).
Three parameters control the classification of candidate regions:
When we create a candidate region, we also add padding to the interval - 2 x de novo kmer size (22). So, for example, the candidate region [893, 950)
was triggered by the interval [(893+22), (950-22))
- i.e. [915-928)
. So this candidate region actually covers variants 5 and 6. Here is where the problems start...
Variants 1 and 2 sit in a zero-coverage interval [833, 858)
(length 24bp). So this interval exceeds the max. length of 20 - i.e. no candidate region.
Variants 3 and 4 sit in a zero-coverage interval [873, 897)
(length 23bp). So this interval exceeds the max. length of 20 - i.e. no candidate region.
Variants 5 and 6 sit in a zero-coverage interval [915, 928)
(length 12bp). So this interval meets all three criteria and is designated as a candidate region.
Variants 7 and 8 sit in a low-coverage (less than 2) interval [948, 972)
(length 23bp). So this interval exceeds the max. length of 20 - i.e. no candidate region.
So there are two main issues here:
The reason we need to merge is that if we solve issue 1 by increasing the maximum length we are still left with the problem that if the padding takes the start/end anchor kmers into a neighbouring low coverage region we won't find a start/end combination.
One thing I haven't figured out yet is why the interval between variants 5 and 6 ([915, 928)
) couldn't find a start/end anchor kmer combination. Looking at the coverages, there should be plenty of end kmers that exist in the dBG and a handful of start kmers too... I'll have to dig into this tomorrow.
If you made it to here, I salute you!
Amazing investigation!!
That all makes sense, and currently denovo seems to be suitable to find isolated-ish small variations, due to the length constraint. I do agree with increasing maximum length and merging candidate regions, but that might make the DBG fairly more complex. If needed, we can optimise that. I can see we can optimise well if we restrict ourselves to try to find just SNPs or small indels. For large indels, things get a bit tough... We can talk over this if DBGs indeed get complex
Today's task was to try and figure out why de novo found no paths for [893, 950)
It turns out de novo did find start/end anchor kmers for this interval! The issue was that after building the DFS tree (starting at the start kmer) the end kmer did not exist in the DFS tree - i.e. there is no path between the two in the dBG.
When we build the dBG we have a constraint of only using kmers with an abundance of 2 or more. I reduced this to 1 and got 6 paths for this candidate region.
Each of the 6 paths (correctly) finds variant 6 and 4 paths also find variant 5. However, if I update the PRG with these de novo paths we do get an ALT allele with these variants in it, but this ALT introduces 2 other errors. So we gain 2 in recall and lose 2 in precision. I don't think reducing the abundance requirement on the dBG is a good idea going forward.
Next, I tried using a de novo kmer size of 9 (instead of 11), but we are still unable to find a path between the start and end kmers.
If I set the maximum length of candidate regions to 30 I get 8 candidate regions that cover all 8 variants we are looking for. However... 2 of those candidates can't find a path from start to end, one can't find any start kmers, and another produces 3 paths (I haven't checked if these have the variants in them).
After discussing all of this with Zam there are two things to do from here:
All 8 of these variants exist in the CRYPTIC VCF so if we add more samples hopefully we can get some of them into the dense PRG.
After implementing merging of candidate regions in https://github.com/mbhall88/pandora/commit/6c00b971620a1d7ecd4ad95468bd06730397297d
I set the maximum candidate region length (before merging) to 76 and merge candidate intervals within 22 of each other (2 x de novo k-size).
There were 8 candidate intervals before merging, and 5 after. All of the 8 variants of interest were merged into a single interval [811, 994)
.
However, there was still no path between any start/end anchor kmers in the DFS tree.
After I went down the rabbit hole of building cortex graphs and visualising them with coverage I realised there was some low coverage links that could break the graph.
SO, after reducing the minimum abundance from 2 to 1 when building the de Bruijn graph with GATB, we get 6 paths for this interval with the 8 variants in it.
After adding these 6 paths back into the PRG and then genotyping, we recover all 8 variants correctly!!. We even call the variant that was missed by varifier
/the assembly. We do however, introduce a 1bp insertion. But that is a pretty good trade!
I will now rerun the varifier
analysis for the eight samples with pacbio assemblies and see how these changes impact precision and recall.
From what I understand, changing the minimum abundance from 2 to 1 means we retain all kmers from the mapped reads. Kmers appearing only once are very likely to be sequencing errors in the genome assembly context. The denovo assembly might be a slightly different use case, as we are assembling reads that we have a hint that map to a given region. And even if we assemble incorrect paths, regenotyping them will genotype towards the correct ones. I would not recommend reducing the minimum abundance to 1, but it might be that our mapping is a bit strict, might be hard to have 2+ reads overlapping all kmers of the correct path, and thus not breaking the graph, making it impossible to find the path. Even if it seems counter intuitive to allow more erroneous kmers in the graph, I think it does make sense due to the regenotyping, and this result speaks by itself.
I am really interested to know the impact of this in general precision/recall. I will want to try on the E. coli evaluation during paper review (denovo gets a ~3% recall increase now, it could be a lot more)!
Really nice work!
After rerunning with the fix in #54 and merging candidate regions in https://github.com/rmcolq/pandora/pull/234 this can be considered closed now
As outlined in https://github.com/mbhall88/head_to_head_pipeline/issues/48#issuecomment-692496911 pandora has very low recall on some samples. This issue will document the investigation of the cause for this. It will focus (at least initially) on sample mada_104 (dense PRG) as it has a recall (unfiltered) of 0.607698