Investigate low recall for pandora variant calls

mbhall88 commented 3 years ago

As outlined in https://github.com/mbhall88/head_to_head_pipeline/issues/48#issuecomment-692496911 pandora has very low recall on some samples. This issue will document the investigation of the cause for this. It will focus (at least initially) on sample mada_104 (dense PRG) as it has a recall (unfiltered) of 0.607698

mbhall88 commented 3 years ago

Sample mada_104 has nanopore coverage of 36.48x

mbhall88 commented 3 years ago

how many loci did pandora miss?

pandora consensus

$ fastaq count_sequences /hps/nobackup/research/zi/projects/tech_wars/analysis/pandora_variants/nanopore/genotype/dense/madagascar/mada_104/pandora.consensus.fq.gz
2558

loci references

$ fastaq count_sequences /hps/nobackup/research/zi/projects/tech_wars/data/H37Rv_PRG/prgs/reference_loci/loci_reference.fa
2582

Summary

pandora missed 24 (0.93%) loci

Note: I looked at a couple of other samples and the finding is much the same

mbhall88 commented 3 years ago

If we take a look at a small cluster of FNs (called FPs in the varifier VCF) we can see that pandora has clearly missed something here. The top two tracks (gray and aqua) are the FP variants. The middle track is the pandora consensus sequences mapped to H37Rv (showing the genes were called by pandora), and the bottom track is the nanopore reads mapped to H37Rv.

Screenshot from 2020-09-16 12-21-22

mbhall88 commented 3 years ago

I have figured out why we don't find these variants (sort of).

Warning: this is me dumping all of my notes and thoughts from the day so I don't forget them. I've tried to make it clear but apologies if it is confusing.

Context

In the loci IGR:466668-468334 there are 8 variants we do not discover. These variants have the following coordinates and coverage along the max. likelihood path

id	ref_pos	loci_pos	covg
1	467516	848	0
2	467526	857	0
3	467546	877	0
4	467564	895	0
5	467585	916	0
6	467590	921	0
7	467621	952	0
8	467638	969	1

Candidate regions

Five candidate region intervals are triggered by discover

[0, 31) - no read overlaps are detected for this interval
[893, 950) - 22 read overlaps, but no de novo path is produced as it can't find a valid start/end anchor kmer combination in the de Bruijn graph (more on this later)
[1345, 1390) - 1 read overlap, but GATB could not create a dBG as there were "no solid kmers"
[1412, 1464) - 19 read overlaps, but no de novo path is produced as it can't find a valid start/end anchor kmer combination in the de Bruijn graph
[1636, 1688) - 11 read overlaps, found 1 de novo path

Only the candidate region [893, 950) covers any of our 8 variants (4, 5, and 6).

Three parameters control the classification of candidate regions:

Minimum required coverage at a base in max. likelihood path = 2
Minimum length of bases below the above coverage = 1
Maximum length of bases below the above coverage = 20

When we create a candidate region, we also add padding to the interval - 2 x de novo kmer size (22). So, for example, the candidate region [893, 950) was triggered by the interval [(893+22), (950-22)) - i.e. [915-928). So this candidate region actually covers variants 5 and 6. Here is where the problems start...

Variants 1 and 2 sit in a zero-coverage interval [833, 858) (length 24bp). So this interval exceeds the max. length of 20 - i.e. no candidate region.
Variants 3 and 4 sit in a zero-coverage interval [873, 897) (length 23bp). So this interval exceeds the max. length of 20 - i.e. no candidate region.
Variants 5 and 6 sit in a zero-coverage interval [915, 928) (length 12bp). So this interval meets all three criteria and is designated as a candidate region.
Variants 7 and 8 sit in a low-coverage (less than 2) interval [948, 972) (length 23bp). So this interval exceeds the max. length of 20 - i.e. no candidate region.

So there are two main issues here:

Our maximum length is potentially too short.
We probably need to merge candidate regions within a certain distance (past-us was aware of this)

The reason we need to merge is that if we solve issue 1 by increasing the maximum length we are still left with the problem that if the padding takes the start/end anchor kmers into a neighbouring low coverage region we won't find a start/end combination.

One thing I haven't figured out yet is why the interval between variants 5 and 6 ([915, 928)) couldn't find a start/end anchor kmer combination. Looking at the coverages, there should be plenty of end kmers that exist in the dBG and a handful of start kmers too... I'll have to dig into this tomorrow.

If you made it to here, I salute you!

leoisl commented 3 years ago

Amazing investigation!!

That all makes sense, and currently denovo seems to be suitable to find isolated-ish small variations, due to the length constraint. I do agree with increasing maximum length and merging candidate regions, but that might make the DBG fairly more complex. If needed, we can optimise that. I can see we can optimise well if we restrict ourselves to try to find just SNPs or small indels. For large indels, things get a bit tough... We can talk over this if DBGs indeed get complex

mbhall88 commented 3 years ago

Today's task was to try and figure out why de novo found no paths for [893, 950)

It turns out de novo did find start/end anchor kmers for this interval! The issue was that after building the DFS tree (starting at the start kmer) the end kmer did not exist in the DFS tree - i.e. there is no path between the two in the dBG.

When we build the dBG we have a constraint of only using kmers with an abundance of 2 or more. I reduced this to 1 and got 6 paths for this candidate region.

Each of the 6 paths (correctly) finds variant 6 and 4 paths also find variant 5. However, if I update the PRG with these de novo paths we do get an ALT allele with these variants in it, but this ALT introduces 2 other errors. So we gain 2 in recall and lose 2 in precision. I don't think reducing the abundance requirement on the dBG is a good idea going forward.

Next, I tried using a de novo kmer size of 9 (instead of 11), but we are still unable to find a path between the start and end kmers.

If I set the maximum length of candidate regions to 30 I get 8 candidate regions that cover all 8 variants we are looking for. However... 2 of those candidates can't find a path from start to end, one can't find any start kmers, and another produces 3 paths (I haven't checked if these have the variants in them).

After discussing all of this with Zam there are two things to do from here:

Merge candidate regions - the small gaps between these variants in our example is causing problems where we cant get good anchor kmers and find paths between them
Increase the density of our PRG to provide a stronger prior to work with.

All 8 of these variants exist in the CRYPTIC VCF so if we add more samples hopefully we can get some of them into the dense PRG.

mbhall88 commented 3 years ago

After implementing merging of candidate regions in https://github.com/mbhall88/pandora/commit/6c00b971620a1d7ecd4ad95468bd06730397297d

I set the maximum candidate region length (before merging) to 76 and merge candidate intervals within 22 of each other (2 x de novo k-size).

There were 8 candidate intervals before merging, and 5 after. All of the 8 variants of interest were merged into a single interval [811, 994).

However, there was still no path between any start/end anchor kmers in the DFS tree.

After I went down the rabbit hole of building cortex graphs and visualising them with coverage I realised there was some low coverage links that could break the graph.

SO, after reducing the minimum abundance from 2 to 1 when building the de Bruijn graph with GATB, we get 6 paths for this interval with the 8 variants in it.

After adding these 6 paths back into the PRG and then genotyping, we recover all 8 variants correctly!!. We even call the variant that was missed by varifier/the assembly. We do however, introduce a 1bp insertion. But that is a pretty good trade!

I will now rerun the varifier analysis for the eight samples with pacbio assemblies and see how these changes impact precision and recall.

leoisl commented 3 years ago

From what I understand, changing the minimum abundance from 2 to 1 means we retain all kmers from the mapped reads. Kmers appearing only once are very likely to be sequencing errors in the genome assembly context. The denovo assembly might be a slightly different use case, as we are assembling reads that we have a hint that map to a given region. And even if we assemble incorrect paths, regenotyping them will genotype towards the correct ones. I would not recommend reducing the minimum abundance to 1, but it might be that our mapping is a bit strict, might be hard to have 2+ reads overlapping all kmers of the correct path, and thus not breaking the graph, making it impossible to find the path. Even if it seems counter intuitive to allow more erroneous kmers in the graph, I think it does make sense due to the regenotyping, and this result speaks by itself.

I am really interested to know the impact of this in general precision/recall. I will want to try on the E. coli evaluation during paper review (denovo gets a ~3% recall increase now, it could be a lot more)!

Really nice work!

mbhall88 commented 3 years ago

After rerunning with the fix in #54 and merging candidate regions in https://github.com/rmcolq/pandora/pull/234 this can be considered closed now

mbhall88 / head_to_head_pipeline