iqbal-lab-org / gramtools

Genome inference from a population reference graph
MIT License
92 stars 15 forks source link

We may need kmer indexing for higher k to enable acceptable quasimapping #124

Closed iqbal-lab closed 6 years ago

iqbal-lab commented 6 years ago

We know for Plasmodium that the value of indexing kmers increases dramatically once you hit k=11 or so. Here is the plot showing (for the old P. falciparum whole genome PRG, which only had 40,000 sites or so) how the number of suffix array intervals changes as we extend a pattern match. We see that although the median is not big, the max can be enormous, and we know this heavily impacted quasimapping performance in Sorina implementation . There are 11mers overlapping 60,000 different sites in that P. falciparum PRG.

sa_intervals_fig1

This is genome dependent - here is the plot for a PRG with the same structure (number of sites, gaps between them, and number of alleles for each site, and sequence of the alleles) but placed on a chunk of the human genome instead on p. falciparum

sa_intervals_fig2_human

The systematic thing to do for a new PRG is to build these plots and choose a good k.

The upshot of all this is quasimapping is going to be very heavily impacted by restricting to very short kmers for P. falciparum, or conversely, if we can increase k to 11 or 12, we get a big boost.

Options, which I am stating, not advocating/pushing. Some are alternatives, and some could be combined

  1. Increase the threshold for indexing all possible kmers up from 7 to 12 or so.

  2. Allow the user to specify whether to index all (I don't like this option too much but it would allow me to explore the performance benefits of higher k)

  3. For a whole genome PRG, discard the property of ignoring a read which does not overlap the kmer index. Having done this, index some easy subset of the kmers (eg all kmers overlapping allele 0 of each site - ie the reference allele, or maybe two alleles for each site or whatever), and then during mapping add new kmers to the index.

Note Right now I possibly cannot benchmark the benefit of indexing at higher k if the current indexing is too slow. I'm in the process of benchmarking (https://github.com/iqbal-lab-org/gramtools/issues/117) but I expect not to be able to go above k7, whereas the graph above suggests there will be significant value in going higher.

ffranr commented 6 years ago

@iqbal-lab Do you agree that this issue can be closed for now given the recent changes (see: https://github.com/iqbal-lab-org/gramtools/commit/42018b4dc1033342127a9bf247e6aab2eecd5c91)?

iqbal-lab commented 6 years ago

yes