We know for Plasmodium that the value of indexing kmers increases dramatically once you hit k=11 or so. Here is the plot showing (for the old P. falciparum whole genome PRG, which only had 40,000 sites or so) how the number of suffix array intervals changes as we extend a pattern match. We see that although the median is not big, the max can be enormous, and we know this heavily impacted quasimapping performance in Sorina implementation . There are 11mers overlapping 60,000 different sites in that P. falciparum PRG.
This is genome dependent - here is the plot for a PRG with the same structure (number of sites, gaps between them, and number of alleles for each site, and sequence of the alleles) but placed on a chunk of the human genome instead on p. falciparum
The systematic thing to do for a new PRG is to build these plots and choose a good k.
The upshot of all this is quasimapping is going to be very heavily impacted by restricting to very short kmers for P. falciparum, or conversely, if we can increase k to 11 or 12, we get a big boost.
Options, which I am stating, not advocating/pushing. Some are alternatives, and some could be combined
Increase the threshold for indexing all possible kmers up from 7 to 12 or so.
Allow the user to specify whether to index all (I don't like this option too much but it would allow me to explore the performance benefits of higher k)
For a whole genome PRG, discard the property of ignoring a read which does not overlap the kmer index. Having done this, index some easy subset of the kmers (eg all kmers overlapping allele 0 of each site - ie the reference allele, or maybe two alleles for each site or whatever), and then during mapping add new kmers to the index.
Note
Right now I possibly cannot benchmark the benefit of indexing at higher k if the current indexing is too slow. I'm in the process of benchmarking (https://github.com/iqbal-lab-org/gramtools/issues/117) but I expect not to be able to go above k7, whereas the graph above suggests there will be significant value in going higher.
We know for Plasmodium that the value of indexing kmers increases dramatically once you hit k=11 or so. Here is the plot showing (for the old P. falciparum whole genome PRG, which only had 40,000 sites or so) how the number of suffix array intervals changes as we extend a pattern match. We see that although the median is not big, the max can be enormous, and we know this heavily impacted quasimapping performance in Sorina implementation . There are 11mers overlapping 60,000 different sites in that P. falciparum PRG.
This is genome dependent - here is the plot for a PRG with the same structure (number of sites, gaps between them, and number of alleles for each site, and sequence of the alleles) but placed on a chunk of the human genome instead on p. falciparum
The systematic thing to do for a new PRG is to build these plots and choose a good k.
The upshot of all this is quasimapping is going to be very heavily impacted by restricting to very short kmers for P. falciparum, or conversely, if we can increase k to 11 or 12, we get a big boost.
Options, which I am stating, not advocating/pushing. Some are alternatives, and some could be combined
Increase the threshold for indexing all possible kmers up from 7 to 12 or so.
Allow the user to specify whether to index all (I don't like this option too much but it would allow me to explore the performance benefits of higher k)
For a whole genome PRG, discard the property of ignoring a read which does not overlap the kmer index. Having done this, index some easy subset of the kmers (eg all kmers overlapping allele 0 of each site - ie the reference allele, or maybe two alleles for each site or whatever), and then during mapping add new kmers to the index.
Note Right now I possibly cannot benchmark the benefit of indexing at higher k if the current indexing is too slow. I'm in the process of benchmarking (https://github.com/iqbal-lab-org/gramtools/issues/117) but I expect not to be able to go above k7, whereas the graph above suggests there will be significant value in going higher.