matsengrp / TemplatedMutagenesis-1

0 stars 0 forks source link

Pre-fetching k-mers for matching #5

Closed matsen closed 6 years ago

matsen commented 6 years ago

It seems to me that if we want exact matches to scale to big data sets, the following would be viable.

Simply scan through the "donor" set of genes for all k-mers and tabulate where they occur. This will result in a dictionary keyed on the k-mer, and when we see a mutation we can simply find the matches with a dictionary lookup. Note dictionary accesses are O(1).

This dictionary shouldn't be very big-- with 400 genes of length 300, we have 120,000 sites, which is an upper limit of the number of k-mers. The real number is probably 1/5th of that because of shared k-mers (this will depend on k-mer length), but even 120K is not that bad.

In fact, taking this to its logical extent, we might consider just making a single table of the "donor" k-mers, and when we scan sequences just record the kmer for each mutation. The rest of the information is in the k-mer table.

jfukuyama commented 6 years ago

This worked well, it's much faster now.