Closed ksavhughes closed 4 months ago
You are right about the k-mer/minimizer. The minimizer is the name given to the chosen k-mer (k=19 by default) in a window (w=31 by default). The minimizer is not randomly picked: the lexicographically smallest k-mer of all the k-mers in the window is chosen. Note that the k-mers in a window are also considered in their reverse-complement form. So there are w - k + 1 k-mers in a window, double that when counting the reverse complements: (31 - 19 + 1) * 2 = 26 k-mers per window, and just one of them is selected and stored.
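The selection rule above can be sketched in a few lines. This is a minimal illustration assuming plain lexicographic order on the k-mer strings, as described; real tools (ganon included) typically work on hashed or bit-encoded k-mers, and the function names here are made up for the example:

```python
# Minimal sketch of per-window minimizer selection (illustrative,
# not ganon's implementation).

def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))

def window_minimizer(window: str, k: int = 19) -> str:
    """Lexicographically smallest k-mer in the window, considering
    both strands: 2 * (len(window) - k + 1) candidates, one survivor."""
    kmers = [window[i:i + k] for i in range(len(window) - k + 1)]
    kmers += [revcomp(km) for km in kmers]  # reverse-complement forms
    return min(kmers)  # only this single k-mer is stored

window = "ACGTACGTACGTACGTACGTACGTACGTACG"  # one 31 bp window (w = 31)
print(window_minimizer(window))
```

With the defaults, each 31 bp window yields 13 forward k-mers plus their 13 reverse complements, exactly the 26 candidates mentioned above.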
The input reads are split based on the same rules and values for w and k explained above for the reference genomes, so the answer to your question is 70 (100 - 31 + 1 windows).
Note that you may have at most 70 minimizers in a sequence of 100 bp. Usually this number is much smaller, since overlapping windows close to each other will often have the same minimizer, and it is stored just once. That is also another reason the lexicographic order is used. This means that if you look at the 3rd column of the .all or .one files generated by ganon classify for reads of 100 bp, you will see values in the low 20s to 30s, depending on your sequences.
The thresholds for cutoff and filter consider the max. number of minimizers as the upper bound for the relative calculation. For example: a 100 bp sequence could theoretically have 70 distinct minimizers but may in practice have just 30 unique minimizers. Those 30 minimizers are queried against the references, and if just 15 of them match a certain species, you could say you have a 50% match. This percentage can be used to filter results via --rel-cutoff and --rel-filter.
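The percentage in that example boils down to simple arithmetic, sketched below (illustrative only; this is not ganon's internal handling of --rel-cutoff / --rel-filter, and the cutoff value shown is a made-up example):

```python
# Sketch of the relative-match percentage from the example above.

def rel_match(matching: int, total: int) -> float:
    """Fraction of a read's minimizers that hit one reference."""
    return matching / total

max_minimizers = 100 - 31 + 1   # theoretical upper bound: 70
unique_in_read = 30             # what the read actually yields
matching_one_species = 15       # minimizers shared with one species

score = rel_match(matching_one_species, unique_in_read)
print(f"{score:.0%} match")     # 15 / 30 -> 50% match
rel_cutoff = 0.25               # e.g. a relative cutoff of 0.25
print("kept" if score >= rel_cutoff else "discarded")
```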
I have a few clarifying questions about the k-mer/minimizer stuff because it keeps tripping me up. Correct me if I'm wrong, but the way I picture the k-mer/window-size thing working is that a reference genome has a certain number of windows of length 31 (if we're using the defaults), and only one k-mer of length 19 is actually stored in the database per window. Is the k-mer that is kept per window random? Is there any filtering for low-complexity or repeat regions?
Now moving on to how this affects classification, specifically how the cutoff and filter thresholds function... Are input reads split up based on the window size or the k-mer size? For example, since the window size (31) is larger than the k-mer size (19), would the number of possible k-mers in a read of length 100 bp be 82 or 70 total? I'm assuming the reads are split based on the k-mer size, but I wanted to double-check, especially if I'm wrong about the first part of this message!