Open ctb opened 7 years ago
(this is a technique that MetaPalette uses)
@dkoslicki @bluegenes this may be a wild stab in the dark, but can we use the new ANI code as a way to "standardize" comparisons between different k-mer sizes to replicate the k-mer size response curve as used in MetaPalette? specifically, this:
We present the algorithm MetaPalette, which uses long k-mer sizes (k = 30, 50) to fit a k-mer “palette” of a given sample to the k-mer palette of reference organisms. By modeling the k-mer palettes of unknown organisms, the method also gives an indication of the presence, abundance, and evolutionary relatedness of novel organisms present in the sample.
(yes, this is functionality I've been wanting for a long time!)
I would imagine it to be as simple (he says, optimistically) as...
cc @bryshalm
Actually, now that I'm thinking of it, I wonder if we could just be using variations in estimated overlap based on k-mer size, without going through the ANI formula?
Anyway, something to consider poking at.
In the FracMinHash sketches, does sourmash save the k-mers that led to the hash values, or just the hash values? If the former, then you could do something like this in order to compute multiple sketch sizes. Not trying to sound like a broken record, just curious :) Though this may not be an issue after the rust-ification of sourmash given the ridiculous speed.
My intuition is similar to yours: ANI/containment values over a range of k's should be able to disentangle strain variation vs incomplete presence. Are you thinking demonstrating this via simulation, or simulation + theory?
In the FracMinHash sketches, does sourmash save the k-mers that led to the hash values, or just the hash values? If the former, then you could do something like this in order to compute multiple sketch sizes. Not trying to sound like a broken record, just curious :) Though this may not be an issue after the rust-ification of sourmash given the ridiculous speed.
just the hash values.
My intuition is similar to yours: ANI/containment values over a range of k's should be able to disentangle strain variation vs incomplete presence. Are you thinking demonstrating this via simulation, or simulation + theory?
excellent!
I'd probably start with real data + simulation :). But I'll loop you in as it looks promising!
In the FracMinHash sketches, does sourmash save the k-mers that led to the hash values, or just the hash values? If the former, then you could do something like this in order to compute multiple sketch sizes. Not trying to sound like a broken record, just curious :)
I don't think this would work, because the k-mers that lead to retained hashes are completely different between different k-sizes. Right? It has to be done on the input side, not on the output side?
This is not really a good idea in many situations, but I felt it was relevant - we could just store the sequences themselves and produces sketches upon demand, per https://github.com/sourmash-bio/sourmash/issues/1647#issuecomment-1022298264 :)
I don't think this would work, because the k-mers that lead to retained hashes are completely different between different k-sizes. Right? It has to be done on the input side, not on the output side?
Surprisingly enough, it does work! The intuition is that when storing the k-mers leading to the hashes in (Frac)MinHash, results in a random sample of k-mers, and so if you truncate these k-mers to, say (k-1)-mers, you still have a random sample, save for a small bias introduced by prefix matches. But this bias factor is small in practice.
But no matter, as long as you're ok with storing hashes for a bunch of different k sizes.
I'd probably start with real data + simulation :). But I'll loop you in as it looks promising!
Great! If it does look promising, Mahmudur and I can start working on the theory
I don't think this would work, because the k-mers that lead to retained hashes are completely different between different k-sizes. Right? It has to be done on the input side, not on the output side?
Surprisingly enough, it does work! The intuition is that when storing the k-mers leading to the hashes in (Frac)MinHash, results in a random sample of k-mers, and so if you truncate these k-mers to, say (k-1)-mers, you still have a random sample, save for a small bias introduced by prefix matches. But this bias factor is small in practice.
hah, indeed, you have a random sample, but it's not equivalent to the sketch you would have formed for the smaller k-size when operating on the complete data!
Interesting kind of sketch tho. Hmm.
so, actually, that IS a different kind of sketch, but it's pretty efficient - if I'm guessing right,
neat!
Indeed! You're right that it's not a FracMinHash/MinHash/bottom sketch for the same hash function, but there exists a hash function such that it would be a bottom sketch for that hash function (modulo prefix collisions).
ok, great, I think I understand it all!
I've been looking for an alternative to MinHash and FracMinHash that would let us generalize the sourmash code base while offering similar time/space/operations. This seems like a good 'un to try! thank you!
discussion of new sketch type moved here :) https://github.com/sourmash-bio/sourmash/issues/2039
back to the original topic - I got excited and just decided try it out.
We do see clear separation of curve slopes, based on different amounts of overlap. thoughts on a next analysis/plot?
See https://github.com/ctb/2022-multi-ksize-strain/blob/main/multiksize.ipynb for full deets.
@bluegenes @dkoslicki
Huh, I would've expected the ANI to go down, not up. Did the containments show a similar behavior? I usually like going down to some really small k-mer size, since I've often seen some sort of "phase transition" happen in the 10-30 range
yeah, some digging needed. I can explain it to myself but I'm not sure I'm right, so will forebear!
built a lot of k-sizes for Shewanella per https://github.com/sourmash-bio/sourmash-examples/issues/7, got the following:
Even more counter intuitive behavior! Later tonight (if I get a chance), I'll dig into it further and see if the corresponding containment values are making sense
when we see a number like "50%" in containment it could be caused by missing content (e.g. low coverage but otherwise exact matches) or strain variation. It seems like we should be able to do a k-mer size response curve to tell the difference between these two - basically, how fast does the similarity drop off at k 21,31,41,51?