can we distinguish between strain variation and missing content?

ctb commented 7 years ago

when we see a number like "50%" in containment it could be caused by missing content (e.g. low coverage but otherwise exact matches) or strain variation. It seems like we should be able to do a k-mer size response curve to tell the difference between these two - basically, how fast does the similarity drop off at k 21,31,41,51?

ctb commented 4 years ago

(this is a technique that MetaPalette uses)

ctb commented 2 years ago

@dkoslicki @bluegenes this may be a wild stab in the dark, but can we use the new ANI code as a way to "standardize" comparisons between different k-mer sizes to replicate the k-mer size response curve as used in MetaPalette? specifically, this:

We present the algorithm MetaPalette, which uses long k-mer sizes (k = 30, 50) to fit a k-mer “palette” of a given sample to the k-mer palette of reference organisms. By modeling the k-mer palettes of unknown organisms, the method also gives an indication of the presence, abundance, and evolutionary relatedness of novel organisms present in the sample.

(yes, this is functionality I've been wanting for a long time!)

I would imagine it to be as simple (he says, optimistically) as...

calculate ANI for multiple k-mer sizes
if ANI decreases for longer k-mer sizes (it should never increase, right?) then that indicates presence of strain variation!
if ANI stays the same, then that indicates that the sequence being matched is present as an ~exact match, but only part of it is present.

cc @bryshalm

ctb commented 2 years ago

Actually, now that I'm thinking of it, I wonder if we could just be using variations in estimated overlap based on k-mer size, without going through the ANI formula?

Anyway, something to consider poking at.

dkoslicki commented 2 years ago

In the FracMinHash sketches, does sourmash save the k-mers that led to the hash values, or just the hash values? If the former, then you could do something like this in order to compute multiple sketch sizes. Not trying to sound like a broken record, just curious :) Though this may not be an issue after the rust-ification of sourmash given the ridiculous speed.

My intuition is similar to yours: ANI/containment values over a range of k's should be able to disentangle strain variation vs incomplete presence. Are you thinking demonstrating this via simulation, or simulation + theory?

ctb commented 2 years ago

In the FracMinHash sketches, does sourmash save the k-mers that led to the hash values, or just the hash values? If the former, then you could do something like this in order to compute multiple sketch sizes. Not trying to sound like a broken record, just curious :) Though this may not be an issue after the rust-ification of sourmash given the ridiculous speed.

just the hash values.

My intuition is similar to yours: ANI/containment values over a range of k's should be able to disentangle strain variation vs incomplete presence. Are you thinking demonstrating this via simulation, or simulation + theory?

excellent!

I'd probably start with real data + simulation :). But I'll loop you in as it looks promising!

ctb commented 2 years ago

In the FracMinHash sketches, does sourmash save the k-mers that led to the hash values, or just the hash values? If the former, then you could do something like this in order to compute multiple sketch sizes. Not trying to sound like a broken record, just curious :)

I don't think this would work, because the k-mers that lead to retained hashes are completely different between different k-sizes. Right? It has to be done on the input side, not on the output side?

This is not really a good idea in many situations, but I felt it was relevant - we could just store the sequences themselves and produces sketches upon demand, per https://github.com/sourmash-bio/sourmash/issues/1647#issuecomment-1022298264 :)

dkoslicki commented 2 years ago

I don't think this would work, because the k-mers that lead to retained hashes are completely different between different k-sizes. Right? It has to be done on the input side, not on the output side?

Surprisingly enough, it does work! The intuition is that when storing the k-mers leading to the hashes in (Frac)MinHash, results in a random sample of k-mers, and so if you truncate these k-mers to, say (k-1)-mers, you still have a random sample, save for a small bias introduced by prefix matches. But this bias factor is small in practice.

But no matter, as long as you're ok with storing hashes for a bunch of different k sizes.

I'd probably start with real data + simulation :). But I'll loop you in as it looks promising!

Great! If it does look promising, Mahmudur and I can start working on the theory

ctb commented 2 years ago

I don't think this would work, because the k-mers that lead to retained hashes are completely different between different k-sizes. Right? It has to be done on the input side, not on the output side?

Surprisingly enough, it does work! The intuition is that when storing the k-mers leading to the hashes in (Frac)MinHash, results in a random sample of k-mers, and so if you truncate these k-mers to, say (k-1)-mers, you still have a random sample, save for a small bias introduced by prefix matches. But this bias factor is small in practice.

hah, indeed, you have a random sample, but it's not equivalent to the sketch you would have formed for the smaller k-size when operating on the complete data!

Interesting kind of sketch tho. Hmm.

ctb commented 2 years ago

so, actually, that IS a different kind of sketch, but it's pretty efficient - if I'm guessing right,

compute a "founder" FracMinHash sketch for k=51, storing k-mers sequences as well.
then, descendant sketches (no longer FracMinHash bottom sketches, but still comparable as long as they share the same founder ksize) can be computed for any k < 51.

neat!

dkoslicki commented 2 years ago

Indeed! You're right that it's not a FracMinHash/MinHash/bottom sketch for the same hash function, but there exists a hash function such that it would be a bottom sketch for that hash function (modulo prefix collisions).

ctb commented 2 years ago

ok, great, I think I understand it all!

I've been looking for an alternative to MinHash and FracMinHash that would let us generalize the sourmash code base while offering similar time/space/operations. This seems like a good 'un to try! thank you!

ctb commented 2 years ago

discussion of new sketch type moved here :) https://github.com/sourmash-bio/sourmash/issues/2039

ctb commented 2 years ago

back to the original topic - I got excited and just decided try it out.

Screen Shot 2022-05-06 at 10 53 07 AM

We do see clear separation of curve slopes, based on different amounts of overlap. thoughts on a next analysis/plot?

See https://github.com/ctb/2022-multi-ksize-strain/blob/main/multiksize.ipynb for full deets.

@bluegenes @dkoslicki

dkoslicki commented 2 years ago

Huh, I would've expected the ANI to go down, not up. Did the containments show a similar behavior? I usually like going down to some really small k-mer size, since I've often seen some sort of "phase transition" happen in the 10-30 range

ctb commented 2 years ago

yeah, some digging needed. I can explain it to myself but I'm not sure I'm right, so will forebear!

ctb commented 2 years ago

built a lot of k-sizes for Shewanella per https://github.com/sourmash-bio/sourmash-examples/issues/7, got the following:

Screen Shot 2022-05-06 at 2 33 38 PM

dkoslicki commented 2 years ago

Even more counter intuitive behavior! Later tonight (if I get a chance), I'll dig into it further and see if the corresponding containment values are making sense

ctb commented 2 years ago

you see the expect behavior when you look at intersect_bp across the ksizes -

anyway, all of the relevant CSVs (and code) are in the repo.

sourmash-bio / sourmash

can we distinguish between strain variation and missing content? #267