sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
466 stars 79 forks source link

sbt_gather documentation #129

Closed phiweger closed 7 years ago

phiweger commented 7 years ago

Dear all,

What does the sbt_gather command do exactly? On Titus' blog the following appears in the comments section:

Second, the asymmetry between query subsampling and subject subsampling. We have a solution to this but it's still several blog posts out and the software is not robust. Regardless, it is all in the sourmash version you have installed -- build bigger minhashes for your query by doing something like '-n 10000' and then look at the 'sbt_gather' command.

and that sbt_gather

Tries to pull a MinHash signature apart into its constituents.

  1. What does that mean exactly? I managed to run the command, but ran into more questions:

  2. Why can it only be used with signatures that have been computed with both, --with-cardinality and --scaled?

  3. What does --scaled do? E.g. what does --scaled 50 "mean"? I see that the signature is shorter than what I set with the -n argument, and that the "number" field in the signature JSON has some large integer in it, but I couldn't figure out from these changes what exactly occurs. If I were to guess is that it is related to the minhash "sampling rate", i.e. n / cardinality?

  4. Does the sbt_gather method have any relation to e.g. this asymmetric minhash publication?

Thank you very much for your help!

ctb commented 7 years ago

Hi, sbt_gather is a demo implementation based on some research concepts that we haven't fully explored. @philliptbrooks is working on evaluating the basic approach, and I would be happy to put you in touch with him via e-mail (titus@idyll.org to reach me).

A few overly specific answers:

I had not seen the asymmetric minhash publication - thanks! I'll have to read it and understand it before commenting.

phiweger commented 7 years ago

Ah, I see. Could you elaborate on what "banded shingling" is? Does it mean splitting the full input set of k-mers/ their hashes into bands ? If so, given b bands, you then select the x_b smallest hash values from each band so that sum(x) == desired_resolution, right?

Hearing "band" and "minhash" reminds me of the "metahashing" that you can do to a minhash signature, which reduces the nearest-neighbor search space significantly, see chapter 3 of MMDS, 3.4.1. Wondering if this is related.

ctb commented 7 years ago

On Tue, Feb 07, 2017 at 11:49:05PM -0800, viehwegerlib wrote:

Ah, I see. Could you elaborate on what "banded shingling" is? Does it mean splitting the full input set of k-mers/ their hashes into bands ? If so, given b bands, you then select the x_b smallest hash values from each band so that sum(x) == desired_resolution, right?

Almost - what we're doing is selecting one band (the lowest, currently, but can be any band) to pay attention to, and discarding the rest of the kmers. The number of bands ~ band size is chosen to set the desired resolution, e.g. if --scaled L, then 4**64/L bands, and 1/L of the k-mers are sampled for signature creation.

Hearing "band" and "minhash" reminds me of the "metahashing" that you can do to a minhash signature, which reduces the nearest-neighbor search space significantly, see chapter 3 of MMDS, 3.4.1. Wondering if this is related.

I don't think it is, directly, but all of this stuff is entangled :)

phiweger commented 7 years ago

Lets see if I understand correctly:

The number of bands ~ band size is chosen to set the desired resolution, e.g. if --scaled L, then 4**64/L bands, and 1/L of the k-mers are sampled for signature creation.

  1. The "lowest" band means the lowest lexicographically ordered set of k-mers (of size_band), i.e. the k-mer space is lexicographically ordered prior to applying bands?
  2. Sorry if this might be obvious, but where does the "64" come from?
luizirber commented 7 years ago

64 bits is the size of our hash function output (and that 464 should have been 264)

On Wed, Feb 15, 2017 at 12:49 PM, viehwegerlib notifications@github.com wrote:

Lets see if I understand correctly:

The number of bands ~ band size is chosen to set the desired resolution, e.g. if --scaled L, then 4**64/L bands, and 1/L of the k-mers are sampled for signature creation.

  1. The "lowest" band means the lowest lexicographically ordered set of k-mers (of size_band), i.e. the k-mer space is lexicographically ordered prior to applying bands?
  2. Sorry if this might be obvious, but where does the "64" come from?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/dib-lab/sourmash/issues/129#issuecomment-280134398, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAZ8jliS6ktNdwg8JLcla6PZ6we2FfEks5rc2TmgaJpZM4L5a9h .

phiweger commented 7 years ago

Ah, ok. What I don't get is this: You focus on one (arbitrary) band of k-mers, discarding all the others. Would it make a difference if you selected n random k-mers, where n is the size of one band? (I promise to shut up after that question :)

luizirber commented 7 years ago

We are picking the lowest band because it is a MinHash, if you get the highest band you get a MaxHash.

Being consistent in how you choose kmers (minimum, or maximum) is a way of biasing the selection to make signatures comparable (which is harder if you select random kmers).

(And please don't shut up =] )

On Wed, Feb 15, 2017 at 1:01 PM, viehwegerlib notifications@github.com wrote:

Ah, ok. What I don't get is this: You focus on one (arbitrary) band of k-mers, discarding all the others. Would it make a difference if you selected n random k-mers, where n is the size of one band? (I promise to shut up after that question :)

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/dib-lab/sourmash/issues/129#issuecomment-280137587, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAZ8hMJuXOD8vAA9g6mB3tA1E3bTNKVks5rc2fBgaJpZM4L5a9h .

phiweger commented 7 years ago

I see, this confused me a little:

Almost - what we're doing is selecting one band (the lowest, currently, but can be any band).

Makes perfect sense. Thanks a lot!