sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

emit during stream #283

Closed: phiweger closed this issue 7 years ago

phiweger commented 7 years ago

Is there a way, when streaming sequences into sourmash, to "emit" the signature every now and then while the overall process continues? Like peeking at the result.

ctb commented 7 years ago

not currently, although there is no barrier to it algorithmically or from the API perspective.
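For illustration, here is a minimal sketch of what periodic emission could look like, using a toy scaled-style MinHash in pure Python (this is not sourmash's actual implementation; the class, hash function, and parameter values are stand-ins):

```python
import hashlib

def hash_kmer(kmer):
    # Stable 64-bit hash of a k-mer (sourmash uses MurmurHash; sha1 is a stand-in).
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")

class StreamingSketch:
    """Toy scaled-MinHash: keep every k-mer hash below a fixed threshold."""
    def __init__(self, ksize=4, scaled=2):
        self.ksize = ksize
        self.max_hash = 2**64 // scaled   # keep roughly 1/scaled of all hashes
        self.hashes = set()

    def add_sequence(self, seq):
        for i in range(len(seq) - self.ksize + 1):
            h = hash_kmer(seq[i:i + self.ksize])
            if h < self.max_hash:
                self.hashes.add(h)

    def emit(self):
        # "Peek" at the sketch mid-stream; streaming continues unaffected,
        # since emitting only reads the current hash set.
        return frozenset(self.hashes)

sketch = StreamingSketch()
snapshots = []
for n, read in enumerate(["ACGTACGT", "TTTTGGGG", "ACGTTTTT"], 1):
    sketch.add_sequence(read)
    if n % 2 == 0:            # emit every 2 reads (every N reads in practice)
        snapshots.append(sketch.emit())
```

In practice the snapshot step would serialize the in-progress signature to disk; nothing about the sketch itself needs to pause.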

phiweger commented 7 years ago

ah yes, of course (just remembered the corresponding API section).

given a potentially infinite metagenomic data set, but from the same "material" (e.g. some river water), would you then prefer to

  • compute one minhash signature and update it continuously, or
  • compute multiple signatures sequentially based on windows in the data stream? (in a surveillance context)

basically my question refers to something like "signature saturation/convergence". i did not find any clear advice in the classical minhash literature of Broder et al.

thank you very much!

ctb commented 7 years ago

On Mon, Jun 12, 2017 at 06:27:06AM -0700, Adrian Viehweger wrote:

> ah yes, of course (just remembered the corresponding API section).

:)

> given a potentially infinite metagenomic data set, but from the same "material" (e.g. some river water) would you then prefer to
>
> • compute one minhash signature and update it continuously or
> • compute multiple signatures sequentially based on windows in the data stream? (in a surveillance context)
>
> basically my question refers to something like "signature saturation/convergence". i did not find any clear advice in the classical minhash literature of Broder et al.
>
> thank you very much!

ooooh I have lots of thoughts but almost all guesswork/intuition at this point!

With large metagenomic data sets, there is no natural breakpoint or transition at which to stop and call the signature "done".

So, one of the primary motivations of --scaled is that you only ever need to calculate a single signature for an arbitrarily large data set, and the resolution specified with --scaled gives you the desired sensitivity without knowing how big the data set is. This is precisely to address the streaming use case above ;)
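As a back-of-the-envelope illustration of why --scaled is size-independent (the numbers here are assumed for illustration, not sourmash internals): a scaled sketch retains every k-mer hash below a fixed threshold, so roughly 1 in `scaled` distinct k-mers is kept no matter how long the stream runs.

```python
scaled = 1000
max_hash = 2**64 // scaled        # fixed retention threshold

def keep(h, max_hash=max_hash):
    # A hash is retained iff it falls below the threshold,
    # i.e. roughly 1 in `scaled` of all distinct k-mers.
    return h < max_hash

# Expected fraction of distinct k-mers retained, independent of stream length:
fraction = max_hash / 2**64       # ~1/1000
```

Because the decision depends only on the hash value, re-seeing a k-mer never grows the sketch, and the sketch size tracks the distinct k-mer content of the sample rather than the number of reads.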

We could pretty easily do something like pause every 1000 hashes, run gather, classify them, and then ignore those classified hashes for future gather execution. It would require some hacking but nothing tricky. See #279 for a similar idea.
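That pause-and-classify loop could look roughly like this; `classify_hashes` below is a hypothetical stand-in for a gather-based classifier, not a sourmash API:

```python
def classify_hashes(hashes):
    # Hypothetical stand-in for running `gather` on a batch of hashes;
    # here we arbitrarily "classify" the even ones to keep the sketch runnable.
    return {h for h in hashes if h % 2 == 0}

def periodic_gather(stream, batch_size=1000):
    classified = set()      # hashes already assigned; skipped in future rounds
    pending = set()
    for h in stream:
        if h in classified:
            continue
        pending.add(h)
        if len(pending) >= batch_size:
            # pause every `batch_size` hashes, classify, and drop the
            # classified hashes from future consideration
            classified |= classify_hashes(pending)
            pending -= classified
    return classified, pending

done, todo = periodic_gather(range(2500), batch_size=1000)
```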

One important aspect, though: you need to run some sort of k-mer abundance trimming so as to avoid collecting all the sequencing errors. We are partial to trim-low-abund.py in khmer (which is streaming and will do the above), but we haven't worked out parameters. You might first look at the work that Pall Melsted did with mash, where all single-count k-mers are ignored until you see them again. You could do the same, albeit less efficiently, by running sourmash with --track-abundance and then ignoring all hashes with a count of 1.
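The ignore-singletons-until-seen-again idea can be sketched in a few lines (a toy illustration, not mash's or khmer's actual implementation):

```python
def seen_twice(stream):
    """Keep only hashes observed at least twice in the stream.

    A hash is ignored on first sight and admitted to the sketch on the
    second, which discards most hashes arising from sequencing errors.
    """
    seen_once, kept = set(), set()
    for h in stream:
        if h in kept:
            continue
        if h in seen_once:
            kept.add(h)
        else:
            seen_once.add(h)
    return kept

# 12 and 7 appear only once, so they are treated as likely errors.
kept = seen_twice([5, 9, 5, 12, 7, 9, 5])
```

The `seen_once` set is the memory cost of this trick; a Bloom filter can make it approximate but constant-space.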

(We have not done much parameter testing and on truly large metagenomes you will want to use something like trim-low-abund.py.)

Anyway... lots to think about. Happy to keep chatting!

phiweger commented 7 years ago

thank you very much for this in-depth answer, which should get me started.

what I am wondering about is changes in a metagenome. say we sequence ice cream, and suddenly salmonella is introduced. would the --scaled signature "take notice" immediately, or would the weight of the previous data cause some sort of lag before the signature "responds" (very hand-wavy, sorry)?

ctb commented 7 years ago

On Mon, Jun 12, 2017 at 01:04:29PM -0700, Adrian Viehweger wrote:

> thank you very much for this in depth answer, which should get me started.
>
> what I am wondering about is changes in a metagenome. say we sequence ice cream, and suddenly salmonella is introduced. would the --scaled signature "take notice" immediately or would the weight of the previous data cause some sort of lag before the signature "responds" (very hand wavey, sorry)?

the new hashes would accrue immediately, but the overall similarity would of course change slowly.

but here you're talking about introducing a new sample...

it would certainly be straightforward to compare a window from the last 10m reads with the next 10m reads. Figuring out how to pick the window size is left as an exercise for the reader :)
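A windowed comparison might be sketched like this, with small integer sets standing in for the per-window sketches:

```python
def jaccard(a, b):
    """Jaccard similarity between two sketches (sets of hashes)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Sketches of two consecutive windows of the stream; a sudden drop in
# similarity between adjacent windows would flag a change in the community.
window_prev = {1, 2, 3, 4}
window_next = {3, 4, 5, 6}
sim = jaccard(window_prev, window_next)   # 2 shared out of 6 total
```

Comparing window-to-window sidesteps the lag of a single cumulative signature, at the cost of having to choose the window size.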

phiweger commented 7 years ago

hehe thank you :)

phiweger commented 7 years ago

Also: can scaled signatures be merged the same way unnormalized sigs can, i.e. by concatenating the hashes and selecting the smallest k? I guess the result is a normalized merged signature, right?

ctb commented 7 years ago

yes - but simply concatenate the hashes; no subselection is needed.
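In other words: merging two scaled sketches (with the same ksize and scaled) is a plain set union, whereas classic bottom-k MinHash would re-select the k smallest after concatenating. A toy comparison:

```python
def merge_scaled(a, b):
    # Scaled sketches share one fixed threshold: union, nothing discarded.
    return a | b

def merge_bottom_k(a, b, k):
    # Classic bottom-k MinHash: concatenate, then keep only the k smallest.
    return set(sorted(a | b)[:k])

a = {2, 7, 11}
b = {3, 7, 20}
scaled_merged = merge_scaled(a, b)            # every retained hash survives
bottom3_merged = merge_bottom_k(a, b, 3)      # only the 3 smallest survive
```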

phiweger commented 7 years ago

thanks a lot