sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

add docs on FracMinHash downsampling #1799

Open ctb opened 2 years ago

ctb commented 2 years ago

@drtamermansour asked some questions on Slack about how FracMinHash signatures with different scaled values are handled in practice. I took a look in the docs and couldn't find anything that explained it clearly. We should add that somewhere.

(On the plus side, it's pretty well tested, I think?)

Off the top of my head,

This was all actually written up internally in the code base - see https://github.com/sourmash-bio/sourmash/issues/407 and PR https://github.com/sourmash-bio/sourmash/pull/1420 - but the details didn't make it into the docs. Oops!

drtamermansour commented 2 years ago

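In the current implementation, when the query and a subject signature have different scaled values, sourmash rescales the DB but not the sample.

I tried:

• sample scale=500 & DB scale=1000 ==> runtime error (ValueError: new scaled 500 is lower than current sample scaled 1000)
• sample scale=2000 & DB scale=1000 ==> works fine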

ctb commented 2 years ago

Yep, SBTs work that way.

(but that will depend to some extent on the database type in question - there are several kinds, including SBTs, LCAs, and collections of signatures)

On Jan 20, 2022, at 2:08 PM, Tamer Mansour wrote:

In the current implementation, when the query and a subject signature have different scaled values, sourmash rescales the DB but not the sample.

I tried:

• sample scale=500 & DB scale=1000 ==> runtime error (ValueError: new scaled 500 is lower than current sample scaled 1000)
• sample scale=2000 & DB scale=1000 ==> works fine
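For the docs, the two cases reported above could be illustrated with a small pure-Python model. This is only a sketch under stated assumptions, not sourmash's actual code: `ToySketch`, `hash_kmer`, and `similarity_db_rescaled` are hypothetical names, the threshold `2**64 // scaled` is an assumed rounding, and BLAKE2b stands in for whatever hash function sourmash uses. The idea it shows is that a FracMinHash sketch keeps every hash below a threshold set by `scaled`, so moving to a larger `scaled` only requires discarding hashes, while moving to a smaller `scaled` would need hashes that were never stored.

```python
import hashlib

MAX_HASH = 2**64  # size of a 64-bit hash space (assumed for this sketch)


def hash_kmer(kmer):
    """Hypothetical stand-in hash: map a k-mer to a 64-bit integer."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class ToySketch:
    """Toy FracMinHash sketch: keeps every hash below MAX_HASH // scaled."""

    def __init__(self, scaled, hashes=()):
        if scaled < 1:
            raise ValueError("scaled must be >= 1")
        self.scaled = scaled
        self.threshold = MAX_HASH // scaled
        self.hashes = {h for h in hashes if h < self.threshold}

    def add_sequence(self, seq, ksize):
        """Hash every k-mer in seq and keep those below the threshold."""
        for i in range(len(seq) - ksize + 1):
            h = hash_kmer(seq[i:i + ksize])
            if h < self.threshold:
                self.hashes.add(h)

    def downsample(self, scaled):
        """Return a new sketch at a higher (or equal) scaled value."""
        if scaled < self.scaled:
            # The discarded hashes cannot be recovered, so refuse.
            raise ValueError(
                f"new scaled {scaled} is lower than "
                f"current sample scaled {self.scaled}"
            )
        return ToySketch(scaled, self.hashes)


def similarity_db_rescaled(query, subject):
    """Jaccard similarity after rescaling only the subject (DB) sketch."""
    subject = subject.downsample(query.scaled)
    union = query.hashes | subject.hashes
    return len(query.hashes & subject.hashes) / len(union) if union else 0.0


# Hand-picked hash values on either side of the scaled=2000 threshold.
t1000, t2000 = MAX_HASH // 1000, MAX_HASH // 2000
db = ToySketch(1000, [1, t2000 - 1, t2000 + 1, t1000 - 1])

# Query scaled=2000, DB scaled=1000: the DB is downsampled and this works.
query_2000 = ToySketch(2000, [1, t2000 - 1])
print(similarity_db_rescaled(query_2000, db))

# Query scaled=500, DB scaled=1000: the DB cannot be "upsampled" to 500.
query_500 = ToySketch(500, [1])
try:
    similarity_db_rescaled(query_500, db)
except ValueError as exc:
    print("ValueError:", exc)
```

In this model, the fix for the failing case is to downsample the query to the DB's scaled value before searching. Whether a given sourmash database type does that automatically is exactly what the docs should spell out.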