sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
466 stars 79 forks source link

Unable to Perform Subtract Operation on Signatures with Different Scales #2683

Open shokrof opened 1 year ago

shokrof commented 1 year ago

Hey, I generated Sourmash signatures (scale 1000) for WGS samples of humans and utilized kSpider for clustering based on these signatures. However, I discovered that the resulting clusters exhibited a bias towards sex. To mitigate this bias, I attempted to remove the hashes unique to chrY by creating signatures of chrY with a scale 1. Unfortunately, I encountered this error when calling the subtract function between signatures of different scales. Strangely enough, the subtract operation works when both signatures have the same scale (1000).

incompatible minhashes; specify -k and/or molecule type.

Questions:

Why is it not possible to subtract signatures with different scales? Is there a workaround available to address this issue? How effective will filtering chrY hashes be using a scaled 1000 chrY signature?

Please provide any guidance or solutions you may have. regards, Moustafa

ctb commented 1 year ago

Some hot takes -

Why is it not possible to subtract signatures with different scales?

As with most (all?) scaled sketch operations, the operations can only be done in a sensible way on sketches with the same scaled value. This is intrinsic to FracMinHash sketches - consider a situation where there is a hash that would be in sketch A at scaled=500 but isn't in it at scaled=1000; if that hash is present in sketch B at scaled=500 and you try to subtract B from A, the result would not be well defined. (It would either "fail" silently", or fail loudly!)

Or, to put it another way, if you try to subtract hashes at a lower scaled value from a sketch, sourmash does not know if they're not present because the k-mers truly aren't present in the original data set, or if they simply weren't included in the higher scaled value sketch.

Is there a workaround available to address this issue?

The proximal fix is to support downsampling in sig subtract with a --downsample argument. I don't know if it should do it automatically tho, and in any case this would be a major version update.

The workaround for now is to use sourmash sig downsample to make all the sketches have the same scaled value.

How effective will filtering chrY hashes be using a scaled 1000 chrY signature?

"Perfectly" effective as long as you are filtering them from a scaled=1000 sketch ;).

mr-eyes commented 1 year ago

Fits the context here:) https://github.com/mr-eyes/sourmash_plugin_subtract_and_inflate