sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
466 stars 79 forks source link

Group-specific hashes from sourmash signature #2495

Closed trickovicmatija closed 1 year ago

trickovicmatija commented 1 year ago

Hello, My idea is to find group-specific hashes from a set of MAG signatures and create a "panhashome" - one signature representing the whole group. If I have two groups, each output signature would contain "core" hashes (found in almost every MAG) and group-specific hashes. Afterward, I would use sourmash gather to quantify those resulting signatures in metagenomic samples.

Do you think the idea makes sense? I've already written some scripts to parse and filter hash values, but when trying with mastiff, the output is empty (even though it doesn't complain about the format of the input signature, so I think I got that right).

Thanks, Matija

ctb commented 1 year ago

definitely makes sense and sounds interesting! panhashomics FTW 😄

One thing that could be happening is mismatched ksizes (mastiff uses k=21). A few controls I would suggest trying -

We do have a bunch of sourmash sig functions that might help construct the panhashome - intersect and merge in particular - and sourmash sig overlap might help confirm that there is overlap between your panhashome and your starting sketches.

trickovicmatija commented 1 year ago

Thanks for the fast reply! I found the problem (this has bothered me for some time now). The "max_hash" is not the maximal hash value in the corresponding signature, and I guess that that value is then used to calculate the "scaled" attribute of a signature. Even though my original signatures are calculated with scaling=1000, the panhashome was described with sourmash sig describe with scaled=1001. If I now understand correctly, the "max_hash" value is the maximal hash value that "murmur 64bit" hf can output with said scaling (the value you refer to as "H/s" in the gather paper). Should it be the same for all signatures generated with this hash function and the same scaling? When I made it that constant value for all my panhashomes (18446744073709552), both mastiff and gather worked!

Thanks for the advice about sourmash sig functions. I wanted a bit more customized - let's say keep hash values if they appear in >=80% of one group and <20% in all others. This is why I felt a need to write custom scripts.

Thanks again, Matija

ctb commented 1 year ago

whew, glad you figured it out!

max_hash should be H / scaled - see the code in https://github.com/sourmash-bio/sourmash/blob/latest/src/sourmash/minhash.py, esp _get_max_hash_for_scaled(scaled) and _get_scaled_for_max_hash:

def _get_max_hash_for_scaled(scaled):
    "Convert a 'scaled' value into a 'max_hash' value."
    if scaled == 0:
        return 0
    elif scaled == 1:
        return get_minhash_max_hash()

    return min(
        int(round(get_minhash_max_hash() / scaled, 0)),
        MINHASH_MAX_HASH
    )

def _get_scaled_for_max_hash(max_hash):
    "Convert a 'max_hash' value into a 'scaled' value."
    if max_hash == 0:
        return 0
    return min(
        int(round(get_minhash_max_hash() / max_hash, 0)),
        MINHASH_MAX_HASH
    )

not sure what happened here, but it sounds to me like a rounding error snuck in somewhere, so if you ever get the urge to track it down I'd be very interested in making sure it doesn't happen again!

in re

Thanks for the advice about sourmash sig functions. I wanted a bit more customized - let's say keep hash values if they appear in >=80% of one group and <20% in all others. This is why I felt a need to write custom scripts.

yes, absolutely, and it's great that you could write custom scripts to do it!

Side note, we are slowly working towards getting a plugin architecture in place that would let you write this as a reusable external script/plugin, e.g. see https://github.com/sourmash-bio/sourmash/issues/2383#issuecomment-1374043175. Just as an FYI ;).

trickovicmatija commented 1 year ago

Thank you very much for the confirmation! I will check it out! I fill definitely check the plugin architecture out! I noticed it on your blog but hadn't time to investigate it.

Thanks again for your help!