sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
477 stars 79 forks source link

safer ksize selection in `signature::Select` #3026

Closed bluegenes closed 9 months ago

bluegenes commented 9 months ago

Protein k-mer sizes are k=k*3 internally, but k in a manifest.

Insignature::Select, we account for this discrepancy like this:

valid = if let Some(ksize) = selection.ksize() {
    let k = s.ksize() as u32;
    k == ksize || k == ksize * 3

So here we match exact ksize or k=k*3, regardless of ksize or molecule type. This is b/c we don't have access to is_protein or other relevant properties unless we load the minhash.

This is fine for most common uses, b/c

  1. If using a manifest, manifest selection works on k, so we should only be using signature::select with signatures of the right ksize already
  2. Even if not using manifest, we rarely use k-mer sizes that are 3x other ksizes we use.

But we could be safer about it anyway.

Just after the ksize check, we load the minhash to check scaled and downsample if needed. We could combine these checks and use the minhash to determine molecule type (via matching on encodings::HashFunctions, i think)

ctb commented 9 months ago

First let me say that while I think this will have minor real-world impact, I can virtually guarantee it will be a source of at least one nasty time-consuming bug in the next year :). So I really don't like this kind of fuzziness! I appreciate the creation of this issue :).

So here we match exact ksize or k=k*3, regardless of ksize or molecule type. This is b/c we don't have access to is_protein or other relevant properties unless we load the minhash.

The molecule type is available in the manifest; can't we use that? Or is the problem that 'select' doesn't use the manifest/there may not be a manifest?

bluegenes commented 9 months ago

First let me say that while I think this will have minor real-world impact, I can virtually guarantee it will be a source of at least one nasty time-consuming bug in the next year :). So I really don't like this kind of fuzziness! I appreciate the creation of this issue :).

agreed! #3028 🙂

The molecule type is available in the manifest; can't we use that? Or is the problem that 'select' doesn't use the manifest/there may not be a manifest?

Yes - in manifest::select, which is used on collections, we select on moltype. subset = collection.select(&selection) produces a pared-down collection with matching params (and scaled values that can be downsampled). After this (or without this, for non-collection sigs) we do select at the signature level.


Ok, this was actually simpler than I was thinking! Sigs have hash_function, which in the JSON signature is always 0.murmur64. However, mh.hash_function is generated using the molecule type in the JSON, so it contains this info and we can match on it.