Closed bluegenes closed 9 months ago
First let me say that while I think this will have minor real-world impact, I can virtually guarantee it will be a source of at least one nasty time-consuming bug in the next year :). So I really don't like this kind of fuzziness! I appreciate the creation of this issue :).
So here we match exact ksize or k=k*3, regardless of ksize or molecule type. This is b/c we don't have access to
is_protein
or other relevant properties unless we load the minhash.
The molecule type is available in the manifest; can't we use that? Or is the problem that 'select' doesn't use the manifest/there may not be a manifest?
First let me say that while I think this will have minor real-world impact, I can virtually guarantee it will be a source of at least one nasty time-consuming bug in the next year :). So I really don't like this kind of fuzziness! I appreciate the creation of this issue :).
agreed! #3028 🙂
The molecule type is available in the manifest; can't we use that? Or is the problem that 'select' doesn't use the manifest/there may not be a manifest?
Yes - in manifest::select
, which is used on collections, we select on moltype.
subset = collection.select(&selection)
produces a pared-down collection with matching params (and scaled values that can be downsampled). After this (or without this, for non-collection sigs) we do select
at the signature level.
Ok, this was actually simpler than I was thinking! Sigs have hash_function
, which in the JSON signature is always 0.murmur64
. However, mh.hash_function
is generated using the molecule type in the JSON, so it contains this info and we can match on it.
Protein k-mer sizes are k=k*3 internally, but k in a manifest.
In
signature::Select
, we account for this discrepancy like this:So here we match exact ksize or k=k*3, regardless of ksize or molecule type. This is b/c we don't have access to
is_protein
or other relevant properties unless we load the minhash.This is fine for most common uses, b/c
signature::select
with signatures of the rightksize
alreadyBut we could be safer about it anyway.
Just after the
ksize
check, we load the minhash to check scaled and downsample if needed. We could combine these checks and use the minhash to determine molecule type (via matching onencodings::HashFunctions
, i think)