Open phiweger opened 3 years ago
Ah! The ksize in the signature is three times the ksize I specified. What I don't understand then (from having read the v4 migration guide and the sketch documentation): I thought the protein-hashing commands would hash, well, proteins, not nucleotide kmers. I'm sure I'm missing something here, so thank you for your help!
Hm. And sourmash index -k 7 ... does load the corresponding signature. I'm confused.
@phiweger If you run sourmash sig describe on this signature, do you see k=7?
If I recall correctly, the decision was made to enable amino-acid sizes for all command-line and Python interfaces, but to keep the k=k*3 representation of kmer size within the signature files themselves, in order to maintain compatibility with existing signatures.
also, sketch translate hashes 21-mers of DNA into 7-mers of aa. It looks like this is confusing no matter what, so we opted for describing the output rather than the input, and we will transition to having k=7 in the JSON file ...soon :).
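To make the 21-to-7 arithmetic concrete, here is a minimal pure-Python sketch. It is not sourmash code: the helper name `translate` is hypothetical, only a single forward reading frame is translated, and no hashing is done. It only shows that a 21-nucleotide k-mer corresponds to a 7-residue peptide under the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA in one forward frame (hypothetical helper, not sourmash's)."""
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


dna_kmer = "ATGGCCATTGTAATGGGCCGC"  # an arbitrary 21-mer chosen for illustration
peptide = translate(dna_kmer)

# Three nucleotides per amino acid, so the lengths differ by a factor of 3.
print(len(dna_kmer), "nt ->", len(peptide), "aa:", peptide)
```

Each amino acid is encoded by one codon of three nucleotides, which is where the factor of 3 between the DNA ksize and the protein ksize comes from.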
will link to relevant issues in a bit, just for posterity!
yes @bluegenes, sourmash sig describe says k=7 :) thx all for the explanations!
A few historical notes -

- sourmash sketch dna does exactly what you expect, ksize=ksize.
- sourmash sketch protein does exactly what you want, visible ksize=ksize; it's just the internal ksize storage that's wonky for the moment, because we didn't want to update the signature format yet! See https://github.com/dib-lab/sourmash/pull/1277 for rationale and more links. Note that in sourmash < 4, we confusingly always divided protein ksizes by 3, so you'd get nonintuitive output from sig describe and in the JSON file and... - this was the motivating concern for the change, b/c @bluegenes started working more with protein k-mers and was wondering why she had to set ksize=30 to get aa ksize=10 😆
- sourmash sketch translate has no obvious behavior options: either you specify the DNA ksize and then the output signature has that ksize but is actually working with ksize/3 amino acids, OR you specify the protein ksize and then the input ksize is effectively 3x that. Since we wanted sketch translate signatures to be compatible with sketch protein signatures, it seemed easiest to have translate produce signatures with the "correct" protein ksize.
- compute worked the way it did before (protein ksize*3) because I implemented DNA first, then translate, and then protein, and so it wasn't clear how borked the protein ksize decision was until too late!

These are some good FAQ entries so I'll put them there, and I'll keep this issue open 'til we update the docs appropriately!
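The ksize bookkeeping described in these notes can be summarized in a small sketch. This is plain Python written for illustration, not sourmash's implementation: the function names are hypothetical, and the pre-v4 branch assumes the requested value is a multiple of 3.

```python
def amino_acid_ksize(requested_ksize, major_version):
    """Amino-acid k-mer length implied by a protein ksize request.

    Hypothetical helper mirroring the behavior described in the thread:
    before v4 the requested value was divided by 3; from v4 on it is used as-is.
    """
    if major_version < 4:
        # Assumption: requested_ksize is a multiple of 3.
        return requested_ksize // 3
    return requested_ksize


def ksize_in_signature_file(requested_ksize, major_version):
    """ksize as stored in the signature file, which keeps the k*3 convention."""
    return 3 * amino_acid_ksize(requested_ksize, major_version)


# Asking for 10-mers of amino acids in each scheme.
for major_version, requested in [(3, 30), (4, 10)]:
    aa = amino_acid_ksize(requested, major_version)
    stored = ksize_in_signature_file(requested, major_version)
    print(f"v{major_version}: requested={requested} aa_ksize={aa} stored={stored}")
```

Under these assumptions both schemes end up with the same stored value for the same amino-acid k-mer length, which is the compatibility argument for leaving the file format alone for now.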
I want to sketch a genome to get 7-mers of amino acids (i.e. peptides), so I give the new CLI a spin:
However, looking into the resulting signature, I suspect that the params are not applied: