sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

"sourmash sketch translate -p k=7,dayhoff" does not respect k in sourmash v4 #1383

Open phiweger opened 3 years ago

phiweger commented 3 years ago

I want to sketch a genome to get 7-mers of amino-acids (ie peptides), so I give the new CLI a spin:

sourmash sketch translate -p k=7,k=10,scaled=1000,dayhoff genome.fasta

However, looking into the resulting signature, I suspect that the params are not applied:

... "signatures":[{"num":0,"ksize":21,"seed":42,"max_hash" ...

phiweger commented 3 years ago

Ah! The ksize in the signature is three times the ksize I specified (21 = 3 × 7). What I still don't understand, having read the v4 migration guide and the sketch documentation: I thought the protein-hashing commands would hash, well, proteins, not nucleotide k-mers. I'm sure I'm missing something here, so thank you for your help!

phiweger commented 3 years ago

Hm. And sourmash index -k 7 ... does load the corresponding signature. I'm confused.

bluegenes commented 3 years ago

@phiweger If you run sourmash sig describe on this signature, do you see k=7?

If I recall correctly, the decision was made to use amino-acid k-mer sizes in all command-line and Python interfaces, but to keep the k=k*3 representation of k-mer size within the signature files themselves, in order to maintain compatibility with existing signatures.
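To make the k=k*3 convention concrete, here is a minimal sketch of the conversion between the ksize stored in the JSON and the k shown to users. The helper `displayed_ksize` and the trimmed record are hypothetical illustrations built from the fields quoted earlier in this thread; this is not sourmash's own code.

```python
import json


def displayed_ksize(stored_ksize: int, is_protein: bool) -> int:
    """Convert the ksize stored in a signature file to the user-facing k.

    Hypothetical helper: assumes protein-like sketches (protein, dayhoff, hp)
    store k*3 in the file, as described in this thread.
    """
    if not is_protein:
        return stored_ksize
    if stored_ksize % 3 != 0:
        raise ValueError(f"stored ksize {stored_ksize} is not a multiple of 3")
    return stored_ksize // 3


# Trimmed, hypothetical record mirroring the fields quoted in the issue.
record = json.loads('{"signatures": [{"num": 0, "ksize": 21, "seed": 42}]}')
stored = record["signatures"][0]["ksize"]

print(displayed_ksize(stored, is_protein=True))  # 7
```

So a file showing "ksize":21 for a dayhoff sketch corresponds to -p k=7 on the command line.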

ctb commented 3 years ago

Also, sketch translate hashes 21-mers of DNA into 7-mers of aa. It looks like this is confusing no matter what, so we opted for describing the output rather than the input, and we will transition to having k=7 in the JSON file... soon :).

will link to relevant issues in a bit, just for posterity!
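For readers who want to see why 21 nucleotides correspond to 7 amino acids, here is a small sketch that translates one 21-mer in a single reading frame and then applies the six-group Dayhoff classification. It is only an illustration under stated assumptions: it uses the standard genetic code, the group labels a to f are chosen here for illustration, and it skips the hashing and multi-frame handling that a real sketching tool would do. It is not sourmash's implementation.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Dayhoff six-group classification; the letters a-f are illustrative labels.
DAYHOFF_GROUPS = {
    "a": "C",
    "b": "AGPST",
    "c": "DENQ",
    "d": "HKR",
    "e": "ILMV",
    "f": "FWY",
}
DAYHOFF = {aa: label for label, group in DAYHOFF_GROUPS.items() for aa in group}


def translate(dna: str) -> str:
    """Translate DNA in reading frame 0; length must be a multiple of 3."""
    if len(dna) % 3 != 0:
        raise ValueError("DNA length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def to_dayhoff(protein: str) -> str:
    """Map each amino acid to its Dayhoff group; unknown symbols pass through."""
    return "".join(DAYHOFF.get(aa, aa) for aa in protein)


dna_21mer = "ATGGCTAAACTGTTTGATTGG"  # 21 nucleotides = 7 codons
peptide = translate(dna_21mer)
print(peptide, len(peptide))   # MAKLFDW 7
print(to_dayhoff(peptide))     # ebdefcf
```

The input k-mer is 21 bases long and the output is 7 residues long, which is the "describing the output rather than the input" choice mentioned above.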

phiweger commented 3 years ago

yes @bluegenes sourmash sig describe says k=7 :) thx all for the explanations!

ctb commented 3 years ago

A few historical notes -

These are some good FAQ entries so I'll put them there, and I'll keep this issue open 'til we update the docs appropriately!