sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

add translation support to `sourmash sketch fromfile` #1912

Open ctb opened 2 years ago

ctb commented 2 years ago

Right now, the fromfile format has no simple way to produce translated sequence. Presumably we'd need to add a CDS column or something, or else build workflows (elsewhere) to do prodigal-style coding sequence extraction. That would only work for bacteria and archaea, though, so a CDS column might still be necessary.

See @bluegenes's comment too.

bluegenes commented 2 years ago

I find myself using fromfile for everything these days, because it makes naming sketches properly so easy!!

So I would like us to support translate if we can -- perhaps as an additional param, e.g. `-p k=10,k=7,scaled=200,protein,translate`? Note that it needs both `protein` and `translate`, because we could alternatively translate into dayhoff.

I see your point about eukaryotes -- I would be happy to use a cds_filename column for this functionality.
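To make the column idea concrete, here is a rough sketch of how a fromfile-style CSV with an extra `cds_filename` column might be read. The `cds_filename` column is only the proposal from this thread and does not exist in fromfile today; the `plan_sketches` helper, the file paths, the mode labels, and the priority order (protein file first, then CDS, then 6-frame from the genome) are all assumptions made for illustration.

```python
import csv
import io

# name, genome_filename and protein_filename follow the existing fromfile layout;
# cds_filename is the HYPOTHETICAL column proposed in this thread.
EXAMPLE_CSV = """\
name,genome_filename,protein_filename,cds_filename
MAG_001 example bin,mags/MAG_001.fa,,mags/MAG_001.cds.fa
MAG_002 example bin,mags/MAG_002.fa,mags/MAG_002.faa,
"""


def plan_sketches(csv_text):
    """Return (name, input_file, mode) tuples for protein-space sketches.

    The priority order below is an assumption, not sourmash behavior.
    """
    plan = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row.get("protein_filename"):
            # Protein sequence already available: sketch it directly.
            plan.append((row["name"], row["protein_filename"], "protein"))
        elif row.get("cds_filename"):
            # Coding sequences supplied: translate them in frame.
            plan.append((row["name"], row["cds_filename"], "translate-cds"))
        elif row.get("genome_filename"):
            # Only a genome: fall back to 6-frame translation.
            plan.append((row["name"], row["genome_filename"], "translate-6frame"))
    return plan


for entry in plan_sketches(EXAMPLE_CSV):
    print(entry)
```

With this layout, eukaryotic genomes could supply externally predicted CDS files instead of relying on prodigal.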

Current use case: a bunch of MAGs. Yes, I'll run prodigal-style translation separately, but for reasons I also want to build some 6-frame signatures.
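For anyone unfamiliar with the term, "6-frame" here means translating the DNA in three reading frames on each strand. A minimal pure-Python sketch of that idea follows; it assumes the standard genetic code, and the function names are made up for illustration. It is not sourmash's own implementation.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse-complement an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate_frame(dna, offset):
    """Translate one reading frame; codons not in the table become 'X'."""
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translate(dna):
    """Return six translations: frames 0-2 forward, then frames 0-2 reverse."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    return [translate_frame(strand, off) for strand in (dna, rc) for off in range(3)]


print(six_frame_translate("ATGGCC"))
```

Each of the six protein strings would then be broken into k-mers for sketching, which is why 6-frame signatures don't need a gene caller at all.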