sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

summary: further improvements to protein handling in sourmash #1525

Open ctb opened 3 years ago

ctb commented 3 years ago

This is an update of and replacement for https://github.com/dib-lab/sourmash/issues/999, which raised a lot of issues around how we were doing protein k-mer calculations.

This issue is being updated after the release of sourmash 4.1.


Over the past year, several of the issues in #999 were resolved by the release of sourmash v4, which introduced sourmash sketch (via https://github.com/dib-lab/sourmash/pull/1159).

Taking from @bluegenes' excellent summary, here are the remaining unresolved issues from #999.

I do think differentiating sketches by hash function (#751) is its own separate issue and not specifically protein-related.


Notes and thoughts:

It would be nice to figure out if #1037, which checks the first 100bp of FASTA files, is a good approach. Thoughts from https://github.com/dib-lab/sourmash/issues/999#issuecomment-647142392 that seem relevant -

https://github.com/dib-lab/sourmash/pull/1277 changed the Python layer so that ksize for protein was "correct" (the actual length of the word, not k*3!). This still needs to be changed at the Rust layer, though, which would involve changing the JSON signature formats and version.
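To make the ksize mismatch concrete, here is a minimal sketch in plain Python. It assumes, based only on the comment above, that the user-facing protein ksize is the word length in amino acids while the stored value is that length times 3. The function names are hypothetical and are not part of the sourmash API.

```python
# Illustrative only: these helpers are hypothetical, not sourmash functions.
# Assumption: the lower layer records protein ksize as k*3 (nucleotide span),
# while the Python layer exposes k (amino acid word length).

def user_to_stored_ksize(protein_ksize: int) -> int:
    """Convert an amino acid word length to the k*3 value assumed to be stored."""
    if protein_ksize <= 0:
        raise ValueError("ksize must be positive")
    return protein_ksize * 3


def stored_to_user_ksize(stored_ksize: int) -> int:
    """Convert a stored k*3 value back to the amino acid word length."""
    if stored_ksize <= 0 or stored_ksize % 3 != 0:
        raise ValueError("stored protein ksize must be a positive multiple of 3")
    return stored_ksize // 3


def protein_kmers(sequence: str, ksize: int) -> list[str]:
    """Return every overlapping amino acid word of length ksize."""
    return [sequence[i:i + ksize] for i in range(len(sequence) - ksize + 1)]


if __name__ == "__main__":
    k = 7
    print(user_to_stored_ksize(k))        # value assumed to appear in the stored signature
    print(protein_kmers("MKVLAAGIV", k))  # words are k residues long, not k*3
```

A round trip through both helpers should return the original ksize, which is the invariant any format or version change would need to preserve for existing signatures.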

Also see "Next steps for sourmash sketch" https://github.com/dib-lab/sourmash/issues/1169.

ctb commented 3 years ago

Here are some notes I took while working through #1159, copied over from https://github.com/dib-lab/sourmash/issues/999#issuecomment-674157078

Changes to signature computation

Suggested changes to signature computation:

Signature JSON format changes

What signature format changes should we make in tandem? See https://github.com/dib-lab/sourmash/issues/268 for the rollup issue.

Changes to MinHash API

The general issue is https://github.com/dib-lab/sourmash/issues/338. There is more in https://github.com/dib-lab/sourmash/issues/720 and in https://github.com/dib-lab/sourmash/issues/885, although the latter is mostly about docs and tests now.