Open ctb opened 3 years ago
here are some notes I took while working through #1159 - copied over from https://github.com/dib-lab/sourmash/issues/999#issuecomment-674157078
Suggested changes to signature computation:
What signature format changes should we do in tandem? see https://github.com/dib-lab/sourmash/issues/268 for rollup issue
general issue here, https://github.com/dib-lab/sourmash/issues/338 more here, https://github.com/dib-lab/sourmash/issues/720 and more here, https://github.com/dib-lab/sourmash/issues/885, although that is mostly about docs and tests now.
add_sequence(<str>, moltype=...)
moltype should be enum
moltype could be assumed to be DNA for DNA minhashes, aa|prot|protein for protein minhashes.
dna|rna_top|rna_bottom for protein minhashes to translate incoming?
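To make the enum suggestion concrete, here is a minimal sketch of how string aliases could be normalized to an enum before a sequence is added. The names Moltype and parse_moltype are hypothetical and not part of sourmash; the rna_top/rna_bottom options are left out because their semantics are still an open question above.

```python
from enum import Enum


class Moltype(Enum):
    """Hypothetical enum; the member names are illustrative only."""
    DNA = "dna"
    PROTEIN = "protein"


# Aliases from the notes above: aa|prot|protein all mean protein input.
_ALIASES = {
    "dna": Moltype.DNA,
    "aa": Moltype.PROTEIN,
    "prot": Moltype.PROTEIN,
    "protein": Moltype.PROTEIN,
}


def parse_moltype(value, default=Moltype.DNA):
    """Normalize a string, a Moltype, or None to a Moltype.

    `default` stands in for "assume the moltype of the minhash itself",
    e.g. DNA for a DNA minhash.
    """
    if value is None:
        return default
    if isinstance(value, Moltype):
        return value
    try:
        return _ALIASES[value.lower()]
    except KeyError:
        raise ValueError(f"unknown moltype: {value!r}") from None
```

A caller could then write parse_moltype("prot") or parse_moltype(None, default=Moltype.PROTEIN), and any typo in the string fails loudly instead of silently falling back to DNA.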
This is an update of and replacement for https://github.com/dib-lab/sourmash/issues/999, which raised a lot of issues around how we were doing protein k-mer calculations.
This issue is being updated after the release of sourmash 4.1.
Over the past year, several of the issues in #999 were resolved by the release of sourmash v4, which introduced sourmash sketch (via https://github.com/dib-lab/sourmash/pull/1159).
Taking from @bluegenes' excellent summary, here are the remaining unresolved issues from #999.
k=33 should be comparable with protein k=11 (equivalent ksizes). Differentiating by hash functions per #751 could facilitate this. In #574, @luizirber pointed out that #751 is already enabled in the Rust code?
add_sequence and add_protein? From #720: "in the provided example in #701, I'm not sure why add_sequence doesn't complain when it's given a bunch of non-ACTG characters."
I do think differentiating sketches by hash functions #751 is its whole own thing and not specifically protein-esque.
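As an illustration of the two points above (ksize equivalence and input validation), here is a small self-contained sketch. It is not sourmash code: check_dna and translate are hypothetical helpers, translation covers only the forward frame 0, and the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def check_dna(seq):
    """Raise ValueError if seq contains anything other than A, C, G, T."""
    bad = set(seq.upper()) - set("ACGT")
    if bad:
        raise ValueError(f"non-ACGT characters in DNA sequence: {sorted(bad)}")


def translate(dna):
    """Translate DNA in frame 0; trailing bases short of a codon are dropped."""
    check_dna(dna)
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


# A DNA 33-mer translates to a protein 11-mer, hence "equivalent ksizes".
dna_kmer = "ATG" * 11
protein_kmer = translate(dna_kmer)
assert len(dna_kmer) == 33 and len(protein_kmer) == 11
```

With a check like check_dna in place, the #701 example would fail at the call site with a clear error rather than passing non-ACGT characters through unnoticed.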
Notes and thoughts:
It would be nice to figure out if #1037, which checks the first 100bp of FASTA files, is a good approach. Thoughts from https://github.com/dib-lab/sourmash/issues/999#issuecomment-647142392 that seem relevant -
add_dna_sequence and add_protein_sequence at the API level;
https://github.com/dib-lab/sourmash/pull/1277 changed the Python layer so that ksize for protein was "correct" (the actual length of the word, not k*3!). This still needs to be changed at the Rust layer, though, which would involve changing the JSON signature formats and version.
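For the #1037 question above, a rough sketch of what a prefix-based check might look like, assuming the goal is to distinguish DNA from protein input. This is not the code from #1037: guess_moltype, the 0.9 threshold, and the decision to count N as DNA-like are all assumptions made for illustration.

```python
def guess_moltype(sequence, n=100, threshold=0.9):
    """Guess 'DNA' or 'protein' from the first n characters of a sequence.

    Hypothetical heuristic: call it DNA if at least `threshold` of the
    inspected characters are A, C, G, T or N.
    """
    prefix = sequence[:n].upper()
    if not prefix:
        raise ValueError("cannot guess moltype of an empty sequence")
    dna_like = sum(ch in "ACGTN" for ch in prefix)
    return "DNA" if dna_like / len(prefix) >= threshold else "protein"
```

One weakness worth keeping in mind: A, C, G, T and N are also valid amino acid codes, so a short or low-complexity protein prefix can be misclassified. That is part of the argument for explicit add_dna_sequence and add_protein_sequence entry points instead of guessing.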
Also see "Next steps for sourmash sketch" https://github.com/dib-lab/sourmash/issues/1169.