summary: further improvements to protein handling in sourmash

sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.

This is an update of and replacement for https://github.com/dib-lab/sourmash/issues/999, which raised a lot of issues around how we were doing protein k-mer calculations.

This issue is being updated after the release of sourmash 4.1.

Over the past year, several of the issues in #999 were resolved by the release of sourmash v4, which introduced sourmash sketch (via https://github.com/dib-lab/sourmash/pull/1159).

Taken from @bluegenes's excellent summary, here are the remaining unresolved issues from #999.

- Translated k=33 should be comparable with protein k=11 (equivalent ksizes). Differentiating by hash functions per #751 could facilitate this. In #574, @luizirber pointed out that #751 may already be enabled in the Rust code?
- Potential translation improvements:
  - translate whole sequence, then emit k-mers (#664)
  - pad translated seqs (#659)
  - potentially add 3-frame translation (#657)
- Potentially enhance input checks for add_sequence and add_protein? From #720: "in the provided example in #701, I'm not sure why add_sequence doesn't complain when it's given a bunch of non-ACTG characters."
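
To make the k=33 versus k=11 equivalence and the "translate whole sequence, then emit k-mers" idea concrete, here is a minimal pure-Python sketch. It is not sourmash code: the helper names are hypothetical, only the three forward frames are handled (no reverse complement), the standard genetic code is assumed, and unknown codons are translated to "X" as an arbitrary choice.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (last base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate codon by codon, dropping any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def kmers(seq, k):
    """All overlapping k-mers of seq (empty list if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def translate_then_kmerize(dna, prot_k):
    """Translate each forward frame in full, then emit protein k-mers."""
    out = set()
    for frame in range(3):
        out.update(kmers(translate(dna[frame:]), prot_k))
    return out


def kmerize_then_translate(dna, prot_k):
    """Emit DNA k-mers of length 3 * prot_k, then translate each one."""
    return {translate(kmer) for kmer in kmers(dna, prot_k * 3)}


dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAGTTGACCTAGGCATCG"
# DNA k=33 and protein k=11 describe the same set of amino-acid words here.
assert translate_then_kmerize(dna, 11) == kmerize_then_translate(dna, 11)
```

In this toy setting both orders yield the same set of protein words, so the choice between them is mostly about where padding and frame handling would live.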

I do think differentiating sketches by hash functions #751 is its whole own thing and not specifically protein-esque.

Notes and thoughts:

It would be nice to figure out if #1037, which checks the first 100bp of FASTA files, is a good approach. Thoughts from https://github.com/dib-lab/sourmash/issues/999#issuecomment-647142392 that seem relevant:

I think we do need both command line and API level checking. The command line can make use of additional info (filename, aggregated across sequences, etc) while the API has to do the trickier job of working with only the sequence it's given.
I am leaning towards add_dna_sequence and add_protein_sequence at the API level;
it's not 100% clear how robust it will be to check that any given k-mer is DNA vs prot;
one strategy might be to look at what fraction of k-mers are valid alphabet.
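
The "fraction of k-mers that are valid alphabet" strategy could look roughly like the following. This is only a sketch: the function name and the alphabets are assumptions (the protein alphabet here is just the 20 standard amino acids, with no ambiguity codes or stop characters), and no threshold is proposed.

```python
DNA_ALPHABET = set("ACGT")
# The 20 standard amino acids; ambiguity codes and '*' are deliberately left out.
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")


def fraction_valid_kmers(seq, k, alphabet):
    """Fraction of overlapping k-mers whose characters all belong to alphabet."""
    seq = seq.upper()
    n_kmers = len(seq) - k + 1
    if n_kmers <= 0:
        return 0.0
    valid = sum(1 for i in range(n_kmers) if set(seq[i:i + k]) <= alphabet)
    return valid / n_kmers


# A single N invalidates every k-mer that overlaps it.
print(fraction_valid_kmers("ACGTNACGT", 3, DNA_ALPHABET))
# A, C, G and T are also amino-acid codes, so pure DNA passes a protein check.
print(fraction_valid_kmers("ACGTACGT", 4, PROTEIN_ALPHABET))
```

The second call illustrates why this is not fully robust: any DNA sequence is also a valid protein string, so a fraction-based check can flag protein passed as DNA much more reliably than the reverse.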

https://github.com/dib-lab/sourmash/pull/1277 changed the Python layer so that ksize for protein was "correct" (the actual length of the word, not k*3!). This still needs to be changed at the Rust layer, though, which would involve changing the JSON signature formats and version.
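
As an illustration of the mismatch, here is a sketch of the conversion a Python layer would need if the stored ksize for a protein sketch is three times the word length, as the "not k*3" remark implies. The function names are hypothetical and are not taken from sourmash.

```python
def word_length_from_stored(stored_ksize, is_protein):
    """Convert a stored ksize to the actual word length (assumes protein is stored as k*3)."""
    if not is_protein:
        return stored_ksize
    if stored_ksize % 3 != 0:
        raise ValueError(f"stored protein ksize {stored_ksize} is not a multiple of 3")
    return stored_ksize // 3


def stored_from_word_length(word_length, is_protein):
    """Inverse conversion: the value that would be written under the k*3 convention."""
    return word_length * 3 if is_protein else word_length


# Under this assumption a protein sketch with 11-residue words is stored with ksize 33.
assert stored_from_word_length(11, is_protein=True) == 33
assert word_length_from_stored(33, is_protein=True) == 11
```

Changing the lower layer to store the word length directly would remove this conversion, at the cost of the format and version change mentioned above.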

Also see "Next steps for sourmash sketch" https://github.com/dib-lab/sourmash/issues/1169.

Signature JSON format changes

What signature format changes should we do in tandem? See https://github.com/dib-lab/sourmash/issues/268 for the rollup issue.

ksize change for protein https://github.com/dib-lab/sourmash/issues/574

move hash function into minhash section?

rename 'signatures' list to 'minhashes'?

support actual command-line concatenation of signatures? https://github.com/dib-lab/sourmash/issues/1093

could add bp and input file list per https://github.com/dib-lab/sourmash/issues/246, https://github.com/dib-lab/sourmash/issues/769

could also add information about preprocessing per https://github.com/dib-lab/sourmash/issues/269 on a per-command basis; OR, just add information (e.g. input type was genome, input type was reads, etc.) based on the command executed.

Changes to MinHash API

consider lifting things out to signature, too, per https://github.com/dib-lab/sourmash/issues/616

switch to using moltype enums in creation https://github.com/dib-lab/sourmash/issues/1136

add_sequence changes per https://github.com/dib-lab/sourmash/issues/186? add_sequence(<str>, moltype=...)

moltype should be enum
moltype could be assumed to be DNA for DNA minhashes, aa|prot|protein for protein minhashes.
could have it be dna|rna_top|rna_bottom for protein minhashes to translate incoming?
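
One way the add_sequence(<str>, moltype=...) idea could be prototyped is shown below. Everything here is hypothetical: the MolType enum, the alias table, and resolve_input_moltype are illustrative names, not the existing sourmash API, and the rna_top/rna_bottom variants are left out.

```python
from enum import Enum


class MolType(Enum):
    DNA = "dna"
    PROTEIN = "protein"


# Accepted spellings, following the dna / aa|prot|protein suggestion above.
_ALIASES = {
    "dna": MolType.DNA,
    "aa": MolType.PROTEIN,
    "prot": MolType.PROTEIN,
    "protein": MolType.PROTEIN,
}


def parse_moltype(value):
    """Accept either a MolType or one of its string aliases."""
    if isinstance(value, MolType):
        return value
    key = str(value).lower()
    if key not in _ALIASES:
        raise ValueError(f"unknown moltype: {value!r}")
    return _ALIASES[key]


def resolve_input_moltype(sketch_moltype, moltype=None):
    """Decide how incoming sequence should be interpreted for a given sketch.

    With no explicit moltype, the input is assumed to match the sketch.
    DNA input to a protein sketch is allowed (it would be translated);
    protein input to a DNA sketch is rejected.
    """
    if moltype is None:
        return sketch_moltype
    input_moltype = parse_moltype(moltype)
    if sketch_moltype is MolType.DNA and input_moltype is MolType.PROTEIN:
        raise ValueError("cannot add protein sequence to a DNA sketch")
    return input_moltype


print(resolve_input_moltype(MolType.PROTEIN))          # input treated as protein
print(resolve_input_moltype(MolType.PROTEIN, "dna"))   # input would be translated
```

Keeping the string aliases at the boundary and using the enum internally would let existing string-based callers keep working while the core logic compares enum members.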

do we want to make different MinHash classes for different moltypes?

protein MinHashes are more complicated than DNA.
the default ksize/scaled/etc could mirror the CLI defaults

summary: further improvements to protein handling in sourmash #1525
