sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

Custom alphabet or moltype? #3171

Open olgabot opened 4 months ago

olgabot commented 4 months ago

Hello, hope you're doing well!

I was wondering if it would be possible in the future to support custom moltypes, e.g. if I wanted to do a riff on the Dayhoff alphabet where arginines were a special category because I wanted to look for arginine conservation specifically. Is that something that could be possible?

Thank you so much!
Warmest, Olga

ctb commented 4 months ago

Possible, but not proximal? :(

You could split this into two distinct phases -

  1. Generating the sketches. This is in some sense easy, since you can write your own code to generate hash values and just add them to a MinHash object; @luizirber and I have both done this at various times. The only catch is that you are responsible for catching incompatible sketches yourself - you wouldn't want to compare OlgaCustom sketches to regular protein sketches.
  2. Adding custom sketch types into sourmash. This is valuable and important but not straightforward at the moment. In brief, the simplest idea would be to add support for different hash function identifier strings to sourmash. Please see the discussion in https://github.com/sourmash-bio/sourmash/issues/1659 and https://github.com/sourmash-bio/sourmash/issues/751.
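
As a rough illustration of phase (1), here is a minimal sketch of what "generate your own hash values and add them to a MinHash object" could look like for a Dayhoff-style alphabet with arginine split into its own class. Everything named here is an assumption for illustration, not sourmash's own encoding or hashing: the class codes are arbitrary labels, `encode`, `kmer_hashes`, and `to_minhash` are hypothetical helpers, and `hashlib.blake2b` is a stand-in 64-bit hash (sourmash's built-in sketching uses a different hash function, so these values are not comparable to regular protein or Dayhoff sketches).

```python
import hashlib

# Dayhoff-style classes with arginine (R) pulled out of the H/K/R group.
# The single-letter class codes are arbitrary labels chosen for this example.
CUSTOM_GROUPS = {
    "a": "C",
    "b": "AGPST",
    "c": "DENQ",
    "d": "HK",
    "e": "ILMV",
    "f": "FWY",
    "r": "R",
}
ENCODE = {aa: code for code, aas in CUSTOM_GROUPS.items() for aa in aas}


def encode(protein):
    """Translate a protein string into the custom alphabet ('X' = unknown residue)."""
    return "".join(ENCODE.get(aa, "X") for aa in protein.upper())


def kmer_hashes(protein, ksize):
    """Return the set of 64-bit hash values for all encoded k-mers."""
    encoded = encode(protein)
    hashes = set()
    for i in range(len(encoded) - ksize + 1):
        kmer = encoded[i:i + ksize]
        if "X" in kmer:
            continue  # skip k-mers containing residues outside the alphabet
        # Stand-in hash: 8-byte BLAKE2b digest read as an unsigned 64-bit int.
        digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return hashes


def to_minhash(hashes, ksize, scaled):
    """Add precomputed hashes to a scaled sourmash MinHash (requires sourmash)."""
    import sourmash  # imported lazily so the helpers above work without it

    # The sketch carries no record of the custom alphabet, so keeping these
    # sketches apart from regular ones is up to the caller.
    mh = sourmash.MinHash(n=0, ksize=ksize, scaled=scaled)
    mh.add_many(sorted(hashes))
    return mh
```

With this encoding, a K-to-R substitution changes the k-mer hashes while an H-to-K substitution does not, which is the arginine-specific behavior asked about. Because nothing in the resulting sketch says "custom alphabet", storing such sketches in clearly named, separate files is one simple way to avoid comparing them against regular protein sketches by accident.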

I guess a third would be "implement fast sketching in Rust core", but I would argue that with (2) in place you don't really need to do this - you can write your own plugin/sketching code as in (1) and have it remain outside of core indefinitely.
