modernatx / seqlike

Unified biological sequence manipulation in Python
https://modernatx.github.io/seqlike
Apache License 2.0

Support arbitrary alphabets #49

Closed ericmjl closed 2 years ago

ericmjl commented 2 years ago

It'd be nice to support arbitrary alphabets for sequences that are not necessarily string-type. For example, we may want a sequence of codons, or a sequence of other entities.

Doing so would allow us to access the to_onehot() or to_index() capabilities of SeqLike objects without necessarily being bound to BioPython SeqRecord/Seq objects.

Potential challenges:

  1. We would break the "default to a Seq/SeqRecord pair" assumption that we make in SeqLike. A list of codons is neither!
  2. We may need to rearchitect the SeqLike object such that there is a .sequence and .alphabet, which the encoder functions (the .to_*() functions) expect (?).
  3. We may need a more generic SeqLike object from which our current SeqLikes inherit.

A good concrete first step here is to create an Abstract Base Class for discussion purposes.
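For discussion, a minimal sketch of what such an ABC could look like. The names AbstractSeqLike and CodonSeq are hypothetical and not part of seqlike; the only thing taken from this issue is the idea of exposing .sequence and .alphabet.

```python
from abc import ABC, abstractmethod
from collections.abc import Sequence


class AbstractSeqLike(ABC):
    """Hypothetical base class: anything with a sequence and an alphabet."""

    @property
    @abstractmethod
    def sequence(self) -> Sequence:
        """The ordered symbols; may be a str, a list of codons, etc."""

    @property
    @abstractmethod
    def alphabet(self) -> Sequence:
        """The ordered symbols that are allowed to appear in `sequence`."""

    def __len__(self) -> int:
        # Shared behaviour only needs `sequence`, not BioPython objects.
        return len(self.sequence)


class CodonSeq(AbstractSeqLike):
    """Hypothetical concrete class holding a list of codons."""

    def __init__(self, sequence, alphabet):
        self._sequence = list(sequence)
        self._alphabet = list(alphabet)

    @property
    def sequence(self):
        return self._sequence

    @property
    def alphabet(self):
        return self._alphabet


codons = CodonSeq(["ATG", "GCC", "TAA"], alphabet=["ATG", "GCC", "TAA", "TGA"])
```

Because both properties are abstract, instantiating AbstractSeqLike directly raises TypeError, which forces every subclass to say where its sequence and alphabet come from.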

andrewgiessel commented 2 years ago

This is a great idea. Implementing the ABC is a good first step. There are a number of abstract classes in https://docs.python.org/3/library/collections.abc.html that we might be able to inherit from, or use as inspiration.

On the SeqLike side, I think all we would need to do is implement:

@property
def sequence(self):
    return self._seqrecord.seq

The big difference would be that in the base class, .sequence would be what was passed into the constructor along with an alphabet.

Methods to pull out/implement in abstract base class might include:

  1. to_str() - probably now ''.join(self.sequence), because it could very well be a list
  2. to_index() - almost as is
  3. to_onehot() - almost as is
  4. apply() - almost as is; relies on __deepcopy__()
  5. count() - as is
  6. find() - this could be implemented with a while loop and self.sequence
  7. __len__() - len(self.sequence)
  8. __contains__(x) - x in self.sequence
  9. __iter__() - iter(self.sequence)

Note that all these methods would depend on self.sequence and potentially self.alphabet.
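As a rough illustration of the list above, here is a pure-Python sketch in which every method depends only on self.sequence and self.alphabet. GenericSeq is a hypothetical name, the one-hot output is a plain list of lists rather than whatever array type SeqLike returns, and apply() is left out because it depends on SeqLike's copy semantics.

```python
class GenericSeq:
    """Hypothetical container: a sequence of symbols plus its alphabet."""

    def __init__(self, sequence, alphabet):
        self.sequence = list(sequence)
        self.alphabet = list(alphabet)
        # Map each symbol to its position in the alphabet.
        self._lookup = {symbol: i for i, symbol in enumerate(self.alphabet)}

    def to_str(self):
        # Assumes the symbols are themselves strings (e.g. codons).
        return "".join(self.sequence)

    def to_index(self):
        return [self._lookup[symbol] for symbol in self.sequence]

    def to_onehot(self):
        width = len(self.alphabet)
        return [[1 if j == i else 0 for j in range(width)] for i in self.to_index()]

    def count(self, symbol):
        return self.sequence.count(symbol)

    def find(self, symbol, start=0):
        # Linear scan with a while loop; returns -1 when absent, like str.find.
        i = start
        while i < len(self.sequence):
            if self.sequence[i] == symbol:
                return i
            i += 1
        return -1

    def __len__(self):
        return len(self.sequence)

    def __contains__(self, x):
        return x in self.sequence

    def __iter__(self):
        return iter(self.sequence)


seq = GenericSeq(["ATG", "GCC", "ATG"], alphabet=["ATG", "GCC", "TAA"])
```

With this, seq.to_index() and seq.to_onehot() work on a list of codons without any Seq or SeqRecord being involved.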

andrewgiessel commented 2 years ago

One open question is what "form" to take as input in the base class (essentially, sequence, index, or one-hot). I think we need to support all three, which means dispatching in the constructor to build .sequence if the input is in index or one-hot form.

This constructor logic would not be used in SeqLike (i.e., no super().__init__() call), and in fact, all we'd use in the inheritance is the interface of having .sequence and .alphabet and the methods that use them.
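One possible way to do the constructor dispatch described above, as a sketch only. The helper name build_sequence and the detection rule (nested rows mean one-hot, integers mean index, anything else is already a sequence) are assumptions for illustration; the rule would be ambiguous for an alphabet whose symbols are themselves integers or tuples, so a real implementation might want an explicit argument naming the input form.

```python
def build_sequence(data, alphabet):
    """Hypothetical helper: normalize sequence, index, or one-hot input."""
    alphabet = list(alphabet)
    data = list(data)
    if not data:
        return []
    first = data[0]
    if isinstance(first, (list, tuple)):
        # One-hot form: each row marks its symbol with a single 1.
        return [alphabet[list(row).index(1)] for row in data]
    if isinstance(first, int):
        # Index form: each entry is a position in the alphabet.
        return [alphabet[i] for i in data]
    # Already in sequence form (a str is split into its characters).
    return data


codon_alphabet = ["ATG", "GCC", "TAA"]
from_symbols = build_sequence(["ATG", "TAA"], codon_alphabet)
from_index = build_sequence([0, 2], codon_alphabet)
from_onehot = build_sequence([[1, 0, 0], [0, 0, 1]], codon_alphabet)
```

All three calls yield the same list of codons, so the rest of the base class only ever has to deal with .sequence.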