modernatx / seqlike

Unified biological sequence manipulation in Python
https://modernatx.github.io/seqlike
Apache License 2.0
207 stars 21 forks source link

[BUG] `seq_type` handling during instantiation #41

Closed pagpires closed 2 years ago

pagpires commented 2 years ago

I was playing around with the library (it was really nice and great job!) and hit this bug: instantiation like SeqLike('ATCGATC') or SeqLike('ATCGATC', None) will fail. From the doc: 1) seq_like is supposedly optional but is required in the implementation (code) and 2) logic regarding seq_like==None only occurs when sequence is a SeqLike (code) as opposed to a native type and thus seems not very useful? (There's a def determine__type() (code) that I think is intended for the job but it's not used.)

In addition, some general comments regarding dispatch, feel free to ignore them if they are out of context: def _construct_seqlike share the same signature for alphabet and codon_map, and def determine__type_and_alphabet share same signature for seq_type and sequence. Thus it seems clearer to extract the only signature that varies? E.g. for the latter case we only dispatch functions based on alphabet (or even directly use if alphabet==None: alphabet=determine_alphabet(_type, sequence) rather than dispatching). Current setup for these two functions seems to create duplicate logics and is error prone (types for sequence here should be the same?)

Again, happy to file a PR for anything specific above.

pagpires commented 2 years ago

Also a quick note: when dispatching for SeqLike, I guess we can use SeqLikeType.__args__ as the type instead of hardcoding a list of potential types, otherwise it's hard to keep them consistent (since I saw there's a plan to include torch.tensor)

ericmjl commented 2 years ago

@pagpires thank you for chiming in! Yes, it looks like we have something wrong in the docs - by design, SeqLike() requires a sequence of letters and a seq_type to be specified. (I'm looking at the API reference here.) However, it appears that the examples provided here don't show this.

In a recent release, we decided to make 'factory functions', aaSeqLike() and ntSeqLike(), which return proper SeqLike objects but don't require specification of seq_type as an argument. Perhaps we could use some help making those two factory functions a bit more prominent. What are your thoughts on how this could be done better?

ericmjl commented 2 years ago

Also a quick note: when dispatching for SeqLike, I guess we can use SeqLikeType.args as the type instead of hardcoding a list of potential types, otherwise it's hard to keep them consistent (since I saw there's a plan to include torch.tensor)

I think this is a great idea. I'm going to make a new issue for this one as it feels like a unit of work that can be made as a PR.

UPDATE: See issue #42 for further discussion on dispatching.

pagpires commented 2 years ago

In a recent release, we decided to make 'factory functions', aaSeqLike() and ntSeqLike()

That makes sense. Then I think it's just that SeqLike's docstring needs update. It might also help to explicitly mention this api design in the development page in the doc, like "nt/aaSeqlike (user interface for instantiation) - SeqLike (data representation, discouraged to be used for instantiation) - _construct_seqlike (logic under the hood)". Just speaking for myself though.

ericmjl commented 2 years ago

Happy to handle this one. It's a small enough change and I know where to make the changes. :smile:.