Open ndousis opened 1 year ago
Thanks for opening @ndousis !
Keeping the AA and NT records in sync (if we can) is a core design tenant of SeqLike and I am very hesitant to change that. I do appreciate how it adds a lot of overhead, so I've explored a couple of options below.
Your flag idea could potentially work. Optional arguments to __getitem__
are possible, but you can't pass them in unless you call the method directly rather than use the [] syntax. i.e.:
s = SeqLike('aaacccgggttt', seq_type="NT")
# the following are the same
s[0:3]
s.__getitem__(slice(0, 3, None))
# assuming the following signature,
# def __getitem__(self, index, safe=True) -> "SeqLike":
# then the following statement works and `safe` is False within `__getitem__`
s.__getitem__(slice(0, 3, None), safe=False)
I've implemented this on my machine (but with behavior preserving defaults) as well as not translating if the NT slice is not a triplet. The latter check is a good suggestion regardless. But you can see that the code is a bit verbose because you have to manually make a slice object.
That said, I do not think the flag is worth implementing unless there is a compelling reason you must use SeqLikes for these sections of your workflow. I would suggest instead to explore using the underlying NT SeqRecord and then remaking any SeqLikes if needed afterwards. This would be especially do-able if you don't need the resulting sequence but only a calculated value on sliding windows. I feel like this is your best option here.
Another alternative would be to subclass SeqLike for your specific use case and override __getitem__
there.
Let me know your thoughts
If the intended behavior is to keep AA and NT in sync, then perhaps we should implement that behavior more directly rather than relying on a re-translation step. As __getitem__()
is implemented currently, a sliding NT window with a single nucleotide stride would yield many out-of-frame AA sequences-- so the AA sequence is practically no longer synchronized. Instead, what if __getitem()__
applied a modulo operation to maintain the codon-AA correspondence using "X" or similar at the 5' and 3' flanks?
As currently implemented, slicing a SeqLike that contains AA and NT records yields:
In particular, the computational overhead of translating the NT sequence during each
__getitem__()
call can be very tedious (e.g., when applying a long sliding window over an entire sequence).I propose that we eliminate the second behavior (attempting to slice the AA record when NT is primary)-- i.e., when slicing the NT SeqLike, all bets are off with respect to the AA record, and the user must explicitly translate the SeqLike to yield a new AA record.
Alternatively, we could add a flag to optionally retain the AA record when True (default False; similar to how BioPython already addresses this for
SeqRecord.reverse_complement()
), though I don't know how to implement this properly in__getitem__()
.