Open Yashrajsinh-Jadeja opened 6 months ago
@ndousis SeqLike memory usage has often been a limiting factor in its adoption, and I think @Yashrajsinh-Jadeja has done a great job diagnosing the matter here. Do you foresee it being a breaking change for you if we were to change the behavior of SeqLike to make `seqnums` optional, with a default behavior of not populating it?
One comment I would like to add: this behavior is not observed when storing SeqLike objects in a pandas Series. I don't have a good idea why, but it could be related to how pandas compresses, serializes, and optimizes different data types. I haven't looked into it as deeply as I have for native Python data structures (like a list), but it could also be related to `deepcopy` or to passing and storing references to objects. A curious case nonetheless.
@Yashrajsinh-Jadeja thank you for this very thoughtful analysis and proposal. The attribute `_per_letter_annotations` is derived from BioPython's `SeqRecord` class and carries special functionality when slicing, splitting, and concatenating sequences.
These letter annotations can carry useful metadata that is preserved when manipulating sequence alignments, and so we've gone to extra pains to carry this information over when building alignments using tools such as MAFFT.
However, the attribute `letter_annotations` defaults to `None` in `SeqRecord`, so it may be worth seeing if the proposed solution breaks any critical `SeqLike` unit tests. My workflows depend on `SeqRecord` functionality, but it appears that the proposal is compatible with this functionality, so let's give it a try.
Description
I've noticed that the `_per_letter_annotations` attribute in the `SeqLike` class is not actively used by many users of the package. However, it contributes significantly to the memory footprint of every object of this class. This can affect performance, especially in environments where resource efficiency (particularly memory) is critical.
For example, if we take a 201-character-long nucleotide string,
and create a nucleotide SeqLike object from the nucleotide string `nt_seq`,
Looking at the memory footprint of `seq_obj` using pympler
Further digging into the memory footprint of the object by unpacking attributes,
attribute_sizes = get_attribute_sizes(seq)
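The `get_attribute_sizes` helper is not shown in this issue. As a rough sketch of what such a helper might look like, the snippet below walks an object's instance attributes and estimates each one's deep size. It uses only the standard library (`sys.getsizeof` with a recursive walk) as a stand-in for pympler's deep sizing, so the byte counts will not match the numbers reported here. `deep_sizeof` and the `Demo` class are hypothetical names introduced for illustration.

```python
import sys


def deep_sizeof(obj, _seen=None):
    """Approximate recursive size of obj in bytes (stdlib stand-in for pympler)."""
    if _seen is None:
        _seen = set()
    if id(obj) in _seen:  # avoid double counting and reference cycles
        return 0
    _seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_sizeof(k, _seen) + deep_sizeof(v, _seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_sizeof(item, _seen) for item in obj)
    elif hasattr(obj, "__dict__"):
        size += deep_sizeof(vars(obj), _seen)
    return size


def get_attribute_sizes(obj):
    """Map each instance attribute name to its approximate deep size, largest first.

    Each attribute is sized independently, so objects shared between
    attributes are counted once per attribute.
    """
    sizes = {name: deep_sizeof(value) for name, value in vars(obj).items()}
    return dict(sorted(sizes.items(), key=lambda kv: kv[1], reverse=True))


class Demo:
    """Hypothetical object with one small and one large attribute."""

    def __init__(self):
        self.small = 1
        self.big = [str(i) for i in range(1, 202)]


print(list(get_attribute_sizes(Demo())))  # ['big', 'small']
```

Sorting the result largest-first makes it easy to slice off the top N attributes for plotting.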
Plotting top 20 attributes
We see that `_per_letter_annotations` makes up a sizeable chunk of the `_nt_record` attribute, 13472 bytes to be precise.
Further dissecting the `_per_letter_annotations` attribute, we can see that it is a dictionary with one key-value pair: the key is `seqnums` and the value is a single list of string elements that are presumably indices running up to the length of the sequence.
By setting `seq_obj._nt_record._per_letter_annotations` to `None`, we can see a considerable reduction in the memory occupied by the object.
Comments
This is a reduction in the memory footprint of this one object by ~70%.
I have observed similar behavior for `_aa_record._per_letter_annotations` as well, so the same applies to objects created as an AA record instead of an NT record.
The memory bloat can add up significantly over time and can be a critical limiting factor, especially for large machine-learning/computational biology data processing and analysis applications.
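To get a feel for why one list of string indices is so costly, the sketch below builds a dictionary shaped like the one described above and sums the shallow sizes of its parts. It assumes `seqnums` holds 1-based positions as strings, one per letter. That matches the description in this issue but was not checked against the SeqLike source. The totals come from `sys.getsizeof` rather than pympler, so they will differ from the 13472 bytes reported above.

```python
import sys

seq_len = 201  # length of the nucleotide string used in this issue

# Assumption: one string index per letter, starting at 1.
per_letter_annotations = {"seqnums": [str(i) for i in range(1, seq_len + 1)]}


def annotation_bytes(annotations):
    """Sum the sizes of the dict, its keys, its lists, and every list element."""
    if annotations is None:
        return sys.getsizeof(None)
    total = sys.getsizeof(annotations)
    for key, values in annotations.items():
        total += sys.getsizeof(key) + sys.getsizeof(values)
        total += sum(sys.getsizeof(v) for v in values)
    return total


with_seqnums = annotation_bytes(per_letter_annotations)
without_seqnums = annotation_bytes(None)
print(f"with seqnums: {with_seqnums} bytes, with None: {without_seqnums} bytes")
```

Each element is a separate Python string object with its own header, so the cost grows linearly with sequence length and is paid again for every SeqLike instance.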
Expected Behavior
Objects of the `SeqLike` class should not allocate memory for attributes that are not used, thereby reducing the overall memory footprint of the application.
Current Behavior
Currently, every instance of the `SeqLike` class includes the `_per_letter_annotations` attribute, which increases memory usage unnecessarily.
Possible Solution
One potential temporary solution is to set the `_per_letter_annotations` attribute to `None` after its last necessary use, or to remove the attribute entirely if it is confirmed to be redundant.
Alternative solutions may include looking at (line 1082 in particular)
https://github.com/modernatx/seqlike/blob/dde761ced5e3dcf86010d1e50abc3b268f794d8f/seqlike/SeqLike.py#L1070-L1083
and modifying the function so that per-letter annotations are added only when a condition is met, rather than by default.
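The conditional approach could look roughly like the sketch below. It is not the SeqLike implementation: `Record`, `make_record`, and the `add_seqnums` flag are hypothetical names, and the plain dataclass only stands in for a sequence record so the example runs without Biopython or SeqLike installed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Hypothetical stand-in for a sequence record; not the SeqLike or Biopython class."""

    seq: str
    letter_annotations: Optional[dict] = None


def make_record(seq: str, add_seqnums: bool = False) -> Record:
    """Build a record, populating per-letter sequence numbers only on request."""
    record = Record(seq)
    if add_seqnums:
        # Assumption: seqnums are 1-based positions stored as strings.
        record.letter_annotations = {
            "seqnums": [str(i) for i in range(1, len(seq) + 1)]
        }
    return record


lean = make_record("ACGT")
annotated = make_record("ACGT", add_seqnums=True)
print(lean.letter_annotations)       # None
print(annotated.letter_annotations)  # {'seqnums': ['1', '2', '3', '4']}
```

Defaulting the flag to `False` keeps the lean path as the default, while workflows that rely on per-letter annotations when slicing or aligning can opt in explicitly.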