modernatx / seqlike

Unified biological sequence manipulation in Python
https://modernatx.github.io/seqlike
Apache License 2.0

High Memory Footprint Due to (mostly?) Unused Attribute #84

Open Yashrajsinh-Jadeja opened 6 months ago

Yashrajsinh-Jadeja commented 6 months ago

Description

I've noticed that the _per_letter_annotations attribute of the SeqLike class does not appear to be actively used by most users of the package, yet it contributes significantly to the memory footprint of every object of this class. This can hurt performance, especially in environments where memory is constrained.

For example, if we take a 201 character long nucleotide string,

import random
seed = 33 #Setting seed for reproducibility
nt_length = 201 #Sequence length
random.seed(seed) 
letters = ["A","T","G","C"] #Picking ATGC DNA nucleotide characters
nt_seq = ''.join(random.choice(letters) for _ in range(nt_length)) #Creating a random string

And create a nucleotide SeqLike object of the nucleotide string nt_seq,

from seqlike import SeqLike
seq_obj = SeqLike(sequence=nt_seq,seq_type="NT") #Creating a seqlike object

Looking at the memory footprint of the seq_obj using pympler

from pympler import asizeof
print("Size of SeqLike Object:", asizeof.asizeof(seq_obj), "bytes") #Looking at size of the object
Size of SeqLike Object: 19328 bytes

Further digging into the memory footprint of the object by unpacking attributes,

from pympler import asizeof

def get_attribute_sizes(obj, path='', visited=None, sizes=None):
    if visited is None:
        visited = set()
    if sizes is None:
        sizes = {}

    obj_id = id(obj)
    if obj_id in visited:
        return sizes
    visited.add(obj_id)

    # Calculate the size and store it if not zero
    obj_size = asizeof.asizeof(obj)
    if obj_size > 0:
        sizes[path if path else 'self'] = obj_size

    # Handle different types of collections and objects
    if hasattr(obj, '__dict__'):
        for attr, value in obj.__dict__.items():
            full_path = f"{path}.{attr}" if path else attr
            get_attribute_sizes(value, full_path, visited, sizes)
    elif isinstance(obj, dict):
        for key, value in obj.items():
            full_path = f"{path}.{key}" if path else str(key)
            get_attribute_sizes(value, full_path, visited, sizes)
    elif isinstance(obj, (list, set, tuple)):
        for index, item in enumerate(obj):
            full_path = f"{path}[{index}]" if path else f"[{index}]"
            get_attribute_sizes(item, full_path, visited, sizes)
    return sizes

attribute_sizes = get_attribute_sizes(seq_obj)

Plotting top 20 attributes

image
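
For readers who want the ranking without a plot, the sizes dictionary can be sorted directly. The following is a minimal sketch: top_attributes is a hypothetical helper name, and the toy dictionary holds placeholder values, not measurements from SeqLike. Note that asizeof.asizeof reports the size of an object together with everything it references, so a parent entry such as _nt_record already includes the bytes of its children.

```python
def top_attributes(sizes, n=20):
    """Return the n largest (path, bytes) pairs, largest first."""
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)[:n]


# Placeholder values for illustration only; not measurements from SeqLike.
toy_sizes = {"self": 900, "a": 500, "a.b": 350, "c": 120}

for path, size in top_attributes(toy_sizes, n=3):
    print(f"{path:<10} {size:>6} bytes")
```

With the real data, the call would be top_attributes(attribute_sizes), using the dictionary returned by get_attribute_sizes above.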

We see that _per_letter_annotations makes up a sizeable chunk of the _nt_record attribute, 13472 bytes to be precise.

Dissecting the _per_letter_annotations attribute further, we can see that it is a dictionary with a single key, seqnums, whose value is a list of strings. These strings appear to be 1-based position indices that run up to the length of the sequence.

print(seq_obj._nt_record._per_letter_annotations.keys()) #See keys of the dictionary
print(seq_obj._nt_record._per_letter_annotations["seqnums"]) #Focus on values of the dictionary
print(type(seq_obj._nt_record._per_letter_annotations["seqnums"][0])) #See data type of the element in the list
dict_keys(['seqnums'])
['1', '2', '3', '4', '5', '6', '7', '8', '9' .... '201']
<class 'str'>
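
To get a rough feel for why such a list is costly, the sketch below rebuilds an equivalent list with plain Python and measures it with sys.getsizeof from the standard library. It assumes the annotation is exactly the list of numeric strings shown above; it does not touch SeqLike itself. The byte counts depend on the Python version and platform, and sys.getsizeof will not match pympler's figures exactly, so no expected output is given.

```python
import sys

n = 201  # sequence length used in the example above

# The per-letter annotation as observed: one str object per position.
seqnums = [str(i) for i in range(1, n + 1)]

# Shallow size of the list (its pointer array) plus each string object.
list_bytes = sys.getsizeof(seqnums)
str_bytes = sum(sys.getsizeof(s) for s in seqnums)

# A range object describes the same numbering in constant space.
range_bytes = sys.getsizeof(range(1, n + 1))

print("list container:  ", list_bytes, "bytes")
print("string elements: ", str_bytes, "bytes")
print("equivalent range:", range_bytes, "bytes")
```

Every position carries a full string object plus a pointer in the list, which is why the cost grows linearly with sequence length.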

Setting seq_obj._nt_record._per_letter_annotations to None considerably reduces the memory occupied by the object,

from pympler import asizeof
seq_obj._nt_record._per_letter_annotations = None
print("Size of SeqLike Object:", asizeof.asizeof(seq_obj), "bytes")
Size of SeqLike Object: 5856 bytes

image

Comments

  1. This is a reduction in memory footprint of this one object by ~70%.

  2. I have observed similar behavior for _aa_record._per_letter_annotations as well, so the same applies to objects created as an AA record instead of an NT record.

  3. The memory bloat can add up significantly over time and can be a critical limiting factor (memory-wise) especially for large machine-learning/computational biology data processing and analysis applications.
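
The figures behind these comments can be checked with a few lines of arithmetic. The two byte counts come from the measurements above; the collection size of one million objects is a hypothetical number chosen only to illustrate scale, and the projection assumes every object carries the same overhead as this 201-letter example.

```python
before = 19328  # bytes, asizeof result reported above
after = 5856    # bytes, after setting _per_letter_annotations to None

saved = before - after
reduction = saved / before

print(saved)               # 13472
print(f"{reduction:.1%}")  # 69.7%

# Hypothetical collection size, chosen only for illustration.
n_objects = 1_000_000
print(f"{saved * n_objects / 1e9:.1f} GB")  # 13.5 GB
```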

Expected Behavior

Objects of the SeqLike class should not allocate memory for attributes that are not used, thereby reducing the overall memory footprint of the application.

Current Behavior

Currently, every instance of the SeqLike class includes the _per_letter_annotations attribute, which increases the memory usage unnecessarily.

Possible Solution

One potential stopgap is to set the _per_letter_annotations attribute to None after its last necessary use, or to remove the attribute entirely if it is confirmed to be redundant.

Alternative solutions may include looking at (line 1082 in particular)

https://github.com/modernatx/seqlike/blob/dde761ced5e3dcf86010d1e50abc3b268f794d8f/seqlike/SeqLike.py#L1070-L1083

and modifying the function so that per-letter annotations are added only when a condition is met, rather than by default.
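
As a rough illustration of that idea, the sketch below makes position numbering opt-in. It is not the SeqLike or Biopython implementation: Record, make_record, and the add_seqnums flag are all hypothetical names, and a plain dataclass stands in for the real record type.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Record:
    """Hypothetical stand-in for a sequence record; not the SeqLike or Biopython API."""
    sequence: str
    letter_annotations: Dict[str, List[str]] = field(default_factory=dict)


def make_record(sequence: str, add_seqnums: bool = False) -> Record:
    """Build a record, numbering positions only when the caller asks for it."""
    record = Record(sequence)
    if add_seqnums:
        record.letter_annotations["seqnums"] = [str(i + 1) for i in range(len(sequence))]
    return record


lean = make_record("ATGC")
annotated = make_record("ATGC", add_seqnums=True)
print(lean.letter_annotations)       # {}
print(annotated.letter_annotations)  # {'seqnums': ['1', '2', '3', '4']}
```

Defaulting the flag to False keeps the common case lean, while callers that rely on position numbers through alignment or slicing can still request them explicitly.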

ericmjl commented 6 months ago

@ndousis SeqLike memory usage has often been a limiting factor in its usage, and I think @Yashrajsinh-Jadeja has definitely done a great job here diagnosing the matter. Do you foresee it being a breaking change for you if we were to change the behavior of SeqLike to make seqnums optional, with a default behavior of not populating it?

Yashrajsinh-Jadeja commented 6 months ago

One comment I would like to add: this behavior is not observed when SeqLike objects are stored in a pandas Series. I don't have a good explanation for why this happens. It could be down to how pandas compresses, serializes, and optimizes different data types, or it could be related to deepcopy or to passing and storing references to objects. I haven't looked into it as deeply as I have for native Python data structures (like a list), but it is a curious case nonetheless.

ndousis commented 6 months ago

@Yashrajsinh-Jadeja thank you for this very thoughtful analysis and proposal. The attribute _per_letter_annotations is derived from BioPython's SeqRecord class and carries special functionality when slicing, splitting, and concatenating sequences.

These letter annotations can carry useful metadata that is preserved when manipulating sequence alignments, and so we've gone through extra pains to carry this information over when building alignments using tools such as MAFFT.

However, the attribute letter_annotations defaults to None in SeqRecord, so it may be worth seeing if the proposed solution breaks any critical SeqLike unit tests. My workflows depend on SeqRecord functionality, but it appears that the proposal is compatible with this functionality, so let's give it a try.