hyanwong / giglib

MIT License
4 stars 2 forks source link

Use numpy structured array for iedge table #108

Closed hyanwong closed 5 months ago

hyanwong commented 6 months ago

Peter Ralph suggested we could store the iedge data in a numpy array. We can wrap the whole thing in a class which will preallocate memory in the right way. We could use the numpy recarray class (which is a wrapper for structured arrays) but I have read that it can be slower for accessing attributes, because of the checking it has to do. I think we can just use structured arrays with a bit of syntactic sugar, and it should be very fast. Here's an outline: I think it might just be a drop-in replacement. I'm hoping that this could make everything much faster:

import collections
import logging
class IedgeTableRow(collections.namedtuple("IedgeTableRow", "child_left, child_right, parent_left, parent_right, child, parent, child_chromosome, parent_chromosome, edge")):
    @property
    def diff(self):
        return self.child_left - self.child_right

class IedgeTable:
    initial_size = 1  # number of rows - change to e.g. 128
    max_resize =  2**18  # maximum number of rows we expand the storage by
    def __init__(self):
        self._datastore = np.empty((self.initial_size), dtype=[(name, np.int64) for name in Row._fields])
        self._data = self._datastore[0: 0]

    def __getitem__(self, index):
        return IedgeTableRow(*self._data[index])

    def _expand_datastore(self):
        print("preallocating new memory ...")  # change to logging.debug
        old = self._datastore
        try:
            self._datastore = np.empty_like(self._datastore, shape=min(2 * len(self._datastore), len(self._datastore) + self.max_resize))
        except MemoryError:
            # should try dumping the table to disk here, I suppose?
            raise
        self._datastore[0:len(old)] = old
        print(f"... allocated space for {len(self._datastore) - len(old)} new rows")  # change to logging.debug
        # Hopefully force garbage collection of old array: but not if the user has a variable
        # accessing the old array (see test case)
        del old

    def add_row(self, child_left, child_right, parent_left, parent_right, *, child, parent, child_chromosome=0, parent_chromosome=0, edge=-1):
        # Check here if size of self._datastore needs increasing. If so, resize to e.g. double the size up to a certain limit
        if len(self._data) == len(self._datastore):
            self._expand_datastore()
        self._datastore[len(self._data)] = (child_left, child_right, parent_left, parent_right, child, parent, child_chromosome, parent_chromosome, edge)
        self._data = self._datastore[0: len(self._data)+1]

    def __len__(self):
        return len(self._data)

    @property
    def _num_preallocated_rows(self):
        return len(self._datastore)

    @property
    def child_left(self):
        return self._data['child_left']
    @property
    def child(self):
        return self._data['child']
    @property
    def parent(self):
        return self._data['parent']
    # Define other properties here

# test it!

t = IedgeTable()
t.add_row(0, 1, 0, 1, child=0, parent=1)
t.add_row(0, 1, 0, 1, child=0, parent=2)
t.add_row(0, 1, 0, 1, child=0, parent=3)
m = t.parent
t.add_row(0, 1, 0, 1, child=0, parent=4)
t.add_row(0, 1, 0, 1, child=0, parent=5)
print(f"{len(t)} rows in table (allocated space for {t._num_preallocated_rows}) rows")
row = t[-1]
print("Last row:", row)
print(row.child)  # attribute accessing
print(row.diff)  # "method" accessing
print("Parents in table:", t.parent)  # array accessing
# but note that `m` is a reference to the "old" array
print(f"kept a reference to {len(m)} rows, indexing into an array of {len(m.base)} rows")

Giving:

preallocating new memory ...
... allocated space for 1 new rows
preallocating new memory ...
... allocated space for 2 new rows
preallocating new memory ...
... allocated space for 4 new rows
5 rows in table (allocated space for 8) rows
Last row: IedgeTableRow(child_left=0, child_right=1, parent_left=0, parent_right=1, child=0, parent=5, child_chromosome=0, parent_chromosome=0, edge=-1)
0
-1
Parents in table: [1 2 3 4 5]
kept a reference to 3 rows, indexing into an array of 4 rows

Note the slight issue at the end that any references to the old arrays will not reflect the original data when a new memory array has been allocated. I guess there's nothing we can do about that, and it's an edge case anyway.

hyanwong commented 5 months ago

https://github.com/hyanwong/GeneticInheritanceGraphLibrary/tree/numpy-edge-table shows that it can be done. We should probably implement this, but it requires a bit of thought to do it in the simplest way.

hyanwong commented 5 months ago

Merged in #116