pepkit / peppy

Project metadata manager for PEPs in Python
https://pep.databio.org/peppy
BSD 2-Clause "Simplified" License

Performance idea: generator for object-style representation #432

Open nsheff opened 1 year ago

nsheff commented 1 year ago

In the past we've raised some issues about peppy performance (see #388, #387). Peppy is fine for small projects (hundreds or even thousands of sample rows), but it gets slow when we are dealing with huge projects, like tens to hundreds of thousands of samples.

It would be nice if peppy could handle these very large projects.

One of the problems is that peppy is storing sample information in two forms: a table (as a pandas data frame object), and as a list of Sample objects. This is duplicating the information in memory.

An idea for improving performance could be to switch to a single-memory model. But we really want to be able to access the metadata in both ways for different use cases... so what about using the pandas DataFrame as the main data structure, and then providing some kind of generator that could go through it and create objects on the fly, in case someone wants the list-based approach?

This could be one way to increase performance.

nsheff commented 1 year ago

Here's some proof-of-concept code showing how to do this.

import pandas

# Read in sample table
st = pandas.read_csv("sample_table.csv")

# This is a generator that allows you to loop through a pandas DataFrame, but gives you one dict per sample row.
def gen_row():
    i = 0
    while i < len(st.index):
        yield st.loc[i].to_dict()
        i += 1

Use like:

[row for row in gen_row()]

Here's an alternative approach that uses a namedtuple instead of a dict:

from collections import namedtuple

# Build a Sample namedtuple class from the table's column names
Sample = namedtuple('Sample', st.dtypes.index.tolist())

# Example: first row as a Sample
s1 = Sample(*st.iloc[0, :])

def gen_row_obj():
    i = 0
    while i < len(st.index):
        yield Sample(*st.iloc[i, :])
        i += 1
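A lighter-weight variant (my assumption, not something peppy does today) would be to lean on pandas' built-in `DataFrame.itertuples`, which already yields namedtuples row by row without materializing a list. The toy table below is hypothetical, standing in for `sample_table.csv`:

```python
import pandas

# Hypothetical small sample table standing in for sample_table.csv
st = pandas.DataFrame({
    "sample_name": ["a", "b"],
    "protocol": ["RNA", "ATAC"],
})

# itertuples yields one namedtuple per row; name= sets the class name
def gen_row_obj_itertuples(df):
    for row in df.itertuples(index=False, name="Sample"):
        yield row

samples = list(gen_row_obj_itertuples(st))
print(samples[0].sample_name)  # fields are accessible by column name
```

This avoids the per-row `iloc` indexing, which is comparatively slow.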

We could also think about reading the file in line by line:

def csv_reader(file_name):
    # Stream rows one at a time; the context manager closes the file when done
    with open(file_name, "r") as f:
        for row in f:
            yield row

But I think we really can't do this, because duplicate sample names in a table require processing across rows. I'm also not sure it would really be a benefit, given the added overhead of file reading and the loss of vectorized processing, since each sample would need to be processed individually.

nsheff commented 1 year ago

After looking into this more, I actually think the most important performance-related problem is not storing the samples in two ways, but the way duplicate sample names are identified and merged, which is extremely inefficient.

This looks like an O(N^2) approach: I'm not certain, but I'd bet Python's list.count() has to walk through every item in the list:

https://github.com/pepkit/peppy/blob/cac87fb91958d77f89c7ad76108c2efbf67a58e6/peppy/project.py#L637-L649

and then samples are looped through again here:

https://github.com/pepkit/peppy/blob/cac87fb91958d77f89c7ad76108c2efbf67a58e6/peppy/project.py#L590

and I think other times as well. So there are some algorithmic issues. This should be achievable in a single linear pass through the sample objects. It could probably be done very quickly using pandas, but even with plain Sample objects a single loop should work.

Basically, just fixing the counting so it goes through the list once would probably yield a huge speed benefit, and it should be really simple to implement.
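As a rough sketch (not peppy's actual code; `sample_names` here is a hypothetical list), the repeated `list.count()` calls could be replaced with a single pass using `collections.Counter`:

```python
from collections import Counter

# Hypothetical list of sample names, with duplicates that need merging
sample_names = ["a", "b", "a", "c", "a"]

# One linear pass builds the full count table, instead of calling
# sample_names.count(name) once per sample (O(N) each, O(N^2) total)
counts = Counter(sample_names)

# Names that appear more than once are candidates for merging
duplicated = [name for name, n in counts.items() if n > 1]
print(duplicated)  # → ['a']
```

The same count table could then drive the merge step, so duplicates are handled in one loop rather than re-scanning the list per sample.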

nsheff commented 1 year ago

We should address #436 first, since that's both easier to do, and probably more impactful.