pepkit / peppy

Project metadata manager for PEPs in Python
https://pep.databio.org/peppy
BSD 2-Clause "Simplified" License
37 stars 13 forks source link

Performance idea: find duplicate samples more efficiently #436

Closed nsheff closed 1 year ago

nsheff commented 1 year ago

Raised in https://github.com/pepkit/peppy/issues/432#issuecomment-1480989617

I actually think the most important performance-related problem is not actually storing the samples in two ways, but in the way duplicate sample name are identified and merged, which is extremely inefficient.

This looks like an N^2 approach, since I'm not sure but I bet pythons List.count() function has to go through all the items in the list:

https://github.com/pepkit/peppy/blob/cac87fb91958d77f89c7ad76108c2efbf67a58e6/peppy/project.py#L637-L649

and then samples are looped through again here:

https://github.com/pepkit/peppy/blob/cac87fb91958d77f89c7ad76108c2efbf67a58e6/peppy/project.py#L590

and I think other times. So there's some algorithmic issues. This should be able to be accomplished in 1 linear pass through the sample objects. Can probably be done very quickly using pandas, but even if using sample objects just a single loop should probably work.

Basically just fixing the counting to go through and count once would probably be a huge speed benefit, and should be really simple to implement.

nsheff commented 1 year ago

@neil-phan this is what we discussed

nsheff commented 1 year ago

This is now released on 0.35.5.