pepkit / peppy

Project metadata manager for PEPs in Python
https://pep.databio.org/peppy
BSD 2-Clause "Simplified" License

Duplicate sample speed #439

Closed neil-phan closed 1 year ago

neil-phan commented 1 year ago

The current issue with the duplicate sample ID search is that it uses the list `.count()` method to find duplicates, resulting in O(n^2) time complexity in the worst case.

https://github.com/pepkit/peppy/blob/f2c7109fabf818e943c1bfd2f9e0c848b955d92e/peppy/project.py#L637-L649

To reduce this to a single O(n) pass, a set can be used to check whether an ID has already been seen in O(1) time.

https://github.com/pepkit/peppy/blob/a48aee4261210e7bdcefd20131c1225b47bda2c0/peppy/project.py#L637-L647
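The set-based approach can be sketched roughly as follows (an illustrative standalone function, not the actual peppy code; the name `find_duplicated_ids` is hypothetical):

```python
def find_duplicated_ids(sample_ids):
    """Return IDs that appear more than once, in a single O(n) pass."""
    seen = set()
    duplicated_ids = set()
    for sample_id in sample_ids:
        if sample_id in seen:          # O(1) set membership test
            duplicated_ids.add(sample_id)
        else:
            seen.add(sample_id)
    return duplicated_ids

print(find_duplicated_ids(["a", "b", "a", "c", "b", "a"]))  # {'a', 'b'}
```

Each ID is visited once, and each visit does constant-time set work, so the whole pass is O(n) instead of calling `.count()` (itself O(n)) per element.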

neil-phan commented 1 year ago

Hmm, I'll have to look at what went wrong with the GitHub Actions check, but all the tests passed on my local machine.

nsheff commented 1 year ago

That's just a lint check. If you run `black` it will solve it.

nsheff commented 1 year ago

Looks great! Maybe just rename `result` to `duplicated_ids` to be more descriptive?

nsheff commented 1 year ago

To test the speed, I made a dummy table with 50,000 rows. In 0.35.4, it takes 37.6 seconds on my computer to create a peppy.Project object. With this fix, it takes 6.1 seconds. Very nice!

```shell
python -c 'import time; import peppy; start = time.time(); p1 = peppy.Project("bigtable.csv"); end = time.time(); print(end-start)'
```

Version 0.35.4: 37.666146993637085
This branch: 6.145094156265259
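The speedup can also be reproduced on toy data without peppy, by timing the quadratic `.count()` scan against the linear set-based pass (illustrative micro-benchmark, not the actual peppy code path; sizes chosen arbitrarily):

```python
import time

# Toy ID list with exactly one duplicated ID.
ids = [f"sample_{i}" for i in range(5_000)] + ["sample_0"]

# Quadratic approach: .count() is O(n) per element, O(n^2) overall.
start = time.time()
dups_quadratic = {x for x in ids if ids.count(x) > 1}
t_quadratic = time.time() - start

# Linear approach: one pass with O(1) set membership tests.
start = time.time()
seen, dups_linear = set(), set()
for x in ids:
    if x in seen:
        dups_linear.add(x)
    else:
        seen.add(x)
t_linear = time.time() - start

print(dups_quadratic == dups_linear)  # True: same duplicates found
print(t_quadratic > t_linear)         # True: the linear pass is much faster
```

The gap widens with table size, which matches the ~6x improvement reported above for a 50,000-row table.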