Closed: neil-phan closed this 1 year ago
Hmm, I'll have to look at what went wrong with the GitHub Actions check, but all the tests passed on my local machine.
That's just a lint check. If you run `black`, it will fix it.
Looks great! Maybe just rename `result` to `duplicated_ids` to be more descriptive?
To test the speed, I made a dummy table with 50,000 rows. In 0.35.4, it takes 37.6 seconds on my machine to create a `peppy.Project` object. With this fix, it takes 6.1 seconds. Very nice!
```shell
python -c 'import time; import peppy; start = time.time(); p1 = peppy.Project("bigtable.csv"); end = time.time(); print(end-start)'
```

Version 0.35.4: 37.666146993637085
This branch: 6.145094156265259
The issue with the duplicate sample ID search was that it used the list `.count()` method to find duplicates, which gives O(n²) time complexity in the worst case.
https://github.com/pepkit/peppy/blob/f2c7109fabf818e943c1bfd2f9e0c848b955d92e/peppy/project.py#L637-L649
To reduce this to a single O(n) pass, a set can be used instead, since set membership checks run in O(1) time.
https://github.com/pepkit/peppy/blob/a48aee4261210e7bdcefd20131c1225b47bda2c0/peppy/project.py#L637-L647
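The idea behind the fix can be sketched like this (a minimal standalone version, not the exact peppy code; the function name `find_duplicated_ids` is made up for illustration):

```python
def find_duplicated_ids(sample_ids):
    """Return the IDs that appear more than once, in one O(n) pass.

    Instead of calling list.count() for each element (O(n) per call,
    O(n^2) overall), we track IDs we've already seen in a set, so each
    membership check is O(1).
    """
    seen = set()
    duplicated_ids = set()
    for sample_id in sample_ids:
        if sample_id in seen:
            duplicated_ids.add(sample_id)
        else:
            seen.add(sample_id)
    return sorted(duplicated_ids)


# Example: "a" and "b" occur more than once.
print(find_duplicated_ids(["a", "b", "a", "c", "b", "a"]))  # ['a', 'b']
```

The same single-pass, set-based pattern is what brings the 50,000-row benchmark above from tens of seconds down to a few.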