openmm / pdbfixer

PDBFixer fixes problems in PDB files
Other
464 stars 114 forks source link

Speed up _findNearestDistance #267

Closed edwag closed 1 year ago

edwag commented 1 year ago

This pull request relates to #264

I've put together a faster implementation of the _findNearestDistance method. I haven't tested or benchmarked this thoroughly, but I've measured a 10-100x speedup depending on the input.

peastman commented 1 year ago

This implementation speeds it up by a constant factor, but it still scales as O(M*N) where M is the number of new atoms and N is the number of total atoms. The implementation suggested in #264 used a kd-tree to improve the scaling to O(M*log(N)). There's also a CellList class in Modeller we could use to improve it to O(M). For large models, that should be a lot faster.

This implementation also uses a lot of memory. If M=1000 and N=100,000, your differences and squaredDifferences arrays will each take 1.2 GB. That could be limiting for large models.

Let me see if I can come up with something using CellList.

peastman commented 1 year ago

Can you take a look at #268 and see if it looks ok to you? How is the performance on your model?

edwag commented 1 year ago

Thanks for such a fast and thoughtful response to this PR, and thanks also for your _CellList implementation! #268 looks good to me and I've left a couple of comments on that PR. In my application, performance is massively improved over the old implementation, though I haven't properly benchmarked it so I can't give a quantitative answer.

edwag commented 1 year ago

Closing because #264 is better addressed by #268