Closed Minys233 closed 1 year ago
Thanks so much for tracking this down! It would be a great thing to optimize.
fixer.removeHeterogens() # done by pymol
Did you also hijack that function by using pymol? If so, what problems were you trying to solve?
It's trivial: just select everything that does not belong to the protein. Just like the original code here, keep the protein, RNA, and DNA residues and delete everything else. https://github.com/openmm/pdbfixer/blob/db2886903fe835919695c465fd20a9ae3b2a03cd/pdbfixer/pdbfixer.py#L1006-L1010
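To make the selection rule concrete, here is a minimal sketch that does not depend on PDBFixer or PyMOL. It assumes residues are identified by their three-letter (or shorter) PDB names. The name sets and the helper `split_heterogens` are hypothetical and written only for illustration, so they may not match the exact sets used in the linked code.

```python
# Hypothetical name sets, for illustration only; the linked PDBFixer code
# defines its own lists and may include extra names.
PROTEIN = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}
RNA = {"A", "G", "C", "U"}
DNA = {"DA", "DG", "DC", "DT"}
KEEP = PROTEIN | RNA | DNA


def split_heterogens(residue_names):
    """Return (kept, removed) index lists for a sequence of residue names."""
    kept, removed = [], []
    for index, name in enumerate(residue_names):
        # Anything that is not a protein, RNA, or DNA residue is a heterogen.
        if name.strip().upper() in KEEP:
            kept.append(index)
        else:
            removed.append(index)
    return kept, removed


kept, removed = split_heterogens(["ALA", "HOH", "DA", "ZN", "U"])
```

In this toy input, water (`HOH`) and the zinc ion (`ZN`) end up in `removed`, and the rest are kept.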
Hi researchers and developers. Recently I have been using PDBFixer to fix protein structures at a large scale, and I find that PDBFixer runs extremely slowly on large proteins.
After spending hours on the source code, I discovered that this is caused by the slow implementation of `PDBFixer._findNearestDistance()`. For details, I used the package `heartrate` to generate a code bottleneck plot, which is shown here.

Compared to other function calls, `PDBFixer._findNearestDistance()` takes the majority of the CPU running time. The numbers between the line numbers are hit counts for the running code. I used one case, 5mcp, a large protein downloaded directly from RCSB PDB, to validate the function and the output of this piece:

But the problem here is simple: just find the minimum distance between `atom` and the other atoms that are not in the same residue. This can be done efficiently with `KDTree`, implemented in `scipy`. For a simple patch, the code goes like this.

On this protein, the speedup can be over 20 times (29.248s vs >10min). However, I only tested this code on this one protein, by debugging and comparing the results of the two functions. I think more tests are needed, and I share the code here for anyone who is in a hurry :D
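For readers who want to see the idea in isolation, here is a minimal sketch of a KD-tree based nearest-distance search. It is not the patch from this issue: it works on plain coordinate arrays rather than an OpenMM context and topology, and the function name `find_nearest_distance` and its parameters are hypothetical. It assumes every atom has a residue label, and it relies only on `scipy.spatial.KDTree` and its `query` method.

```python
import numpy as np
from scipy.spatial import KDTree


def find_nearest_distance(positions, residue_ids, new_atom_indices):
    """Smallest distance from any listed atom to an atom in a different residue.

    positions: (N, 3) array-like of coordinates.
    residue_ids: length-N array-like, one residue label per atom.
    new_atom_indices: indices of the atoms to check.
    Returns inf if no atom in a different residue exists.
    """
    positions = np.asarray(positions, dtype=float)
    residue_ids = np.asarray(residue_ids)
    n_atoms = len(positions)
    tree = KDTree(positions)  # built once, reused for every query
    nearest = np.inf
    for i in new_atom_indices:
        k = min(8, n_atoms)
        while True:
            # Neighbours come back sorted by increasing distance.
            dists, idxs = tree.query(positions[i], k=k)
            dists, idxs = np.atleast_1d(dists), np.atleast_1d(idxs)
            other = residue_ids[idxs] != residue_ids[i]
            if other.any():
                nearest = min(nearest, dists[other].min())
                break
            if k == n_atoms:
                break  # every atom shares this atom's residue
            k = min(2 * k, n_atoms)  # widen the search and retry
    return float(nearest)
```

The neighbour count starts small and doubles only when all of the closest atoms belong to the query atom's own residue, so most queries stay cheap. As noted above, any such replacement should be compared against the original brute-force result on more than one structure before it is trusted.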