For those care about the running time

Minys233 commented 1 year ago

Hi researchers and developers there. Recently, I'm using PDBFixer to fix protein structures in a large scale, but I find that for large proteins, the PDBFixer runs extremely slow.

After spending hours on the source code, I discovered that this is caused by the low-performance implementation of the PDBFixer._findNearestDistance(). For details, I used package heartrate to generate a code bottleneck plot, which is shown here.

Compared to other function calls, PDBFixer._findNearestDistance() takes the majority of the CPU running time. the numbers between lineno. are hit times for running code. I use a case, 5mcp, a large protein which is directly download from RCSB PDB, to validate the function and the output of this piece:

from pdbfixer import PDBFixer
from openmm import app
def fix_protein(pfile_in, pfile_out):
    try:
        fixer = PDBFixer(filename=str(pfile_in))
        fixer.findMissingResidues()
        fixer.findNonstandardResidues()
        fixer.replaceNonstandardResidues()
        fixer.removeHeterogens() # done by pymol
        fixer.findMissingAtoms()
        fixer.addMissingAtoms()
        # fixer.addMissingHydrogens(7.0) # this will make about 2 times slower
        with open(pfile_out, 'w') as fout:
            app.PDBFile.writeFile(fixer.topology, fixer.positions, fout)
        return None
    except Exception as e:
        return e

fix_protein('/home/minys/5mcp.pdb', './test.pdb')

But the problem here is simple, just find the minimum distance between atom and other atoms which are not in the same residue. This can be efficiently done by KDTree implemented in scipy. For a simple patch, the code goes like this.

from pdbfixer import PDBFixer
from openmm import app
import openmm.unit as unit
from scipy.spatial import KDTree
import numpy as np
import sys

def _findNearestDistance(self, context, topology, newAtoms):
    positions = context.getState(getPositions=True).getPositions(asNumpy=True).value_in_unit(unit.nanometer)
    atomResidue = [atom.residue for atom in topology.atoms()]
    kdtree = KDTree(positions[:len(atomResidue)],)
    if not hasattr(self, 'yaosen_inresidue'):
        self.yaosen_inresidue = dict()
        for atom in newAtoms:
            self.yaosen_inresidue[atom.index] = np.array([i.index for i in atom.residue.atoms()])
    nearest = sys.float_info.max
    # if nearest < 0.13, then refine, thus we only find the near points, say 2.0.
    for atom in newAtoms:
        query = positions[atom.index]
        neighbor_index = np.array(kdtree.query_ball_point(query, 2.0, workers=32))
        neighbor_index_mask = np.isin(neighbor_index, self.yaosen_inresidue[atom.index], invert=True)
        neighbor_index = neighbor_index[neighbor_index_mask]
        # only calculate the distance between positions[neighbor_index] and query = positions[atom.index]
        if len(neighbor_index) == 0: continue
        dist = np.sqrt(np.min(np.sum((positions[neighbor_index] - query)**2, axis=1)))
        if dist < nearest:
            nearest = dist
    return nearest

PDBFixer._findNearestDistance = _findNearestDistance

def fix_protein(pfile_in, pfile_out):
    try:
        fixer = PDBFixer(filename=str(pfile_in))
        fixer.findMissingResidues()
        fixer.findNonstandardResidues()
        fixer.replaceNonstandardResidues()
        fixer.removeHeterogens() # done by pymol
        fixer.findMissingAtoms()
        fixer.addMissingAtoms()
        # fixer.addMissingHydrogens(7.0) # this will make about 2 times slower
        with open(pfile_out, 'w') as fout:
            app.PDBFile.writeFile(fixer.topology, fixer.positions, fout)
        return None
    except Exception as e:
        return e

fix_protein('/home/minys/5mcp.pdb', './test.pdb')

On this protein, the running time boost could be over 20 times (29.248s vs >10min). However, I only tested this code on this one protein, by debugging and compare the results of two functions. I think more tests are needed, and I here share the code to anyone who is in a hurry :D

peastman commented 1 year ago

Thanks so much for tracking this down! It would be a great thing to optimize.

Ruibin-Liu commented 1 year ago

fixer.removeHeterogens() # done by pymol

Did you also hijack that function as well by using pymol? If so, what problems did you try to solve?

Minys233 commented 1 year ago

fixer.removeHeterogens() # done by pymol
Did you also hijack that function as well by using pymol? If so, what problems did you try to solve?

Its trival, just select everything that does not belong to the protein, just like the original code at here, keep protein, rna and dna residues and delete everything else. https://github.com/openmm/pdbfixer/blob/db2886903fe835919695c465fd20a9ae3b2a03cd/pdbfixer/pdbfixer.py#L1006-L1010

openmm / pdbfixer

For those care about the running time #264