volkamerlab / opencadd

A Python library for structural cheminformatics
https://opencadd.readthedocs.io
MIT License

Slow performance of atom selection with MDAnalysis #47

Open dominiquesydow opened 3 years ago

dominiquesydow commented 3 years ago

We are using the opencadd.structure.superposition module in TeachOpenCADD talktorial 010 and observed that selecting atoms with MDAnalysis (used in that module) is slow.

See PR: https://github.com/volkamerlab/TeachOpenCADD/pull/44

cProfile / snakeviz

Profiled code

from MDAnalysis.analysis import rms

from opencadd.structure.core import Structure
from opencadd.structure.superposition.engines.mda import MDAnalysisAligner

def calc_rmsd(A, B):
    """
    Calculate RMSD between two structures.

    Parameters
    ----------
    A : opencadd.structure.core.Structure
        Structure A.
    B : opencadd.structure.core.Structure
        Structure B.

    Returns
    -------
    float
        RMSD value.
    """
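    aligner = MDAnalysisAligner()
    # Selection strings for the atoms that match between both structures
    selection, _ = aligner.matching_selection(A, B)
    # Select the matched atoms in each structure
    atoms_a = A.select_atoms(selection['reference'])
    atoms_b = B.select_atoms(selection['mobile'])
    # RMSD on the coordinates as they are, without superposing first
    return rms.rmsd(atoms_a.positions, atoms_b.positions, superposition=False)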

structures = [Structure.from_pdbid(pdb_id) for pdb_id in ["3w2s", "3poz"]]
proteins = [Structure.from_atomgroup(s.select_atoms("protein")) for s in structures]
calc_rmsd(proteins[0], proteins[1])
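
For reference, a profile like this can be collected with the standard-library cProfile and pstats modules. The following is only a minimal sketch: `workload` is a hypothetical stand-in that does many fnmatch-style name matches, so that the snippet runs without MDAnalysis or network access. In practice the profiled call would be `calc_rmsd(proteins[0], proteins[1])`.

```python
import cProfile
import fnmatch
import io
import pstats


def workload():
    """Hypothetical stand-in for calc_rmsd: many glob-style name matches."""
    names = ["CA", "CB", "N", "O", "C"] * 2000
    return sum(fnmatch.fnmatchcase(name, "C*") for name in names)


# Profile only the call of interest
profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# To inspect the profile in snakeviz, dump it to a file first, e.g.
# profiler.dump_stats("calc_rmsd.prof"), then run: snakeviz calc_rmsd.prof

# Text report of the ten entries with the highest cumulative time
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

Sorting by cumulative time makes it easy to see which call chain dominates the run, which is the same information snakeviz shows graphically.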

Profile

(image: snakeviz visualization of the cProfile output for the code above)

In MDAnalysis.core.selection, the fnmatch module is used to look up the atoms during atom selection. Find out whether we can cache the atom selection so that superposition becomes a bit faster.
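
To illustrate the caching idea, here is a rough sketch that memoizes fnmatch-based name matching with `functools.lru_cache`. It does not use MDAnalysis at all: `select_indices` and the toy `atom_names` tuple are hypothetical and only show the general pattern. Whether something similar can be hooked into the MDAnalysis selection machinery is exactly what needs to be checked.

```python
import fnmatch
from functools import lru_cache


@lru_cache(maxsize=128)
def select_indices(atom_names, pattern):
    """Return the indices of all atom names that match a glob pattern.

    atom_names must be a tuple (hashable) so that results can be cached.
    Hypothetical helper, not part of MDAnalysis or opencadd.
    """
    return tuple(
        i for i, name in enumerate(atom_names)
        if fnmatch.fnmatchcase(name, pattern)
    )


# Toy topology: three residues with the same five atom names
atom_names = ("N", "CA", "C", "O", "CB") * 3

first = select_indices(atom_names, "CA")   # computed with fnmatch
second = select_indices(atom_names, "CA")  # served from the cache
info = select_indices.cache_info()
print(first, info)
```

A cache like this is only valid as long as the atom names do not change, and hashing a large tuple on every call has its own cost, so any gain would have to be confirmed by profiling again.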