Abnormal memory usage in protocaller

kexul commented 3 years ago

Hi, I installed the new 1.2.0 version of protocaller, It seems like there were some bug caused abnormal memory usage. The following code, which can run normally with old version, consumpt all of memory in my machine (8G), then killed by the system.

example_code.zip

msuruzhon commented 3 years ago

Hi, the code works for me in version 1.2.0, and I also have 8 GB RAM. That being said, it does take a long time and a lot of memory due to the structural differences between the reference crystal ligand and the ligands of interest. This means that the maximum common substructure algorithm needs a lot of computational power to come up with structures that might be obvious to us, or even worse, find another solution to the problem that we are not interested in. That's why I would highly recommend engineering a custom reference ligand (with e.g. Avogadro) which is more similar to your ligands of interest and the code shouldn't struggle in this case.

Otherwise, I suspect that I could optimise the memory consumption and/or the speed of the code in the case of a relatively small common substructure but that part of the code is very complicated and I don't know when I will get the time to do that, and I am also conscious about introducing new bugs to that part of the code.

kexul commented 3 years ago

Hi @msuruzhon , how about use a core-constrained docking here. In Merck KGaa's paper , they switched from flexible ligand alignment to core-constrained docking in most cases. I've found a nice implement by rdkit and rdock.

kexul commented 3 years ago

Here is my alignment result generate by protocaller 1.2.0. Should it be considered as a good alignment?

By the way , this is the result of older version of protocaller.

Here are the .gro files. morph.zip

msuruzhon commented 3 years ago

This alignment is an artifact of the current MCS algorithm not handling the (known) case where the stereochemical label of an atom changes between molecules, resulting in a diminished MCS, and therefore bad alignment. It seems that this case should be handled properly, but it will take some time for me to implement it. Until then, a practical solution for that is to create a perturbation "manually" by discarding stereochemical mapping, e.g.:

from ProtoCaller.Ensemble import Perturbation
pert = Perturbation(ligands_proto[0], ligands_proto[1])
pert.alignToReference(protein_proto.ligand_ref)
pert.alignToEachOther(mcs_parameters=dict(keep_stereo=False))
morphs = [pert]

and then pass the morphs to the Ensemble constructor. This should hopefully work for now until I implement a fix at some point.

As for docking, this is certainly an interesting functionality, but it's unfortunately not something I can devote much attention to. If you want to experiment with it, you can certainly extract the parametrised complex_template object from the parametrised protein and save it as a pdb, which you can then use for docking. Afterwards, you can use the resulting ligand coordinates to create the Ligand object, and use the ligand as its own reference, so that no alignment is actually done. I hope this makes sense somewhat. If you manage to figure out how to do a nice workflow, you are welcome to create a pull request for me to review in the future.

kexul commented 3 years ago

I'm glad to implement a docking based workflow, I'll create a pull request as soon as possible.

By the way, the code to disable stereo mapping raise an error here:

INFO:numexpr.utils:NumExpr defaulting to 4 threads.
INFO:root:Downloading ligand files from the Protein Data Bank...
INFO:root:Parametrising ligand ligand0...
WARNING:root:Cannot parametrise unprotonated ligand. Protonating first with default parameters...
INFO:root:Running OpenBabel... 
INFO:root:Running OpenBabel... 
INFO:root:Running antechamber... 
INFO:root:Running parmchk2... 
INFO:root:Running tleap... 
INFO:root:Parametrising ligand ligand1...
WARNING:root:Cannot parametrise unprotonated ligand. Protonating first with default parameters...
INFO:root:Running OpenBabel... 
INFO:root:Running OpenBabel... 
INFO:root:Running antechamber... 
INFO:root:Running parmchk2... 
INFO:root:Running tleap... 
INFO:root:Running PDB2PQR... 
INFO:root:Running OpenBabel... 
INFO:root:Parametrising original crystal system...
INFO:root:Running tleap... 
INFO:root:Running tleap... 
INFO:root:Creating morph ligand0~ligand1...
INFO:root:Morph ligand0~ligand1 is already aligned to a reference
INFO:root:Morph ligand0~ligand1 is already aligned to a reference
Traceback (most recent call last):
  File "prepare.py", line 49, in <module>
    prepare_system(ligands)
  File "prepare.py", line 43, in prepare_system
    system.prepareComplexes()
  File "/root/miniconda3/envs/uii/lib/python3.7/site-packages/ProtoCaller/Ensemble/__init__.py", line 203, in prepareComplexes
    morph_BSS, mcs =  morph.alignAndCreateMorph(self.protein.ligand_ref)
TypeError: cannot unpack non-iterable NoneType object

Here is my code:

import logging
import os
logging.basicConfig(level=logging.INFO)

from ProtoCaller.Utils.fileio import Dir
from ProtoCaller.Ensemble import Ligand, Protein, Ensemble, Perturbation

def prepare_system(ligands) -> None: 
    """
    pdb_file: cocrystal file name 
    ligands: list of ligands file name
    """
    with Dir('SYSTEM', overwrite=True):
        # create a protein from its PDB code and the residue number of the ligand
        # we are going to use for mapping
        #lig = Ligand('/proto/ref.pdb', minimise=False, protonated=True)
        protein_proto = Protein('3eyg', ligand_ref='1')

        ligands_proto = []
        for idx, item in enumerate(ligands):
            li = Ligand(item, protonated=False, minimise=False, name=f"ligand{idx}", workdir="Ligands")
            ligands_proto.append(li)

        # create the morphs from the ligands
        pert = Perturbation(ligands_proto[0], ligands_proto[1])
        pert.alignToReference(protein_proto.ligand_ref)
        pert.alignToEachOther(mcs_parameters=dict(keep_stereo=False))
        morphs = [pert]

        # create a system from the protein and the morphs and set up some default
        # settings regarding system preparation
        system = Ensemble("GROMACS", protein=protein_proto, morphs=morphs,
                          box_length_complex=7, ligand_ff="gaff2",
                          workdir=protein_proto.workdir.path)

        # only keep the reference ligand and keep all crystallographic waters
        system.protein.filter(ligands=None, waters="all")
        # protonate the protein using PDB2PQR and add missing atoms / residues if
        # needed
        system.protein.prepare()
        # prepare the complex and solvated leg starting structures and save them as
        # GROMACS input
        system.prepareComplexes()

if __name__ == '__main__':
    ligands = ['/proto/mola.pdb', 
            '/proto/molb.pdb']

    prepare_system(ligands)

msuruzhon commented 3 years ago

This error is due to some artifacts in the code back when I was refactoring that class. I have created some new commits which you can download from the essexlab/label/dev conda channel under the developmental 1.2.1 version. That being said, I have found another issue with the alignment, so the best bet until i fix it for this case is to turn off the recursive bond breaking algorithm (which is supposed to grow the MCS structure to a bigger physical one) during the first part of the alignment, namely:

from ProtoCaller.Ensemble import Perturbation
pert = Perturbation(ligands_proto[0], ligands_proto[1])
pert.alignToReference(protein_proto.ligand_ref, mcs_parameters=dict(break_recursively=False))
pert.alignToEachOther(mcs_parameters=dict(keep_stereo=False))
morphs = [pert]

This molecule seems to present a lot of problems for the software, so this makes it a very useful test case when the bugs are finally addressed. I will probably add it to the unit tests.

kexul commented 3 years ago

Hi, I updated my protocaller and used the code you provided, the alignment seems perfect now! Thank you so much!

But, it seems like, some abnormal bonds showed in the ligand in morph.gro and complex_final.gro. For example: atom 46 and 13; atom 46 and 14. The following mdrun failed with LINCS warning complaining about atom distance.

Here is my output files. 3eyg.zip

Is it reasonable? Is there any clue to diagnose the morph.gro and complex_final.gro ? Since the alignment file was temporary file that were removed in the output folder in the normal pipeline.

msuruzhon commented 3 years ago

This is strange, because I get the following alignment when I run it: snap

These are the output files + the input script: Files.zip

Maybe you can find the differences between your script and mine? Otherwise this doesn't make any sense.

Also, when you call the alignment method manually, it should create aligned PDB and INPCRD files in the folder you are calling it from, so you can visualise them.

kexul commented 3 years ago

Hi, @msuruzhon , I've checked the difference between our results. The alignment result of ligand1, i.e. molb was similar. But, in my result, atom 30 was close to atom 9 and atom 42, their distances were 0.9 angstrom. In your result, the same residual was close to the heterocycle, but the closest distance, i.e. the distance between atom 31 and atom 10 was 1.4 angstrom.

Here is my result of alignment. alignment.zip

Most of our code are identical, except you parametise the ligands before createing the system. Here is my code used to generate the system:

import logging
import os
logging.basicConfig(level=logging.INFO)

from ProtoCaller.Utils.fileio import Dir
from ProtoCaller.Ensemble import Ligand, Protein, Ensemble, Perturbation

def prepare_system(ligands) -> None: 
    """
    pdb_file: cocrystal file name 
    ligands: list of ligands file name
    """
    with Dir('SYSTEM', overwrite=True):
        # create a protein from its PDB code and the residue number of the ligand
        # we are going to use for mapping
        #lig = Ligand('/proto/ref.pdb', minimise=False, protonated=True)
        protein_proto = Protein('3eyg', ligand_ref='1')

        ligands_proto = []
        for idx, item in enumerate(ligands):
            li = Ligand(item, protonated=False, minimise=False, name=f"ligand{idx}", workdir="Ligands")
            ligands_proto.append(li)

        # create the morphs from the ligands
        pert = Perturbation(ligands_proto[0], ligands_proto[1])
        pert.alignToReference(protein_proto.ligand_ref, mcs_parameters=dict(break_recursively=False))
        pert.alignToEachOther(mcs_parameters=dict(keep_stereo=False))
        morphs = [pert]

        # create a system from the protein and the morphs and set up some default
        # settings regarding system preparation
        system = Ensemble("GROMACS", protein=protein_proto, morphs=morphs,
                          box_length_complex=7, ligand_ff="gaff2",
                          workdir=protein_proto.workdir.path)

        # only keep the reference ligand and keep all crystallographic waters
        system.protein.filter(ligands=None, waters="all")
        # protonate the protein using PDB2PQR and add missing atoms / residues if
        # needed
        system.protein.prepare()
        # prepare the complex and solvated leg starting structures and save them as
        # GROMACS input
        system.prepareComplexes()

if __name__ == '__main__':
    ligands = ['/proto/mola.pdb', 
            '/proto/molb.pdb']

    prepare_system(ligands)

I think the difference may due to the randomness of rdkit minimization outside of the MCS part.

msuruzhon commented 3 years ago

It does seem to be the case that the minimisation algorithm introduces randomness - I also get different results from different runs. I guess that's another argument for designing custom reference ligands, as alignment can't always be trusted.

kexul commented 3 years ago

Another bad case where default parameters will not work correctly. test_align.zip

Here is the minimum code to reproduce, it will consume all of memory in my machine.

from ProtoCaller.Ensemble import Ligand, Perturbation

lig1 = Ligand('mol810_protonated.sdf', protonated=True, minimise=False,
                      parametrised_files=['mol810.prmtop', 'mol810.inpcrd'])
lig2 = Ligand('mol812_protonated.sdf', protonated=True, minimise=False,
              parametrised_files=['mol812.prmtop', 'mol812.inpcrd'])
pert = Perturbation(lig1, lig2)
pert.alignToEachOther()

When using mcs_parameters=dict(break_recursively=False), it will work correctly.

msuruzhon commented 3 years ago

This won't be straightforward to fix, and I am considering turning off the recursive bond breaking algorithm by default until it is viable.

kexul commented 3 years ago

Hi @msuruzhon , recently, I tried the rmsdAlign function of BioSimSpace, it produced some good results and may be a good fallback plan when ProtoCaller couldn't do well in some cases.

It's easy to implement it thanks to the modular design of ProtoCaller, here is the code snippet I've used:

class BSSPerturbation(Perturbation):
    def __init__(self, ligand1, ligand2, name=None):
        Perturbation.__init__(self, ligand1, ligand2, name)

    def alignAndCreateMorph(self, ref):
        _runexternal.runExternal(f'obabel {ref.protonated_filename} -O ref.pdb')
        ref_bss = _BSS.IO.readMolecules('ref.pdb').getMolecule(0)

        lig1_bss = _BSS.IO.readMolecules(self.ligand1.parametrised_files).getMolecule(0)
        lig2_bss = _BSS.IO.readMolecules(self.ligand2.parametrised_files).getMolecule(0)
        lig1_aligned = _BSS.Align.rmsdAlign(lig1_bss, ref_bss)
        lig2_aligned = _BSS.Align.rmsdAlign(lig2_bss, lig1_aligned)
        merged = _BSS.Align.merge(lig1_aligned, lig2_aligned)
        return merged, None

protocaller / ProtoCaller

Abnormal memory usage in protocaller #22