michellab / BioSimSpace

Code and resources for the EPSRC BioSimSpace project.
https://biosimspace.org
GNU General Public License v3.0
77 stars 19 forks source link

Extremely slow system manipulation in BioSimSpace #259

Closed msuruzhon closed 2 years ago

msuruzhon commented 2 years ago

Hi,

I have been running into a consistent performance issue within BioSimSpace whenever system manipulation is concerned (e.g. instantiation, addition, iteration through molecules etc.). While this performance hit is tolerable for smaller systems, it can result in manipulation times of the order of ~1 hour (and more) on solvated protein-ligand systems.

From what I have seen, the culprit seems to be the constant python-C++ communication for each molecule whenever one iterates over them. Sometimes doing this is inevitable (however I think this will still need to be addressed in the longer run, e.g. HMR on protein-ligand systems takes too long right now). However, for common cases, such as system instantiation from a group of molecules (or addition of two systems), one could compare between two different approaches: the currently implemented for loop in Python, versus a "quick-and-dirty" method which just generates a new System and saves it to a temporary file, which when loaded back in memory results in correct molecule numbers. The second method offloads most of the work to Sire, and is thus ~10 times faster.

I am attaching a script and two types of systems, where one of them (solvated ligand) has a measurable order-of-magnitude decrease in manipulation time, and the other one takes 1 minute with the fast method and ~ 1 hr with the slow (currently used) method:

import tempfile
import time
from pathlib import Path

from Sire import Mol as SireMol
from Sire import System as SireSystem

import BioSimSpace as BSS

def copy_properties(old_system, new_system):
    for prop in old_system._sire_object.propertyKeys():
        val = old_system._sire_object.property(prop)
        new_system._sire_object.setProperty(prop, val)

def fast_system_init(molecules):
    new_system = SireSystem.System("BioSimSpace System")
    molgrp = SireMol.MoleculeGroup("all")
    molgrp.add(molecules._sire_object)
    new_system.add(molgrp)
    new_system = BSS._SireWrappers.System(new_system)

    # We now renumber by saving and realoading through Sire - this is much faster than through Python
    any_perturbable = any(x._is_perturbable for x in molecules)
    with tempfile.TemporaryDirectory() as tempdir:
        base = Path(tempdir)
        if any_perturbable:
            BSS.IO.savePerturbableSystem(f"{base}/temp", new_system)
            new_system = BSS.IO.readPerturbableSystem(
                f"{base}/temp0.prm7",
                f"{base}/temp0.rst7",
                f"{base}/temp1.prm7",
                f"{base}/temp1.rst7",
            )
        else:
            BSS.IO.saveMolecules(f"{base}/temp", new_system, "prm7,rst7")
            new_system = BSS.IO.readMolecules(
                [
                    f"{base}/temp.prm7",
                    f"{base}/temp.rst7",
                ]
            )

    return new_system

old_system = BSS.Stream.load("Serialised_protein.s3")
# Uncomment the next line if you don't want the process to take 1 hr ;)
# old_system = BSS.IO.readMolecules(["system.gro", "system.top"])
molecules = old_system.getMolecules()

print("Generating new system through temporary files...")
t0 = time.time()
new_system = fast_system_init(molecules)
copy_properties(old_system, new_system)
t1 = time.time()
print(f"Process took {t1 -t0} seconds")

print("Generating new system from molecules...")
t0 = time.time()
new_system = BSS._SireWrappers.System(molecules)
copy_properties(old_system, new_system)
t1 = time.time()
print(f"Process took {t1 -t0} seconds")

Since I imagine that asking for a performance increase in the general case would involve too much work right now, is the method I've showed above viable as a hacky solution for now? I need to manipulate the BioSimSpace systems whenever I squash and unsquash the AMBER systems, so doing it the slow way is right now unusable.

lohedges commented 2 years ago

Yes, this is a general issue with BioSimSpace. Typically we take the approach of re-implementing any of the slow code within the C++ layer (Sire) then re-wrapping this to Python. For example, this could be done quite easily for the HMR code.

I agree that the system manipulation has become cumbersome, largely due to the requirement of "fixing" the original system topology and making sure that molecule number etc. is consistent when molecules are added/removed from the system. Your approach is valid and is something that I have used in the past. I'll think about implementing something similar when I get a chance. It might be a bit more complicated when adding molecules to an existing system, although a gap in the numbering should be fine as long as the ordering is correct.

msuruzhon commented 2 years ago

Thanks @lohedges I will implement it in the AMBER branch as a temporary fix which will be called by the squash and the unsquash methods then, if that's okay.

As for adding two systems, I think the above approach also works in this case (so far it hasn't given me any issues while testing), where you call molgrp.add() twice for each of the molecule groups.

lohedges commented 2 years ago

Looking at the existing code, it could easily be improved by not doing most of the operations when the initial system is empty, i.e. when creating a new system.

As for adding two systems, I think the above approach also works in this case (so far it hasn't given me any issues while testing), where you call molgrp.add() twice for each of the molecule groups.

Great. That's good to know.

lohedges commented 2 years ago

Just a quick update on this...

I've now re-written most of the slow functionality from BioSimSpace in pure C++ using Sire, then wrapped this back to Python. This should significantly speed up hydrogen mass repartitioning and any manipulations involving water molecules, e.g. swapping topologies.

For the main problem listed above, i.e. the slow renumbering of constituents to ensure unique atom and residue numbers, while faster in pure C++, this is still significantly slower than the temporary file approach for large systems. For the large example system, it takes roughly 25 seconds going via intermediate files and 650 via the C++ layer. In contrast, the small system is quicker to be processed directly by Sire, taking around half the time.

The main issue is that Sire won't allow updates to molecules that change the numbering, since the UID of the molecule changes. This means that you need to delete and re-add the molecule, which is far slower. Annoyingly this renumbering is only needed since SOMD chooses to identify the perturbable molecule by residue number, which isn't guaranteed to be unique when adding/editing molecules in the Python layer. (As you've found, they are unique when reading a new system from file.)

For now, I suggest that you continue to use your temporary file approach. It isn't a general solution, since we can't guarantee that the system can be written to AMBER format files, i.e. we would have to write to a range of possible formats, that might be lossy. I'll think about editing SOMD so that it can identify the molecule by a different means, e.g. name, which would be easy to make unique. (We could also just renumber the residue with a unique number, since it doesn't affect the ordering of the molecule and isn't used for anything else, I don't think.)

The changes will be pushed after Christmas when I've had time to run some further tests. I've also been working on some general Python tweaks and improvements, which will be pushed at the same time. I'm now off until the New Year, and will be unlikely to look at anything until Jan 5th at the earliest.

Thanks again for all your input over the past few months or so. It's been great to have someone going through the code with a fine-tooth comb.

Cheers!

lohedges commented 2 years ago

Actually, the major problem is that SOMD re-orders (randomly) the molecules in the system. This is why we need to re-map them by unique atom/residue numbers when calling getSystem. The perturbable residue number will be easy to fix so I'll try to figure out how to stop SOMD re-ordering things (other than the perturbable molecule) in the first place.

msuruzhon commented 2 years ago

Thanks for everything Lester, this sounds good. I am currently using the temporary hacky approach in a separate branch (you can find the relevant commit here). The only bits that needed changing are the renumbering using save/load and re-ordering the post-simulation SOMD PDB file with a Python parser before loading it in (see here). With these things fixed, the I/O speed is about 1-2 minutes for a protein-ligand system, which is a reasonable time.

lohedges commented 2 years ago

I've changed the Sire wrappers so that they create a new system when performing edits (as would be done when going via intermediate file) rather than updating the molecules, then needing to delete and re-add when the UID changes. This makes things faster so that the time difference for the large example system is now only a factor of 4, i.e. 25 seconds via intermediate files vs 100 using the Python wrappers. This is probably acceptable and the C++ code code easily be parallelised using TBB too.

Unfortunately we're having some issues with our Azure build timing out (both for Sire and BioSimSpace) and I'll need to resolve this before any changes will be available. Hopefully it's an easy fix, but if it's the result of throttling from Docker/Anaconda then it could be tricky. I won't be able to look at this until after I'm back from holiday on the 17th.

Just a quick note on the fastInit function in your code. You could speed this up further by adding molecules to the molecule group object, then adding this to the system in one go. This is what I do in the C++ code.

Cheers.

msuruzhon commented 2 years ago

Many thanks for that Lester, this sounds good. Does this process still take the full amount of time (e.g. 100 seconds for the large system) even if we add a single molecule to the large system? How does this work during solvation with a large box size (because it was also affected by this performance decrease)?

Cheers.

lohedges commented 2 years ago

When adding a single molecule to a large system it would be much faster since it will only renumber the constituents from the new molecule, since the others will already be correctly numbered. I'll check what happens with solvation. The issue here is that we need to preserve any existing water molecules and ions in the system and make sure that the chosen water model is compatible with these. This involves extracting water molecules from the solvated system and combining them with the original system.

I'm away from this afternoon until Friday so will take a look early next week.

Cheers.

lohedges commented 2 years ago

The Sire updates have now been merged in so BioSimSpace molecular editing performance should be greatly improved. (Along with HMR and water topology transformation.)

Let me know if things are working okay at your end and we can close this issue. Feel free to open other issues if you find other performance blockers. I can work to wrap those within Sire too.

msuruzhon commented 2 years ago

Hi @lohedges I am testing the new updates and it does seem that when I create a new Somd process, it takes far less time than before, even for protein-ligand systems! However, it seems that on loading back in there is still a considerable performance penalty, which seems to be mostly due to the _updateCoordinatesandVelocities() method. Is there anything that can be done about that? I have hacked it by modifying the try/except block of Somd.getSystem() as follows:

      # Try to grab the latest coordinates from the binary restart file.
      try:
          _fastFixSomdPDB(self._restart_file)
          new_system = _IO.readMolecules(self._restart_file)

          # Since SOMD requires specific residue and water naming we copy the
          # coordinates back into the original system.
          old_system = self._system.copy()

          old_pertmol = old_system[0]
          new_pertmol = new_system[0]

          old_system.removeMolecules(old_pertmol)
          new_system.removeMolecules(new_pertmol)

          pertmol = self._updateCoordinates(old_pertmol.toSystem(), new_pertmol.toSystem())

          rest_of_the_system = None
          if old_system.nMolecules():
              with _tempfile.TemporaryDirectory() as tempdir:
                  base = _Path(tempdir)
                  _IO.saveMolecules(f"{base}/temp", old_system, "prm7")
                  _IO.saveMolecules(f"{base}/temp", new_system, "rst7")
                  rest_of_the_system = _IO.readMolecules([f"{base}/temp.rst7", f"{base}/temp.prm7"])

          new_system = _fastSystemInit(pertmol, rest_of_the_system)
          for prop in old_system._sire_object.propertyKeys():
              val = old_system._sire_object.property(prop)
              new_system._sire_object.setProperty(prop, val)

          return new_system

      except:
          return None

with the helper functions defined here. Is there a way to incorporate such a strategy more generally in BioSimSpace?

Many thanks.

lohedges commented 2 years ago

Thanks for the update. I'll look at wrapping this section in the C++ layer too. I had forgotten that I needed to remap the atoms when calling getSystem, which involves a lot of searching. The mapping is consistent so it can be done once and stored for re-use, but it'd still be faster (and easy enough) to do the whole thing in C++. I should be able to test this later in the week.

Cheers.

lohedges commented 2 years ago

Hi @msuruzhon, I've now added this functionality. Everything is wrapped to C++ and I use the same approach for all of the other MD engines too. This should make the SOMD getSystem() call much quicker. Not that the first call will be the slowest, since it needs to create the molecule index mapping between the original system and the one loaded from the restart file. This can be re-used (behind the scenes) when calling getSystem again. For the other engines, a default mapping is passed (which perfectly maps indices one-to-one) so that it never needs to be computed in the first place. (It is only computed when an empty mapping is passed.)

Let me know if you come across any other bottlenecks, or if this doesn't work as anticipated.

Cheers.

msuruzhon commented 2 years ago

Hi @lohedges apologies for the very late reply but I am only now starting to take a closer look into SOMD.

I tried testing the new Sire.IO.updateCoordinatesAndVelocities() function but I get an old familiar boost error:

  File "/Users/msuruzhon/Projects/BioSimSpace/python/BioSimSpace/Process/_somd.py", line 682, in getSystem
    self._property_map)
Boost.Python.ArgumentError: Python argument types in
    Sire.IO._IO.updateCoordinatesAndVelocities(System, System, dict, bool, dict, dict)
did not match C++ signature:
    updateCoordinatesAndVelocities(SireSystem::System system0, SireSystem::System system1, QHash<SireMol::MolIdx, SireMol::MolIdx> molecule_mapping, bool is_lambda1=False, SireBase::PropertyMap map0=[  ], SireBase::PropertyMap map1=[  ])

This time, however, I don't only observe it on macOS, but on Linux as well. Have you had similar problems with this? I am using the latest Sire conda build for context.

lohedges commented 2 years ago

Hmmm, I haven't experienced this locally but will try to debug this afternoon. When adding the functionality I did need to expose wrappers for the molecule mapping input, i.e. between a Python dict and QHash<SireMol::MolIdx, SireMol::Moldx>, but those should have been added to the 2022.1.0 Sire release. I'll check in case I've missed something.

I assume this issue is occurring for all SOMD simulations, right? If so, I should be able to trigger it with a basic minimisation protocol.

msuruzhon commented 2 years ago

Thanks @lohedges, yes this issue seems to be unrelated to the protocol.

lohedges commented 2 years ago

Working in the BioSImSpace demo directory, the following works for me:

import BioSImSpace as BSS

system = BSS.IO.readMolecules("amber/ala/*")
protocol = BSS.Protocol.Minimisation(steps=100)

process = BSS.Process.Somd(system, protocol)
process.start()
process.wait()

new_system = process.getSystem()

Could you confirm that this fails for you?

I've also tested repeat calls of process.getSystem(), which will use the mapping generated by the first call as input, i.e. so it doesn't need to recreate the molecule mapping.

msuruzhon commented 2 years ago

Yes, when I assert assert new_system is not None, I get an error, so it fails while loading.

msuruzhon commented 2 years ago

And on removing the try/except part of the code, I do get the same error:

  File "/Users/msuruzhon/Projects/BioSimSpace/python/BioSimSpace/Process/_somd.py", line 682, in getSystem
    self._property_map)
Boost.Python.ArgumentError: Python argument types in
    Sire.IO._IO.updateCoordinatesAndVelocities(System, System, dict, bool, dict, dict)
did not match C++ signature:
    updateCoordinatesAndVelocities(SireSystem::System system0, SireSystem::System system1, QHash<SireMol::MolIdx, SireMol::MolIdx> molecule_mapping, bool is_lambda1=False, SireBase::PropertyMap map0=[  ], SireBase::PropertyMap map1=[  ])
lohedges commented 2 years ago

Just tested with the conda package and getSystem() is returning a NoneType. Editing the source code to raise the exception caught by the try block gives the error you are seeing. How annoying! This is using the Python 3.9 variant. I'll try another Linux conda version and report back.

This is a little worrying, since they are fairly basic wrappers which are working fine locally. (I wonder what else might not be working?)

msuruzhon commented 2 years ago

Maybe this is related to the previously observed error in macOS that got fixed by the recent update? I can also confirm that this is observed in Python 3.7.

lohedges commented 2 years ago

I'll also check what it's passing to the function, since it might not be loading the system correctly before trying to apply the remapping. Off for a bit now but will come back to this later.

lohedges commented 2 years ago

It's just the function call. The following works with a local build:

import BioSimSpace as BSS
from Sire.IO import updateCoordinatesAndVelocities

system = BSS.IO.readMolecules("amber/ala/*")

new_system, mapping = updateCoordinatesAndVelocities(
    new_system._sire_object, new_system._sire_object)

With the conda package I get:

ArgumentError: Python argument types in
    Sire.IO._IO.updateCoordinatesAndVelocities(System, System)
did not match C++ signature:
    updateCoordinatesAndVelocities(SireSystem::System system0, SireSystem::System system1, QHash<SireMol::MolIdx, SireMol::MolIdx> molecule_mapping, bool is_lambda1=False, SireBase::PropertyMap map0=[  ], SireBase::PropertyMap map1=[  ])

The help documentation is the same for updateCoordinatesAndVelocities in both cases, with the same default arguments.

lohedges commented 2 years ago

It doesn't appear to be related to the previous atom property issue, which was fixed by creating and extracting each type of atom property prior to them being used in Python. This is done on import of the Sire.IO module here. If I commet out the call to fix_atomproperty_types() I still see the same issue.

lohedges commented 2 years ago

Okay, think I've found it. For the input argument and return type I'm wrapping the following:

    register_dict< QHash<MolNum,MolNum> >();

    register_tuple< boost::tuple<System,QHash<MolNum,MolNum>> >();

These should be MolIdx objects, not MolNum. I'll rebuild the wrappers locally and push. I'm still confused why it's working locally despite this inconsistency.

lohedges commented 2 years ago

If I correct it, then the local copy no longer works. Bizarrely, it does work when MolNum is used. Perhaps there is some implicit conversion going on behind the scenes that doesn't work in all cases.

lohedges commented 2 years ago

Now having reverted the code to the original MolNum version and recompiled it's no longer working. This makes me think that I had some weird local version installed when I was trying to get this to work in the first place. I would have thought the wrappers would have been overwritten on subsequent builds, though.

lohedges commented 2 years ago

Okay, I've got it working again using the correct formatting for the containers. I'll let this rebuild and test the resulting conda package when it's finished.

lohedges commented 2 years ago

Assuming this works, I think what might have happened is that I rebuilt wrappers for something else in the docker container then copied the output into my working directory, overwriting the locally modified containers file. (This is a manually edited file, not autogenerated like the others.) I've also had some issues copying back wrappers before, since the timestamps can be out of sync with what is stored locally, which confused CMake.

I guess we'll see in a few hours :crossed_fingers:

lohedges commented 2 years ago

I can confirm that the issue is fixed on updating the conda package. Let me know how you get on. (I've not tested on macOS, but assume that it's working too.)

Thanks for reporting this. I can only assume that this was a result of an overwritten wrapper file as described above, with the change not being picked up locally due to a CMake timestamp issue. (It failed once I did a full rebuild.) Something that I'll be aware of in future when copying wrappers outside of the container.

Cheers!

msuruzhon commented 2 years ago

Many thanks for that @lohedges, I can confirm it works now on both macOS and Linux.

Unfortunately, however, the loading times are still very slow for big systems (> 20 minutes). Is it not possible to do this matching based on residue number rather than a (presumably) quadratically scaling loop? I remember I wrote a hack that rearranges the PDB file based on residue number and it worked fine for everything I tested it on. Or is that not a general solution?

Cheers.

lohedges commented 2 years ago

Ah, shame to hear that the performance is still sub-par. The way the current implementation is to remap the molecules by matching by atom numbers between molecules. The same would work for residue number too, since BioimSpace ensures that both are unique when manipulating systems, e.g. when adding molecules together. Since this uses the Sire search functionality it is searching for all matching molecules, rather than just the first. As such, there might be scope for a performance gain by aborting when the first match is found, since we know that the numbers are unique. I'll try to do this. (I might have to manually do the search.)

Other than that, I don't think there's too much we can do other than re-writing SOMD so that molecuels aren't re-ordered. This will hopefully be done in future when we get a block of time to refactor SOMD.

lohedges commented 2 years ago

Okay, I've re-written things so that I avoid using the search functionality. This means that we find the first match and exit (both when searching molecules, and atom numbers within a molecule). In my testing it's quite a bit faster. This should be even more pronounced in a large system. (Especially if the protein is large too.)

Once it's rebuilt, let me know if this makes the performance good enough.

(Annoying that I didn't think of this to begin with. The search functionality works in parallel, so is generally very quick, but doesn't assume that you only want the first match, i.e. you have already done work to make you search query unique. At some point I might try to add an option to the search so that it will exit as soon as a match is found, but this might more work that anticipated due to the way that multi query searches work.)

msuruzhon commented 2 years ago

Thanks @lohedges I will test it and let you know. Just out of curiosity: why do you need to match them atom-by-atom? Presumably the atom order is kept the same within molecules so you only need one atom from the molecule? Or is that not the case?

lohedges commented 2 years ago

Yes, I could just check against the first atom in the molecule. When I originally wrote it I assumed generality, since I wasn't quite sure of what other unexpected things SOMD might do. I've just updated this now and it is still working as expected. Will push another update shortly.

msuruzhon commented 2 years ago

Many thanks for that @lohedges, on a first pass this looks much faster than before at only ~6 seconds compared to > 20 minutes (it is also several times faster than my hack)! However, I will have to test it more to say for sure. I will update you on that in due course.

Cheers.

msuruzhon commented 2 years ago

@lohedges this has been working constistently well for me so I am going to close the issue. If I encounter any other performance bottlenecks, I will open a new one. Cheers!

lohedges commented 2 years ago

Fantastic. Thanks for confirming that things are working well.