Hi @jmichel80, thanks for the input. I'll get on this after the weekend.
Hi @JenkeScheen ,
A few comments/questions after going through your setup notebook:
Hi both,
Here are a few comments:
The notebook assumes the user has access to parameterised ligands (in vacuum) that have coordinates matching those of the parameterised protein input. We need separate BSS nodes to generate this type of input starting from 'sane' PDB/mol2 files.
Yes, we'll have standalone nodes for parameterisation and solvation. For the purposes of the tutorial I am assuming that you'll just have the pre-prepared systems, so you would just talk about those stages and what the input/output would be, rather than actually running them. I don't think we will be able to use nodes for minimisation or equilibration, though, since those would lose the concept of a system containing a merged molecule when writing the output to file. (We could just use the coordinates from the minimised/equilibrated systems and run `system._updateCoordinates` to copy them into the FEP system, but this would need us to reconstruct the system before doing so.)
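As a minimal sketch of that workaround (the file names and the pre-built `fep_system` object are hypothetical):

```python
import BioSimSpace as BSS

# Load the solvated system (hypothetical file names).
system = BSS.IO.readMolecules(["complex.prm7", "complex.rst7"])

# Minimise with whichever MD engine BioSimSpace finds.
protocol = BSS.Protocol.Minimisation(steps=1000)
process = BSS.MD.run(system, protocol)
process.wait()

# Extract the minimised system from the process.
minimised = process.getSystem()

# Copy the minimised coordinates into the previously constructed merged
# FEP system. _updateCoordinates is a private method, so this relies on
# internals and assumes the system composition matches.
fep_system._updateCoordinates(minimised)
```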
The inputs haven't been energy minimised/equilibrated before starting the production runs. This may cause SOMD to crash. It would be better to equilibrate the solvated ligand0:complex, and then use a snapshot after equilibration to generate the merged molecule.
You can run minimisation/equilibration on a system containing a merged molecule. It simply creates the actual system at lambda = 0, then copies the coordinates back into the merged system when you call `process.getSystem()`. As mentioned above, this would work if you ran minimisation and equilibration in a single script/node prior to running the FEP simulation, but wouldn't work in standalone scripts/nodes, since you would lose the merged system. (Currently we can only write it to file, rather than being able to reconstruct it.)
The SLURM submission script will run each lambda value serially. It would be better to run the lambda windows in parallel to decrease the time to answer.
If we want to run lambda windows in parallel then we are going to need to do this outside of BioSimSpace, i.e. just use it for the setup of the FEP simulation, as is done by `prepareFEP`. We could then provide example batch scripts (tested on UCB systems) to show how to run each window in parallel using a job array, or similar.
I also think it would be easier to provide a script that runs `analyse_freenrg` for each ligand pair. This would avoid the need to go back into BioSimSpace for this single stage afterwards.
The `ProcessRunner` that is used by `FreeEnergy.run` can run jobs in parallel, but this has only been tested on multi-CPU jobs, and not on an HPC system. I think it could work if there was a multi-GPU system configured using `CUDA_VISIBLE_DEVICES`, but I think UCB have single-GPU nodes with exclusive access.
I'm still a little confused as to whether we will be assuming that the user will be setting up the FEP job on the system, or whether they are doing this locally and copying directories across. If the latter, then I don't think we're going to be able to reconstruct the `FreeEnergy` object, but it would be fine if we did the submission script with parallelised `somd_freenrg` jobs, which is the way most people actually run these things in practice.
Just reading this bit again...
It would be better to equilibrate the solvated ligand0:complex, and then use a snapshot after equilibration to generate the merged molecule.
Yes, that would work, i.e. equilibrating the complex prior to performing the merge. (This is what happens in your diagram, I think.) I imagine that this might confuse some people, i.e. solvating, equilibrating, extracting the molecules, merging, then solvating again. However, it's probably the easiest way if we want to decouple the stages.
Hi, I have given more thought to the best way to implement the FEP pipeline. From a user standpoint, once it has been decided how to process a dataset there shouldn't be a need for further intervention until the results are available for processing. The complexity is that executing the pipeline efficiently requires using different types of resources at different stages.
I have sketched the execution model here.
We can solve this in SLURM using a mixture of jobs, job arrays, job dependencies, and a bit of bash scripting.
We can build in some degree of resilience by having each script do some basic error checking (is the expected input present?). Ideally, the pipeline is structured such that it could be executed again without repeating steps that already completed successfully.
To keep it compatible with the BSS philosophy it should be possible to execute the pipeline with different FEP engines (SOMD or GROMACS as it stands). I think this is possible, as the user will specify the chosen execution engine in the planFEP notebook.
I have uploaded a skeleton SLURM bash script that implements the execution model on plato. I have verified that each step is executed with correct dependencies using dummy python scripts. See https://github.com/michellab/BioSimSpaceTutorials/tree/fep/03_fep/execution_model
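For anyone following along, the dependency chaining can be driven from Python along these lines (a sketch; the script names and array sizes are illustrative, not the actual files in the repo):

```python
import subprocess

def sbatch(script, array=None, after=None):
    """Submit a SLURM script and return its job id (via --parsable)."""
    cmd = ["sbatch", "--parsable"]
    if array is not None:
        cmd.append(f"--array={array}")
    if after is not None:
        # afterok: only start once the whole upstream job (array) succeeds.
        cmd.append(f"--dependency=afterok:{after}")
    cmd.append(script)
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout.strip()

# Hypothetical stage scripts mirroring the execution model.
prep_id = sbatch("BSSligprep.sh", array="0-9")                # one task per ligand
fep_id = sbatch("BSSrunFEP.sh", array="0-21", after=prep_id)  # one task per lambda/leg
sbatch("BSSanalyseFEP.sh", after=fep_id)
```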
Next steps:
If the above can be implemented, the next step would be for @JenkeScheen to implement the required calls in the four different BSS execution scripts. The setup notebook would also have to be modified to generate as output the files 'ligands.dat', 'network.dat' and 'protocol.dat', and to write the SLURM bash scripts.
If we can successfully deploy this approach on plato on a real dataset, the next step would be to write an equivalent LSF script that could work on UCB's cluster.
Hmm, there are a lot of things here that aren't currently supported and would need to be implemented / tested if we want to do this in a clean way.
Your BSSligprep.py script has equilibration stages with restraints on all non-solvent atoms. We currently only support restraints on backbone atoms (this was all that was ever asked for). I agree that it would be good to have other options, e.g. all heavy atoms, all non-water atoms, any atom indices the user likes, etc., but this would need to be implemented and tested. It's not too hard, but the different engines support this in very different ways and some rely on atom naming for the restraints, which makes things hard to generalise.
In BSSrunFEP.py, are we just running the pre-setup legs using the MD engine directly, i.e. writing command-line calls to `somd_freenrg` or `gmx mdrun`? If so, does this need to be a Python script? If not, we won't be able to do this with BioSimSpace, since we can't reconstruct the free energy object without significant work, i.e. we'd need to reload the original ligands and protein (with their original names), recreate the mapping (using whatever options the user chose), align (again with whatever options the user chose), merge, solvate, set up the entire object, etc.
Similarly to the above, BSSanalyseFEP.py seems to be using BioSimSpace to perform the analysis. Unless I can make the analysis calls static (they currently depend on the state of the `FreeEnergy` object) we can't go back inside BioSimSpace to do this having run the legs independently. It would be easier to add a single `gmx bar` or `analyse_freenrg` command to the run scripts above to analyse the data for each leg following the run. We could then collate this information ready for processing with `freenrg_workflows` in a separate script.
To support multiple restraint types it would also make far more sense to add a restraint option that can take matching keywords, e.g. `restraint=option`, where `option` is one of `None`, `"backbone"`, `"heavy"`, `"all"`, etc., but this would break the current API, unless we also kept the (now redundant) `restrain_backbone` option and ignored it when anything else is set.
BSSrunFEP.py: Since we have previously written the input files for a perturbation, it could be a direct call to an FEP engine. It would need logic to map `$SLURM_ARRAY_TASK_ID` to a lambda value for a free/bound leg (see the sketch after this list). Parsing 'protocol.dat' would tell us which engine is to be used. We would still need to write down a simulation config file (see below).
BSSanalyseFEP.py: Could also do the same.
BSSprepFEP.py: Could write down the run and analysis commands based on the chosen engine, effectively ending the use of BSS in the pipeline at this stage.
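As a sketch of the task-to-window mapping (the window count and leg ordering are assumptions):

```python
import os

# Hypothetical layout: 11 lambda windows for the free leg, then 11 for
# the bound leg, submitted as a single SLURM array of 22 tasks.
lam_values = [i / 10 for i in range(11)]

task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])
leg = "free" if task_id < len(lam_values) else "bound"
lam = lam_values[task_id % len(lam_values)]

print(f"Task {task_id}: running the {leg} leg at lambda = {lam:.2f}")
```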
That leaves us with supporting various types of restraints in BSSligprep.py. Experience suggests that a multi-staged equilibration, where restraints are gradually removed, is important for robust processing of inputs (see how FESetup does it). One option is to rely on a single MD engine for minimisation/equilibration (OpenMM or SOMD, as they are bundled with BSS), and then switch to the chosen FEP engine at the BSSprepFEP.py stage. We could generalise later.
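A staged equilibration might look like this, assuming the proposed `restraint` keyword is added to `BSS.Protocol.Equilibration` (the stage list and runtimes are illustrative):

```python
import BioSimSpace as BSS

# 'system' is a previously loaded, solvated system. Gradually release
# the restraints, FESetup-style: all heavy atoms first, then just the
# backbone, then none. (Assumes the proposed keyword.)
for restraint in ["heavy", "backbone", None]:
    protocol = BSS.Protocol.Equilibration(
        runtime=100 * BSS.Units.Time.picosecond,
        restraint=restraint,
    )
    process = BSS.MD.run(system, protocol)
    process.wait()
    system = process.getSystem()
```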
Let me know what you think.
Yes, the first three bullets were effectively what I was thinking, adding some logic to go from the SLURM indexes to the specific legs.
I agree that restraints are important and it would be good to do this properly. I would like a more robust approach similar to what I suggested in my previous post. I already have the logic to translate backbone restraints for all engines that support them, and I could start working on doing the same for other restraints too. (If we agree on what we want to add.) I don't think this would be too bad, although I think some refactoring of the code would be needed.
I would also much prefer a more flexible `restraint` keyword as described above. This would probably mean deprecating the old option, which would be fine if we document the change, and we could always wrap a call to `restraint="backbone"` if the user passes `restrain_backbone=True`.
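Something like this hypothetical shim would keep old scripts working:

```python
import warnings

def _restraint_from_legacy(restraint=None, restrain_backbone=False):
    """Hypothetical helper mapping the deprecated flag onto the new keyword."""
    if restrain_backbone:
        warnings.warn(
            "'restrain_backbone' is deprecated, use restraint='backbone' instead.",
            DeprecationWarning,
        )
        # Ignore the legacy flag if the new keyword is set explicitly.
        if restraint is None:
            restraint = "backbone"
    return restraint
```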
Let me know if you want me to start playing around with additional restraints and I'll start that ASAP.
Ok, let's focus on the restraints. I will flesh out the run and analysis scripts tonight with SOMD-specific commands. I will need some help with the GROMACS commands; presumably they can be extracted from what BSS would have written. We also need a good way to locate this BSS dependency. SOMD is easier, since we can assume it may be found in $BSSHOME. You can follow the equilibration protocol of FESetup for the sequence of restraints if you need more specific details. I can post links to relevant parts of the code tonight if you need pointers.
No problem. BioSimSpace will autodetect the path to GROMACS, so it can write it out for us when the user specifies the engine. We can get the path using `free_nrg.runner.processes[0].exe()` and the specific run arguments using `free_nrg.runner.processes[0].getArgs()`.
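For example (a sketch; `free_nrg` is an existing `FreeEnergy` object set up earlier):

```python
# Query the first process in the runner for its executable and arguments,
# which could then be pasted into the batch scripts.
process = free_nrg.runner.processes[0]

print(process.exe())      # e.g. the full path to gmx or somd-freenrg
print(process.getArgs())  # the engine-specific arguments for this window
```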
If possible, it would be good to agree on what types of restraints we want to support. In your example, I see things like the following:
NVT equilibration with restraints on backbone atoms and ligand
I'm not sure that something this flexible would be easy to support across all engines, especially if just using keyword options for the restraint. Ideally we could use the search functionality to generate the list of atom indices involved in the restraint and pass those as an option instead. (Still allowing less flexible keywords too, so the `restraint` option could be keywords, or a list of atom indices.)
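A sketch of what that might look like (the search string is illustrative, and passing a list of indices is the proposed extension, not current behaviour):

```python
import BioSimSpace as BSS

# 'system' is a previously loaded system. Turn a selection into an
# explicit list of atom indices, then pass those as the restraint.
selection = system.search("not water")
indices = [atom.index() for atom in selection]

protocol = BSS.Protocol.Equilibration(restraint=indices)
```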
Just to point out how fiddly all this is: since we preserve atom naming by default (one of our core principles), writing general restraint files is a real pain. For example, if you take an AMBER format system and use `gmx genrestr` to apply the restraints (as we do to make things easier), then you end up with errors such as the following:
'gmx grompp' reported the following warnings:

WARNING 1 [file gromacs.top, line 256]:
1890 non-matching atom names
atom names from /home/lester/Code/BioSimSpace/demo/tmp/gromacs.top will be used
atom names from /home/lester/Code/BioSimSpace/demo/tmp/gromacs.gro will be ignored
It doesn't complain when not adding restraints, or when passing a regular GROMACS system and using restraints. This makes me think that we'll have to handle all the atom index finding / masking, etc. manually for all packages and do everything explicitly by index. This would mean that we would also need to come up with robust ways of using the search functionality for detecting all types of atoms that we would want to restrain.
I've created an issue thread for the discussion of equilibration restraints here.
The ligand restraint is useful in case the initial pose has steric clashes and we don't want an early equilibration to push the ligand away from its starting binding mode. Basically, we aim to relax the protein and solvent a bit around the starting pose, in the hope that the ligand will remain in place once the restraints are released.
If we cannot easily implement restraints that select ligand atoms we could skip this for now.
From your post in #203 it looks like it could be some work to support flexible restraints for all supported MD engines. I suggest we prioritise support for OpenMM and GROMACS initially, to allow delivery of the FEP use case, and fill in support for the other engines later.
One downside of using a shell script for BSSanalyse.sh that just calls `gmx bar` or `analyse_freenrg` directly is that the pipeline is not modular at this stage. We might have wanted to run an FEP with SOMD or GROMACS and then use the same analysis tool to process the previous outputs consistently when estimating a free energy. I will proceed with that route for now, but we should make a note of this issue.
The script could detect the format of the output in the directories and select the appropriate analysis tool. I guess this would provide the flexibility that we want in the short term, i.e. being able to run different bits with different engines and combine results.
(Apologies if I've misunderstood the problem.)
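A rough sketch of that detection logic (the file patterns are assumptions about what each engine leaves behind):

```python
import glob
import os
import subprocess

def analyse_leg(work_dir):
    """Pick an analysis tool based on the output found in work_dir."""
    somd_out = glob.glob(os.path.join(work_dir, "lambda_*", "simfile.dat"))
    gmx_out = glob.glob(os.path.join(work_dir, "lambda_*", "*.xvg"))
    if somd_out:
        # SOMD output: analyse with analyse_freenrg (MBAR).
        subprocess.run(["analyse_freenrg", "mbar", "-i"] + somd_out, check=True)
    elif gmx_out:
        # GROMACS output: analyse the dH/dl files with gmx bar.
        subprocess.run(["gmx", "bar", "-f"] + gmx_out, check=True)
    else:
        raise ValueError(f"No recognised FEP output found in {work_dir}")
```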
I've now added some functionality so that you can do a static analysis of an existing simulation using `BioSimSpace.FreeEnergy.analyse(work_dir)`. It will figure out everything from the information in the `work_dir`, i.e. the type of simulation, the engine that was used, etc. This means that you could run the simulations outside of BioSimSpace, e.g. on a cluster, then go back into BioSimSpace to perform the analysis afterwards.
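For example (the path is illustrative):

```python
import BioSimSpace as BSS

# Static analysis of a working directory produced by runs performed
# outside of BioSimSpace, e.g. copied back from the cluster.
result = BSS.FreeEnergy.analyse("path/to/work_dir")
```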
I've also added a `BioSimSpace.FreeEnergy.getData()` method, which can be used to zip up just the files that are needed for analysis, while preserving the folder structure of the `work_dir`. This could be used when someone does use `BioSimSpace.FreeEnergy.run()` but wants to copy back a subset of the output files for analysis, i.e. excluding all the config files, trajectories, etc.
I've added another update so that `getData` is static, meaning you can use it on an existing working directory without needing to instantiate another `FreeEnergy` object.
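For example (the path is illustrative; the call follows the method quoted above):

```python
import BioSimSpace as BSS

# Zip up just the analysis files from an existing working directory,
# preserving its folder structure.
BSS.FreeEnergy.getData("path/to/work_dir")
```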
(N.B. The Sire macOS build from this morning timed out so all BioSimSpace macOS builds will fail until the re-run of the Sire build completes.)