michellab / BioSimSpaceTutorials


FEP tutorial comments #1

Closed jmichel80 closed 1 year ago

jmichel80 commented 3 years ago

Hi @JenkeScheen ,

A few comments/questions after going through your setup notebook:

JenkeScheen commented 3 years ago

Hi @jmichel80 , Thanks for the input, I'll get on this after the weekend.

jmichel80 commented 3 years ago

Hi @JenkeScheen ,

lohedges commented 3 years ago

Hi both,

Here are a few comments:

The notebook assumes the user has access to parameterised ligands (in vacuum) that have coordinates matching those of the parameterised protein input. We need separate BSS nodes to generate this type of input starting from 'sane' PDB/mol2 files.

Yes, we'll have standalone nodes for parameterisation and solvation. For the purposes of the tutorial I am assuming that you'll just have the pre-prepared systems, so you would just talk about those stages and what the input/output would be, rather than actually running them. I don't think we will be able to use nodes for minimisation or equilibration, though, since those would lose the concept of a system containing a merged molecule when writing the output to file. (We could just use the coordinates from the minimised/equilibrated systems and run system._updateCoordinates to copy them into the FEP system, but this would need us to reconstruct the system before doing so.)

The inputs haven't been energy minimised/equilibrated before starting the production runs. This may cause SOMD to crash. It would be better to equilibrate the solvated ligand0:complex, and then use a snapshot after equilibration to generate the merged molecule.

You can run minimisation / equilibration on a system containing a merged molecule. It simply creates the actual system at lambda = 0, then copies the coordinates back into the merged system when you call process.getSystem(). As mentioned above, this would work if you ran minimisation and equilibration in a single script/node prior to running the FEP simulation, but wouldn't work in standalone scripts/nodes since you would lose the merged system. (Currently we can only write to file, rather than being able to reconstruct it.)

The slurm submission script will run each lambda value serially. It would be better to run the lambda windows in parallel to decrease time to answer.

If we want to run lambda windows in parallel then we are going to need to do this outside of BioSimSpace, i.e. just use it for the setup of the FEP simulation, as is done by prepareFEP. We could then provide example batch scripts (tested on UCB systems) to show how to run each window in parallel using a job array, or similar. I also think it would be easier to provide a script that runs analyse_freenrg for each ligand pair. This would avoid the need to go back into BioSimSpace for this single stage afterwards.

The ProcessRunner that is used by FreeEnergy.run can run jobs in parallel, but this has only been tested on multi CPU jobs, and not on an HPC system. I think it could work if there was a multi-GPU system configured using CUDA_VISIBLE_DEVICES, but I think UCB have single GPU nodes with exclusive access.

I'm still a little confused as to whether we will be assuming that the user will be setting up the FEP job on the system, or whether they are doing this locally and copying directories across. If the latter, then I don't think we're going to be able to reconstruct the FreeEnergy object, but it would be fine if we did the submission script with parallelised somd_freenrg jobs, which is the way most people actually run these things in practice.
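
To make the "submission script with parallelised somd_freenrg jobs" idea concrete, here is a minimal sketch of a SLURM job-array script where each array task runs one lambda window. The window values, directory layout, command name and flags are all assumptions for illustration, not something from the tutorial repository:

```shell
#!/bin/bash
#SBATCH --array=0-10
#SBATCH --gres=gpu:1
# Illustrative only: each array task picks one lambda window and runs it.
# Directory names (lambda_X.X) and somd-freenrg flags are assumptions.

LAMBDAS=(0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0)
LAM=${LAMBDAS[$SLURM_ARRAY_TASK_ID]}

cd lambda_${LAM}
somd-freenrg -C somd.cfg -l ${LAM} -p CUDA
```

With exclusive single-GPU nodes, each array task would land on its own node, which sidesteps the CUDA_VISIBLE_DEVICES question above.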

lohedges commented 3 years ago

Just reading this bit again...

It would be better to equilibrate the solvated ligand0:complex, and then use a snapshot after equilibration to generate the merged molecule.

Yes, that would work, i.e. equilibrating the complex prior to performing the merge. (This is what happens in your diagram, I think.) I imagine that this might confuse some people, i.e. solvating, equilibrating, extracting the molecules, merging, then solvating again. However, it's probably the easiest way if we want to decouple the stages.

jmichel80 commented 3 years ago

Hi, I have given more thought to the best way to implement the FEP pipeline. From a user standpoint, once it has been decided how to process a dataset there shouldn't be a need for further intervention until the results are available for processing. The complexity is that executing the pipeline efficiently requires using different types of resources at different stages.

I have sketched the execution model here

[Figure: suggested workflow]

We can solve this in SLURM using a mixture of jobs, job arrays, job dependencies and a bit of bash scripting.
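
A minimal sketch of how those pieces could be chained with `--dependency`. The stage script names are illustrative (only BSSanalyse.sh is mentioned later in this thread), and the array size is arbitrary:

```shell
#!/bin/bash
# Illustrative chaining of pipeline stages with SLURM job dependencies.

# Stage 1: setup (single job).
setup_id=$(sbatch --parsable BSSsetup.sh)

# Stage 2: one FEP job per lambda window, started only if setup succeeded.
fep_id=$(sbatch --parsable --dependency=afterok:${setup_id} \
         --array=0-10 BSSrunFEP.sh)

# Stage 3: analysis, started once every array task has finished successfully.
sbatch --dependency=afterok:${fep_id} BSSanalyse.sh
```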

We can build in some degree of resilience by having each script do some basic error checking (is the expected input present?). Ideally the pipeline is structured such that it could be executed again without repeating steps that completed successfully.
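
As a sketch of that resilience idea, each step could check its inputs exist and skip itself if its outputs are already present. The function and argument names here are hypothetical; the real scripts would wrap calls into BioSimSpace or an MD engine:

```python
import os

def run_step(name, inputs, outputs, action):
    """Run one pipeline step, skipping it if its outputs already exist.

    Illustrative only: `action` stands in for whatever work the real
    step script would do.
    """
    # Basic error checking: is the expected input present?
    missing = [f for f in inputs if not os.path.exists(f)]
    if missing:
        raise FileNotFoundError(f"{name}: missing input file(s): {missing}")

    # Resilience: don't repeat a step that completed successfully.
    if all(os.path.exists(f) for f in outputs):
        return "skipped"

    action()
    return "ran"
```

Re-running the whole pipeline then becomes cheap, since completed steps return immediately.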

To keep it compatible with the BSS philosophy it should be possible to execute the pipeline with different FEP engines (SOMD or GROMACS as it stands). I think this is possible as the user will specify the chosen execution engine in the planFEP notebook.

I have uploaded a skeleton SLURM bash script that implements the execution model on plato. I have verified that each step is executed with correct dependencies using dummy python scripts. See https://github.com/michellab/BioSimSpaceTutorials/tree/fep/03_fep/execution_model

Next steps:

- If the above can be implemented, the next step would be for @JenkeScheen to implement the required calls in the four different BSS execution scripts. The setup notebook would also have to be modified to output the files 'ligands.dat', 'network.dat' and 'protocol.dat' and to write the SLURM bash scripts.
- If we can successfully deploy this approach on plato on a real dataset, the next step would be to write an equivalent LSF script that could work on UCB's cluster.

lohedges commented 3 years ago

Hmm, there are a lot of things here that aren't currently supported and would need to be implemented / tested if we want to do this in a clean way.

lohedges commented 3 years ago

To support multiple restraint types it would also make far more sense to add a restraint option that can take matching keywords, e.g. restraint = option, where option = None, backbone, heavy, all, etc, but this would break the current API, unless we also kept the (now redundant) restraint_backbone option and ignore it when anything else is set.
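
A sketch of how that backwards-compatible mapping could look. The option names mirror the discussion above but are not the actual BioSimSpace API:

```python
import warnings

def normalise_restraint(restraint=None, restrain_backbone=False):
    """Map the legacy boolean flag onto the proposed `restraint` keyword.

    Hypothetical helper: keyword values follow the suggestion above
    (None, "backbone", "heavy", "all").
    """
    allowed = (None, "backbone", "heavy", "all")
    if restraint not in allowed:
        raise ValueError(f"restraint must be one of {allowed}")

    if restrain_backbone:
        warnings.warn(
            "restrain_backbone is deprecated; use restraint='backbone'",
            DeprecationWarning,
        )
        # Ignore the legacy flag when anything else is set explicitly.
        if restraint is None:
            restraint = "backbone"

    return restraint
```

This keeps old scripts working while steering users towards the new keyword.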

jmichel80 commented 3 years ago

Let me know what you think.

lohedges commented 3 years ago

Yes, the first three bullets were effectively what I was thinking, adding some logic to go from the SLURM indexes to the specific legs.

I agree that restraints are important and it would be good to do this properly. I would like a more robust approach similar to what I suggested in my previous post. I already have the logic to translate backbone restraints for all engines that support them, and I could start working on doing the same for other restraints too. (If we agree on what we want to add.) I don't think this would be too bad, although I think some refactoring of the code would be needed.

I would also much prefer a more flexible restraint keyword as described above. This would probably mean deprecating the old option, which would be fine if we document this change and we could always wrap a call to restraint = backbone if the user passes restrain_backbone = True.

Let me know if you want me to start playing around with additional restraints and I'll start that ASAP.

jmichel80 commented 3 years ago

Ok let’s focus on the restraints. I will flesh out the run and analysis scripts tonight with SOMD specific commands. Will need some help with the Gromacs commands, presumably they can be extracted from what BSS would have written. We also need a good way to locate this BSS dependency. SOMD is easier since we can assume it may be found in $BSSHOME. You can follow the equilibration protocol of FESetup for the sequence of restraints if you need more specific details. Can post links to relevant parts of the code tonight if you need pointers.

lohedges commented 3 years ago

No problem. BioSimSpace will autodetect the path to GROMACS so it can write it out for us when the user specifies the engine. We can get the path using free_nrg.runner.processes[0].exe() and can get the specific run arguments using free_nrg.runner.processes[0].getArgs().

lohedges commented 3 years ago

If possible, it would be good to agree on what type of restraints we want to support. In your example, I see things like the following:

NVT equilibration with restraints on backbone atoms and ligand

I'm not sure that something this flexible would be easy to support across all engines, especially if just using keyword options for the restraint. Ideally we could use the search functionality to generate the list of atom indices involved in the restraint and pass those as an option instead. (Still allowing less flexible keywords too. So the restraint option could be keywords, or a list of atom indices.)
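
A small sketch of a restraint argument that accepts either a keyword or an explicit list of atom indices, as suggested above. The keyword set and return shape are illustrative only:

```python
def parse_restraint(restraint):
    """Accept either a restraint keyword or a list of atom indices.

    Hypothetical helper sketching the API discussed above.
    """
    keywords = ("backbone", "heavy", "all")
    if restraint in keywords:
        return ("keyword", restraint)
    if (
        isinstance(restraint, (list, tuple))
        and restraint
        and all(isinstance(i, int) and i >= 0 for i in restraint)
    ):
        # De-duplicate and sort so downstream code sees a canonical form.
        return ("indices", sorted(set(restraint)))
    raise ValueError(
        f"restraint must be one of {keywords} or a list of atom indices"
    )
```

The index list could itself be produced by the search functionality, so both routes converge on the same internal representation.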

Just to point out how fiddly all this is. Since we preserve atom naming by default (one of our core principles) it makes writing general restraint files a real pain. For example, if you take an AMBER format system and use gmx genrstr to apply the restraints (as we do to make things easier), then you end up with errors such as the following:

'gmx grompp' reported the following warnings:
WARNING 1 [file gromacs.top, line 256]:
  1890 non-matching atom names
  atom names from /home/lester/Code/BioSimSpace/demo/tmp/gromacs.top will be
  used
  atom names from /home/lester/Code/BioSimSpace/demo/tmp/gromacs.gro will
  be ignored

It doesn't complain when not adding restraints, or when passing a regular GROMACS system and using restraints. This makes me think that we'll have to handle all the atom index finding / masking, etc. manually for all packages and do everything explicitly by index. This would mean that we would also need to come up with robust ways of using the search functionality for detecting all types of atoms that we would want to restrain.

lohedges commented 3 years ago

I've created an issue thread for the discussion of equilibration restraints here.

jmichel80 commented 3 years ago

The ligand restraint is useful in case the initial pose has steric clashes and we don't want an early equilibration to push the ligand away from its starting binding mode. Basically, we aim to relax the protein and solvent a bit around the starting pose in the hope that the ligand will remain in place once restraints are released.

If we cannot easily implement restraints that select ligand atoms we could skip this for now.

From your post for #203 it looks like it could take some work to support flexible restraints for all supported MD engines. I suggest we prioritise support for OpenMM and GROMACS initially to allow delivery of the FEP use case, and fill in support for other engines later.

jmichel80 commented 3 years ago

One downside of using a shell script for BSSanalyse.sh that directly calls gmx mbar or analyse_freenrg is that the pipeline is not modular at this stage. We might want to run an FEP with SOMD or GROMACS and then use the same analysis tool to process the previous outputs consistently to estimate a free energy. I will proceed with that route for now, but we should make a note of this issue.

lohedges commented 3 years ago

The script could detect the format of the output in the directories and select the appropriate analysis tool. I guess this would provide the flexibility that we want in the short-term, i.e. being able to run different bits with different engines and combine results.

(Apologies if I've misunderstood the problem.)

lohedges commented 3 years ago

I've now added some functionality so that you can do static analysis of an existing simulation using BioSimSpace.FreeEnergy.analyse(work_dir). It will figure out everything from the information in the work_dir, i.e. the type of simulation, the engine that was used, etc. This means that you could run the simulations outside of BioSimSpace, e.g. on a cluster, then go back into BioSimSpace to perform the analysis afterwards.
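
The detection idea can be sketched by inspecting the files in the working directory. The file patterns here are pure assumptions for illustration; the real BioSimSpace.FreeEnergy.analyse inspects the directory itself:

```python
import glob
import os

def detect_engine(work_dir):
    """Guess which engine wrote an FEP working directory from its files.

    Hypothetical sketch: simfile.dat for SOMD and .xvg output for
    GROMACS are illustrative markers, not the actual detection logic.
    """
    if glob.glob(os.path.join(work_dir, "**", "simfile.dat"), recursive=True):
        return "SOMD"
    if glob.glob(os.path.join(work_dir, "**", "*.xvg"), recursive=True):
        return "GROMACS"
    raise ValueError(f"Could not detect the engine used in {work_dir}")
```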

I've also added a BioSimSpace.FreeEnergy.getData() method, which can be used to zip up just the files that are needed for analysis, while preserving the folder structure of the work_dir. This could be used when someone does use BioSimSpace.FreeEnergy.run() but wants to copy back a subset of the output files for analysis, i.e. excluding all the config files, trajectories, etc.
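
The "zip a subset while preserving the folder structure" behaviour can be sketched with the standard library. The filename patterns are assumptions for illustration; getData decides for itself which files each engine needs:

```python
import fnmatch
import os
import zipfile

def zip_analysis_files(work_dir, archive, patterns=("*.dat", "*.xvg")):
    """Zip only the files matching `patterns`, keeping the folder layout.

    Hypothetical sketch of the getData idea; the pattern list is
    illustrative, not the real selection logic.
    """
    with zipfile.ZipFile(archive, "w") as zf:
        for root, _, files in os.walk(work_dir):
            for name in files:
                if any(fnmatch.fnmatch(name, p) for p in patterns):
                    path = os.path.join(root, name)
                    # Store paths relative to work_dir to preserve structure.
                    zf.write(path, os.path.relpath(path, work_dir))
    return archive
```

Trajectories and config files simply never match, so they stay behind on the cluster.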

lohedges commented 3 years ago

I've added another update: getData is now static, so you can use it on an existing working directory without needing to instantiate another FreeEnergy object.

(N.B. The Sire macOS build from this morning timed out so all BioSimSpace macOS builds will fail until the re-run of the Sire build completes.)