lohedges opened this issue 3 years ago
You could use the HILLS file instead? It's got all the same information.
For OpenMM I think we write to both at the same frequency so it's not an issue to use either. Normally, I write the COLVAR more frequently so it's useful to get a fine-grained time series when running interactively.
The NumPy binary format is fully portable so it's no problem to use that instead in this case.
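As a quick illustration of the point above, a round trip through the `.npy` format preserves the array exactly. The filename and sample data are made up for the example:

```python
import numpy as np

# A small COLVAR-style time series: (time, CV value) rows.
data = np.array([[0.0, 1.2], [0.5, 1.4], [1.0, 1.1]])

# The .npy binary format embeds dtype, shape and byte order, so it
# round-trips across platforms without any parsing ambiguity.
np.save("colvar_sample.npy", data)
restored = np.load("colvar_sample.npy")

print(np.array_equal(data, restored))  # → True
```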
This seems to work. I just need to make some fixes to correctly handle multi-component CVs: I handle them when writing the PLUMED file, but not when parsing information back from the COLVAR and HILLS files.
Here's an example notebook showing how to use the PLUMED process wrapper to analyse existing data. Just copy the data from one of your runs into `work_dir` and run the notebook.
(I've just pushed changes so it will take a while for a conda package to build.)
I've also bound the PLUMED functions to `Process.OpenMM` (like I already do for GROMACS) so you can also perform analysis while the simulation is running. Here is a link to the documentation for the notebook functions so you can see how to customise the plots, e.g. changing labels, etc. (Customisation is limited, but functional.) Labels default to the unit of any value if no string is passed.
Great work, Lester. Using your NB I tried my own inputs and hit one small issue: the FES plot isn't getting the x limits correctly, so the plot is truncated. We still need a couple of functions that calculate the funnel correction term and plot the time evolution of the binding free-energy estimate, to check whether the metadynamics simulation has converged. I've got all the data and examples of how I do it.
I've fixed the plot issue. I was setting the upper limit of my x range to the max of y. Previously I had only plotted torsional data, which is the same range in x and y.
I'll look at implementing the correction term too. Thanks for the detailed analysis scripts, they are very easy to follow.
You can now get the funnel correction factor using `cv.getCorrection()`, where `cv` is an object of type `BioSimSpace.Metadynamics.CollectiveVariable.Funnel`. By default it uses the last 5 nanometers along the projection axis, using the upper bound as the limit. This can be changed by passing in values for `x_min` and `x_max`.
We can already get time series of the free-energy projections by passing the `stride` option to `plumed.getFreeEnergy`. For example:

```python
free_nrg = plumed.getFreeEnergy(0, stride=10)
```
This would get the one-dimensional free energy for the first CV component at intervals of 10. The resulting object would be a list containing the free-energy data at each time sample (x and y values with units). You could then use these to check for convergence. At present `BioSimSpace.Notebook.plot` only handles single data sets, although I might be able to update this to plot multiple y data against the same x. However, it would be easy to just plot the last two data sets to compare them, e.g.:
```python
free_nrg = plumed.getFreeEnergy(0, stride=10)
BSS.Notebook.plot(free_nrg[-2][0], free_nrg[-2][1])
BSS.Notebook.plot(free_nrg[-1][0], free_nrg[-1][1])
```
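For a rough convergence check on such a list of snapshots, one could track how the depth of the free-energy minimum changes between samples. This is only a sketch with synthetic plain-float data standing in for the unit-carrying BioSimSpace objects:

```python
# Each entry mimics one (x, y) snapshot from a strided free-energy
# series, with units stripped for the illustration.
snapshots = [
    ([0.0, 0.5, 1.0], [0.0, -10.0, -2.0]),
    ([0.0, 0.5, 1.0], [0.0, -11.5, -2.5]),
    ([0.0, 0.5, 1.0], [0.0, -11.6, -2.6]),
]

# The minimum of each profile; a plateau in this series suggests
# the metadynamics estimate is converging.
minima = [min(y) for _, y in snapshots]
drift = abs(minima[-1] - minima[-2])

print(minima)       # → [-10.0, -11.5, -11.6]
print(drift < 0.5)  # → True
```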
I've tweaked a few parameter names in the above functions for clarity: `x_min` is now `proj_min` and `x_max` is now `proj_max`. I've also added a method `cv.getExtent(proj)` that can get the funnel coordinate projected on the extent axis given a value along the projection axis.
Awesome job, Lester. Looks like the integration stride is working, but some of the indices of the integrated free energies are wrong, i.e. I'm getting some weird spikes when you do a dG vs time plot. This sounds like an easy fix.
Here is the NB that shows the problem. You need to run in the same directory I shared a few days ago with the data.
Yes, this is easy to fix. I'm sorting the `fes*.dat` files, which puts them in the wrong order since the extensions aren't zero-padded. I'll let you know when I've pushed a fix.
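The zero-padding issue can be avoided by sorting on the numeric part of the extension rather than lexicographically. A minimal sketch (the `fes.N.dat` naming is from the thread; the helper name is mine):

```python
import re

# Lexicographic sorting puts fes.10.dat before fes.2.dat because the
# extensions aren't zero padded. Sorting on the parsed integer fixes this.
files = ["fes.10.dat", "fes.2.dat", "fes.1.dat", "fes.0.dat"]

def fes_index(name):
    # Extract the integer between "fes." and ".dat".
    return int(re.search(r"fes\.(\d+)\.dat", name).group(1))

ordered = sorted(files, key=fes_index)
print(ordered)  # → ['fes.0.dat', 'fes.1.dat', 'fes.2.dat', 'fes.10.dat']
```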
Should now be fixed.
Lester, could you help me out with turning one of my all-in-one BSS fun-metaD scripts into something more coherent and modular? I still haven't fully grasped the concept of a node, so your input would be really helpful. This is the script I used to generate the fun-metaD files for the analysis tutorial. It's from my branch of the tutorial repo.
Sure thing. I'll try converting it into a node later today and send you an update tomorrow morning.
Here is an example node, both in notebook and script format. Note that it's been generalised to work with any supported molecular dynamics engine, and the input is assumed to be a solvated and equilibrated protein-ligand complex. (The user could use other nodes to parameterise, solvate, minimise, equilibrate, etc.)
I'm not sure if we want to be this general for your case, e.g. if you wanted to ship a completely standalone script. In that case, you could add additional requirements to specify things such as the force fields, water model, etc.
You can run the `fun_metad.py` script from the command line, e.g.:
```
python fun_metad.py --help

usage: fun_metad.py [-h] [-c CONFIG] [-v [VERBOSE]]
                    [--export-cwl [EXPORT_CWL]]
                    [--strict-file-naming [STRICT_FILE_NAMING]] --files FILES
                    [FILES ...] [--runtime RUNTIME]
                    [--hill_height HILL_HEIGHT] [--bias_factor BIAS_FACTOR]
                    [--engine {Amber,Gromacs,OpenMM,auto}]
                    [--work_dir WORK_DIR]

Perform funnel metadynamics on a solvated protein-ligand complex.

Args that start with '--' (e.g. --arg) can also be set in a config file
(specified via -c). The config file uses YAML syntax and must represent a YAML
'mapping' (for details, see http://learn.getgrav.org/advanced/yaml). If an arg
is specified in more than one place, then commandline values override config
file values which override defaults.

Output:
  final: FileSet        The final system, in the original format plus a PDB
                        file.

Required arguments:
  --files FILES [FILES ...]
                        A set of molecular input files.

Optional arguments:
  -h, --help            Show this help message and exit.
  -c CONFIG, --config CONFIG
                        Path to configuration file.
  -v [VERBOSE], --verbose [VERBOSE]
                        Print verbose error messages.
  --export-cwl [EXPORT_CWL]
                        Export Common Workflow Language (CWL) wrapper and exit.
  --strict-file-naming [STRICT_FILE_NAMING]
                        Enforce that the prefix of any file based output
                        matches its name.
  --runtime RUNTIME     The run time.
                        units=Nanosecond
                        default=100.0000 ns
                        min=5.0000 ns, max=500.0000 ns
  --hill_height HILL_HEIGHT
                        The hill height.
                        units=Kilo joules per mol
                        default=1.5000 kJ/mol
                        min=1.0000 kJ/mol, max=10.0000 kJ/mol
  --bias_factor BIAS_FACTOR
                        The bias factor for well-tempered metadynamics.
                        default=10.0
                        min=1.0, max=100.0
  --engine {Amber,Gromacs,OpenMM,auto}
                        The molecular dynamics engine.
                        default=auto
  --work_dir WORK_DIR   The working directory for the simulation.
                        default=fun_metad
```
(Note that, for some of the example systems, GROMACS doesn't work unless you pass `ignore_warnings=True` to the `BSS.Metadynamics.run` constructor, since the system is charged. Perhaps this isn't an issue for the system used in the tutorial.)
Note that I haven't fully tested the above, it's just to give you an idea of what's possible. The notebook version could also include full documentation and even graphics, which would be ignored when run from the command-line, e.g. you could view the funnel, or the final system.
Like I mentioned in the meeting, the restarts for fun-metaD with OpenMM didn't work, so I had a go at redoing a lot of it. Check out the ZIP I'm attaching with the old and the new `openmm.py`. The biggest change is that we'll use XML files for restarts instead of checkpoint files, as they are more flexible. XML files take longer to write, so I made sure we write them less often, every 100 ps or so. Let me know if something is unclear.
Thanks, Dom. This looks great! I think I'll make the regular PLUMED restart implementation consistent with this, i.e. just perform a restart when existing files are in the working directory. At present, you can pass the paths to existing HILLS and GRID files when setting up the protocol. My idea was that the user might want to store files from existing simulations and re-run them later, or copy files between machines, etc. It makes more sense just to re-use the same files in the existing directory if you are re-running on the same HPC system.
Great, I've made one more change that tracks the steps that have been logged so far. I only tested the previous script by restarting it once, but we want this to be useful for any number of restarts. Here's the `openmm.py` with the latest changes.
I'll try to add similar logic to the OpenMM production protocol too.
Just checking the logic here:
```python
steps_so_far = 0
if os.path.isfile('openmm.xml'):
    simulation.loadState('openmm.xml')
    is_restart = True
    print('Loading a restart')
    log = np.loadtxt('openmm.log')
    steps = log[:, 0]
    record_log_every = 1000
    # Find indices where a restart happened, i.e. where the step
    # counter reset to the first report interval.
    restart_indices = np.where(steps == record_log_every)[0]
    for idx in restart_indices[1:]:
        # Look for how many steps were logged before the restart.
        steps_so_far += int(steps[idx - 1])
    # Finally, add the last logged number of steps.
    steps_so_far += int(steps[-1])
else:
    is_restart = False
total_steps = 50000000
```
Here `record_log_every` would be obtained from the protocol object, i.e. the report interval, so it would break if the user tried a restart with a different reporting frequency. I guess we could check the log file to make sure that the spacing between entries is always consistent.
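One way to implement that consistency check is to look at the spacing between logged steps, skipping the negative jumps where the counter reset at a restart. A sketch (the interval value and step data are illustrative):

```python
import numpy as np

# Steps column from a log assembled across restarts. After each
# restart the counter resets, so a raw difference can go negative;
# the check below only validates spacing within each run segment.
steps = np.array([1000, 2000, 3000, 1000, 2000])

diffs = np.diff(steps)
interval = 1000

# Within a segment the spacing must equal the report interval;
# negative differences mark restart boundaries and are skipped.
consistent = all(d == interval for d in diffs if d > 0)
print(consistent)  # → True
```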
Also, having loaded the XML state in OpenMM, does the state reporter not continue from the existing point, i.e. does it already know the accumulated steps and time? Here it looks like you are searching for the first report interval (1000 steps) in the log file to get all of the points at which a restart occurred, then accumulating the total steps yourself. If OpenMM does start from zero again this would mess up my reporting of time series data and I'd need to add some extra logic to fix the lists of steps and time values that would be returned to the user. (Perhaps this is one difference between using `saveCheckpoint` and `saveState` in OpenMM.) Maybe the state reporter info is also serialised, so we don't need to recreate it when performing a restart?
`record_log_every` should be equal to the `openmm.log` write frequency. I suppose if the user changed the frequency between restarts that would break things.

I don't think the binary checkpoint file tracked time or steps, but now that you mention it, I checked the `openmm.xml` contents and it does tell you the write time!
```xml
<State openmmVersion="7.4.2" time="3201.9999999381066" type="State" version="1">
```
That makes things much easier, just look inside the XML file and figure out the elapsed number of steps.
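For example, the elapsed time can be pulled straight from the root attributes with the standard library and converted to steps given the timestep. The trimmed-down state string and the 4 fs timestep are taken from this thread; real state files contain much more:

```python
import xml.etree.ElementTree as ET

# A trimmed-down OpenMM state file; the elapsed time (in ps) lives
# in the root element's attributes.
xml_text = ('<State openmmVersion="7.4.2" time="3201.9999999381066" '
            'type="State" version="1"></State>')

root = ET.fromstring(xml_text)
elapsed_ps = float(root.attrib["time"])

# With a 4 fs (0.004 ps) timestep, convert elapsed time to steps.
timestep_ps = 0.004
steps_so_far = round(elapsed_ps / timestep_ps)
print(steps_so_far)  # → 800500
```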
> If OpenMM does start from zero again this would mess up my reporting of time series data and I'd need to add some extra logic to fix the lists of steps and time values that would be returned to the user.
I don't know if you can tell OpenMM to restart from step X (as read from the state file); I think each time you create a simulation object, it just starts the step count from zero.
Thanks for the info. Since we're working in Python land I imagine that it will be possible to set some attribute of the simulation object (probably private) to specify the starting step and time. If not, we should be able to just monkey-patch the state reporter so that it appends the correct values, i.e. offsetting them by the final step and time from the previous run. I'll play around on Monday.
The `context` member of the `simulation` object has a `setTime` method. There is also `setParameter`, which takes a key-value pair, so could presumably be used to set the step too. I'll see if I can get something working. (I'm still surprised that these aren't loaded and set from the state, though.)
It turns out that the simulation time is already set correctly, i.e. it continues from the previous simulation. It's also very easy to set the step:
```python
if os.path.isfile('openmm.xml'):
    simulation.loadState('openmm.xml')
    with open('openmm.log') as f:
        lines = f.readlines()
    last_line = lines[-1].split()
    step = int(last_line[0])
    simulation.currentStep = step
```
I think I'll try to monkey-patch the state reporter so that it doesn't write the header for repeats. Alternatively, I'll let it write the header, then make sure that it's consistent with the first one that was found in the log file. This will make sure that the information from the repeats is consistent.
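A minimal sketch of the first option, writing the header only when the log is empty. The header string and filenames are illustrative, not OpenMM's actual reporter internals:

```python
import os

log_path = "openmm_demo.log"
header = '#"Step","Potential Energy (kJ/mole)"\n'

# Start from a clean file for the demonstration.
if os.path.exists(log_path):
    os.remove(log_path)

def append_report(path, header, row):
    # Only write the header when the log is missing or empty, so
    # any number of restarts appends rows under a single header.
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a") as f:
        if write_header:
            f.write(header)
        f.write(row)

append_report(log_path, header, "1000,-5000.0\n")
append_report(log_path, header, "2000,-5100.0\n")  # simulated restart

with open(log_path) as f:
    contents = f.read()

print(contents.count("#"))  # → 1
```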
Looks awesome! You could open a PR on OpenMM's repo; they'd be interested in incorporating this properly. See #3071.
Okay, I think I've almost got this working. A quick question from testing... Do you know why I always get the following error if the bias factor is set to 1?
```
Traceback (most recent call last):
  File "openmm.py", line 166, in <module>
    current_cvs = np.array(list(meta.getCollectiveVariables(simulation)) + [meta.getHillHeight(simulation)])
  File "/home/lester/Downloads/fixed_restarts/metadynamics.py", line 197, in getHillHeight
    currentHillHeight = self.height*np.exp(-energy/(unit.MOLAR_GAS_CONSTANT_R*self._deltaT))
  File "/home/lester/sire.app/lib/python3.7/site-packages/simtk/unit/quantity.py", line 406, in __truediv__
    return (self/other._value) / other.unit
  File "/home/lester/sire.app/lib/python3.7/site-packages/simtk/unit/quantity.py", line 409, in __truediv__
    return self * pow(other, -1.0)
ZeroDivisionError: 0.0 cannot be raised to a negative power
```
It looks like I'll need to figure this out, or set a different default value.
Setting it to anything above `1.0` works, e.g. `1.000001`, so I'll just do that. It looks like the well-tempered ΔT, which is proportional to (biasFactor − 1), vanishes at a bias factor of exactly 1, hence the division by zero.
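The failure mode can be made explicit with a toy version of the hill-height calculation, guarding against a bias factor of 1. The constants and function are simplified stand-ins, not OpenMM's actual code:

```python
import math

R = 8.314e-3   # gas constant, kJ/(mol K) (simplified, unit-free)
T = 300.0      # temperature, K

def hill_height(height, energy, bias_factor):
    # Well-tempered metadynamics: deltaT = (gamma - 1) * T, so a
    # bias factor of exactly 1 gives deltaT = 0 and a divide-by-zero.
    if bias_factor <= 1.0:
        raise ValueError("bias_factor must be > 1 for well-tempered runs")
    delta_t = (bias_factor - 1.0) * T
    return height * math.exp(-energy / (R * delta_t))

# With zero accumulated bias energy the hill keeps its full height.
print(hill_height(1.5, 0.0, 10.0))  # → 1.5
```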
I've pushed an update that implements restarts for the production and metadynamics protocols with OpenMM. I've done some basic testing, but could you check that it works as expected for your metadynamics runs?

I've also updated the way restarts are handled for the regular PLUMED implementation so that things are consistent. When I get time, I'll look at implementing something similar for the regular production (and possibly equilibration) protocols with the other engines. (Equilibration is trickier, since you might need to know, and be able to set, the current temperature.)
Just realised that I need to fix the hardcoded check for the checkpoint frequency, i.e. make it work regardless of the integration time step, etc. I'll update that tomorrow.
Hey Lester, I'm having some PBC-related issues with BSS funnel assignment. If I use a truncated octahedron box and run the setup simulations independently multiple times, sometimes BSS will fail to make a funnel, telling me it can't find any nearby CA atoms. I've had a look at the structures and it's an issue with PBC: equilibration will sometimes end up translating the ligand across the periodic boundary. Here's a ZIP with the input files and a NB. It's odd that MDAnalysis doesn't account for that.
Hi Dom. I'll take a look when I'm back towards the end of next week. We actually use Sire's native search functionality rather than MDAnalysis since it's much faster. Quickly looking at the code, it appears to do the distance search in an infinite Cartesian space, so it isn't taking the periodic boundaries into account. You can pass through a different space, though, so it should be able to handle periodic orthorhombic and triclinic systems too. Orthorhombic systems seem to work fine, but I'm not getting the same results for a cubic system represented as a triclinic space. I'll come back to this next week.
I managed to fix this quickly. (The joys of yet another washout day and having finished all of the books that I brought with me.) The `makeFunnel` code should now work for orthorhombic and triclinic systems. I also found that Sire's built-in centre-of-mass evaluator doesn't consider periodic boundaries either, so I've manually adjusted that too, i.e. for locating the binding site from the ligand CoM.
In adding support for periodic systems I also discovered a subtle issue with the Sire `TriclinicBox` object that could cause memory corruption on copy. This didn't affect reading or writing of triclinic systems, only the internal calculation of distances etc. using a copied space object, which is what happened to be required to solve this problem. As such, you'll need to update both Sire and BioSimSpace to access the new functionality. (It's probably easiest to recreate your environment from scratch.)
Thanks for that, Lester. I've noticed one other thing that I'd overlooked. When I do hydrogen mass repartitioning, use a 4 fs timestep, and deposit hills in half the usual number of steps, the COLVAR and HILLS files still record the CVs and hill heights every 1000 steps, instead of every 500. This basically leads to information loss, with half the hills missing from the record: we deposit every 500 steps but record only every 1000. PLUMED wouldn't be able to reconstruct the resulting FES correctly.
My proposal is, instead of:

```python
# Run the simulation.
total_steps = 2500000
total_cycles = 2500
remaining_steps = 2500000
steps_per_cycle = math.ceil(total_steps / total_cycles)
remaining_cycles = math.ceil(remaining_steps / steps_per_cycle)
start_cycles = total_cycles - remaining_cycles
checkpoint = 100
```
it could be:

```python
# Run the simulation.
total_steps = 2500000
steps_per_cycle = 500  # i.e. the hill deposition rate
total_cycles = math.ceil(total_steps / steps_per_cycle)
remaining_steps = 2500000
remaining_cycles = math.ceil(remaining_steps / steps_per_cycle)
start_cycles = total_cycles - remaining_cycles
checkpoint = 100
```
Thanks for catching this. It will need a little more thought, since I need to be consistent with what I do for the other engines that support metadynamics where I decouple the frequency at which I report to the log file and deposit the hills. (PLUMED's reporting is independent of the engine to which it is coupled.) This would basically mean that I would need to remove the cycles part and just have a checkpoint system that writes to the log and hills at whatever frequency the user specifies, e.g.:
```python
for x in range(start_step, total_steps):
    while x >= report_checkpoint:
        # Write to the log file.
        report_checkpoint += report_interval
    while x >= hills_checkpoint:
        # Write to the hills file.
        hills_checkpoint += hills_interval
```
Actually, it's easier than I thought since the OpenMM state reporter is independent of the cycle logic. I'll just use the hill frequency from the protocol to determine the number of cycles.
This is now fixed. The COLVAR and HILLS files are written at the hill deposition frequency, whereas the OpenMM state reporter uses the report interval from the protocol. This is consistent with what I do for the other metadynamics engines.
Hi Lester, I've been working on a manuscript on a new fun-metaD variation. I had issues with convergence using projection/extent, so I tried using the RMSD of the ligand as a CV instead. It's much better at rebinding the ligand. I call this combination of CVs (projection/RMSD) fun-RMSD. Instead of realigning the protein to calculate the ligand RMSD, I used the p1 set of atoms as part of the indices used to calculate the CV. The p1 atoms don't seem to be affected much by the bias, so I think this is a reasonable approach.

Could you implement fun-RMSD in BioSimSpace? Here are the files that show how fun-RMSD differs from fun-metaD. It's basically just 5 lines of code. The analysis will look exactly the same as well: we integrate out the RMSD to construct the 1D FES along the projection CV. Let me know if you have any questions.
Yes, no problem, this looks super easy to implement.
Just a quick question: Is it possible to do something similar with the regular PLUMED implementation? I guess you could add an RMSD collective variable in the same way, and we already have functionality to write the PDB file that is required (this was needed for the steered MD tutorial). If not, then I worry that we are providing two different implementations, which might not be transparent to the user without printing some warnings or renaming some of the objects. (For example, we could have `FunnelProjExt` and `FunnelProjRMSD` CVs.) It might even be possible to do something by combining the existing `Funnel` and `RMSD` collective variable objects. (Originally I wanted to provide a way of building multi-dimensional CVs, but this is quite tricky in practice.)

Looking at the existing code, I think it would be easiest to have two collective variable objects. Currently, all the docstrings refer to `extent` as the second component so, if we went for a single object, this would need to be updated to `extent` or `RMSD`.
Hi everyone, I'm trying to follow the tutorial here, but the script fails with `AttributeError: module 'BioSimSpace.Metadynamics.CollectiveVariable' has no attribute 'makeFunnel'`.
Hello there,
Could you confirm how you installed BioSimSpace and what version of the code you are using? I imagine that you have an old package that doesn't have the funnel metadynamics functionality.
```python
import BioSimSpace as BSS
print(BSS.__version__)
```
I installed it from the binary install, as the conda install gets stuck while solving the environment. The version is 2020.1.0.
Yes, you'll need to use a more recent version with the `dev` or `workshop` label. These changes were added in early 2021. Could you try the following, which installs for me:
```
conda create -n biosimspace -c conda-forge -c omnia -c michellab/label/workshop biosimspace
conda activate biosimspace
```
Cheers.
I ended up using mamba: `mamba create -n biosimspace -c conda-forge -c omnia -c michellab/label/workshop biosimspace` and it seems to work fine now. Thank you very much!
This is a thread to discuss the creation of a tutorial showing how to implement funnel metadynamics within BioSimSpace.