michellab / BioSimSpace

Code and resources for the EPSRC BioSimSpace project.
https://biosimspace.org
GNU General Public License v3.0
77 stars 19 forks source link

[TUTORIAL] Steered MD #192

Open lohedges opened 3 years ago

lohedges commented 3 years ago

This is a thread to discuss the creation of a tutorial showing how to implement steered molecular dynamics in BioSimSpace.

jmichel80 commented 3 years ago

Adding current diagram summarising use case:

steered_md-white

AdeleHardie commented 3 years ago

I have started testing GROMACS+PLUMED for steered MD here.

I looked at BioSimSpace.Metadynamics.CollectiveVariable and the only available ones at the moment are distance, funnel and torsion. So for now I prepared PLUMED input and structure reference files as I did before, and it was simple to add PLUMED used usage to BioSimSpace.Protcol.Production through using BioSimSpace.Process.Gromacs.setArgs().

I have also looked at creating a BioSimSpace.Protcol.Metadynamics with a dummy BioSimSpace.Metadynamics.CollectiveVariable.Distance and simply modify the plumed.dat file to change to RMSD, but it uses METAD whereas I've been using MOVINGRESTRAINT. An alternative would be to simply replace that file completely, and that would save the need to extend the command line arguments of the protocol. That however doesn't change the need for an appropriately numbered and weighted reference structure which I don't think BSS prepares at the moment.

AdeleHardie commented 3 years ago

Following today's discussion, I'm linking the functions I use to prepare input PLUMED files for steered MD. renumber_pdb() and rmsd_reference() here are used to prepare the reference PDB structure, and steered_md_plumed_input() here prepares plumed.dat.

The CV itself (RMSD here) and the MOVINGRESTRAINT in PLUMED are separate so it should be easy to extend this to other CVs.

lohedges commented 3 years ago

Thanks, @AdeleLip, that's really useful. I'll see what's possible to re-implement within BioSimSpace.

FYI: I've just pushed an update that adds a .getFrame method to all BioSimSpace.Process classes. This allows you to extract individual trajectory frames from a process, rather than first having to get the entire trajectory, or pass the full paths to the trajectory and top files to BioSimSpace.Trajectory.getFrame. The indices are validated against the expected range and the method returns None if a corresponding frame isn't found. Note that the SOMD trajectory writer is a bit funky so there is sometimes a mismatch between the number of frames written in the header and what is actually in the file. GROMACS also is sometimes off by one too, i.e. there is one extra frame in the file.

lohedges commented 3 years ago

Yes, most functionality for doing this already exists in BioSimSpace so I'll cook up an example tomorrow. Do you have example input files for the system and target (RMSD reference) that you could share? (Perhaps they are already in the repository, but I missed them.)

Would the system and target normally be separate files to begin with, or could the target be extracted from the system with the coordinates replaced with the those of the desired target conformation? I'm just trying to figure out the logical workflow within BioSimSpace, i.e. would the user load two PDB files to start with, one for the system, one for the target, or could we actually create the target in BioSimSpace? (Apologies for any incorrect notation.)

AdeleHardie commented 3 years ago

Yes I've adjusted the trajectory analysis notebook to use getFrame() and it works great for extracting snapshots.

Here are some example input files. The system files are just results of equilibration, and the target is a crystal structure (with a bit of modification since I use it to prepare other systems).

The way I prepare RMSD reference is to simply take the PDB file and replace the atom indices with the corresponding atom indices from the system itself. I wrote that code very early on (before I used BSS), but it simply loops over all of the residues of the system and the target protein, and within each residue finds atoms with the same names, then changes target protein indices with the ones from the system. This assumes that the proteins have the same sequence start (e.g. in my case the target has to have an ACE cap as well) but can handle things like hydrogens missing from the reference PDB (so you can get residues with indices like [6, 8, 10] where the hydrogens were just ignored).

One of the things that it doesn't handle (which I think it really should) is to check that the sequences match, e.g. that the residues with the same index have the same name. Because of the nature of my system, the renumbering gets cut off when it reaches the NME cap, which in the target is a GLY, so only the N atom matches and gets carried over. It isn't used for RMSD calculations and doesn't affect the alignment for this particular system but this is a patchy solution.

It could also be possible to replace the system coordinates with those extracted from the PDB, and then save the BioSimSpace molecule as a PDB for plumed to use. I get the indices directly from the whole system PDB, which can be useful especially if the protein isn't the first molecule, but I remember you mentioning you have ways to deal with that.

lohedges commented 3 years ago

Thanks for the detailed explanation. For the example files that you provided there are residues in the target PDB that aren't found in the system. Using the Sire/BioSimSpace search functionality I find 284 out of the 301 residues. Is this to be expected? If so, can I safely ignore any unmatched residues and only update the indices of the ones that I find.

With regards to the PDB file required by PLUMED: Do you use the target PDB with the indices updated (and the beta and occupancy columns tweaked)? Would it also work if I wrote a PDB for the corresponding molecule in the system after updating the coordinates? (Perhaps only writing the residues that match.)

AdeleHardie commented 3 years ago

In this case the target protein (WPD loop closed) sequence is a bit longer than the system (WPD loop open) since the C terminus of the WPD loop open is disordered and we are modelling this protein truncated. It is safe to ignore any unmatched residues, since PLUMED doesn't care about continuity, just the atom index. In some of their RMSD examples they only pass the few atoms they want to use for the calculation (we need the others for system alignment, but it illustrates the point).

I use the target PDB with the indices updated and then the beta and occupancy columns tweaked. All PLUMED needs is that the indices are the same between the target and the system, and the two columns let it know which indices to use for alignment and which for RMSD calculation. So if it's easier in BSS to find the matching residues/atoms and update coordinates in some copy of the system that should work perfectly.

AdeleHardie commented 3 years ago

I have now added some more tutorial background in the setup-sMD notebook. Some of it is subject to change depending on what you decide to implement in BSS, but it might also give more background on how I prepare the simulation at the moment.

lohedges commented 3 years ago

Okay, here is an archive containing a hokey script that uses Sire/BioSimSpace to create a PDB file for PLUMED. Just run test.py in the extracted directory, which should create a test.pdb output (which is already included, in case you don't want to run it.) Could you check this file and see if it all looks okay?

The logic is as follows:

The code would be simplified massively if we could be sure that the target PDB file contained all residues from the matching molecule in the system, since we could use the ResIdxAtomNameMatcher from Sire.Mol. This would directly return a mapping between the atoms in the system that match those in the target. Here I guess the target PDB might contain a subset of the molecule, so we might be missing residues, or there could be gaps in the indexing. Let me know if this isn't the case and I'll simplify the code. (The current implementation could be causing by taking the first match it finds.)

I've not extensively tested this, so let me know if there is something obviously wrong with the logic. For example, I'm not sure if I'm correctly setting the occupancy and beta factors for atoms that aren't matched, or whether I'm including too many atoms in the PDB. (It contains the entire molecule from the system.) It should also work for cases where the matched molecule isn't the first in the system (which it is in this example). The PDB writer already uses the atom number, which accumulates across all molecules, rather than the index, which counts from 0 for each molecule.

It should be easy for me to revise this to get something that works a little more robustly.

Cheers.

lohedges commented 3 years ago

Note that if we don't need additional info in the PDB file, e.g. MODEL and CONECT records, then we can use Sire.IO.PDB2 directly and strip these from the output.

AdeleHardie commented 3 years ago

I think the problem at the moment is the beta and occupancy columns - with how they are now, PLUMED will use the atoms that do not exist in the target structure to align the system and the all the atoms that matched to calculate the RMSD.

I can see you are copying the molecule and updating coordinates. In that context the logic once you've created the atom mapping I think should be:

  1. Loop over each atom in the copy molecule
  2. If it does not exist in target, delete it
  3. If it does exist in the target but does not fall in the 'RMSD residue' range, set the occupancy column to 1.00 and beta to 0.00
  4. If it does exist in the target and does fall in the 'RMSD residue' range, set the occupancy column to 0.00 and beta to 1.00

At the moment the alignment will be carried out to the first frame of the system's H atoms and some left over residues, which wouldn't work to correctly align to the target structure, and the RMSD would be calculated to the entire protein, which also wouldn't make steering possible. I'm attaching example renumbered and modified target files to illustrate this (apologies, I should have included them before).

About simplifying the code, in this case it would be easy to make the target PDB contain only what it contained in the system, since I could just also truncate it and leave out the N terminus cap, that would only require deleting some lines. But in the cases where it is swapped around, i.e. the system has the loop closed (and 301 residues) and the target has the loop open (and 284 residues) that would involve adding those residues to the target structure, and I'm not sure if the additional residues would throw off alignments. This is what I meant by saying RMSD is annoying to work with...

I see the script does a lot of things to determine which molecule the target structure is for - could we just ask the users to specify this? From my own usage of the scripts, I wouldn't mind that at all. Then there could be two assumptions to make:

  1. All the residues in the target PDB appear in the system
  2. All the residues indices in the target PDB are the same as in the system (i.e. residue 20 in target is the same as residue 20 in system, even though some atoms may be missing)

I feel like those are reasonable expectations and would only need some modification when preparing target structure. This would be able to handle mismatched sequence length (assuming they are numbered accordingly) and also gaps in the system (full residues and one off atoms). The only issue for the user would be to make sure the residues are numbered correctly if there is an offset (e.g. sequence start is longer in the system or the target) or if there are any gaps. What do you think?

lohedges commented 3 years ago

Ah brilliant, thanks for the clarification, I thought I was missing something obvious ;-) I had also assumed that all atoms in the target PDB that appear in the system were used for the RMSD, rather than using a subset of the atoms using the "RMSD residue range". (Looking again at your code, this is obvious.)

I think both of the assumptions that you suggest are perfectly sensible and would massively simplify the problem. If we are clear with the expectations for the target PDB file, then it will likely save a lot of headaches. The only reason for trying to detect the target molecule was in case the system had been re-ordered since it was originally loaded. I am imagining a situation where the user might load a single molecule, then solvate it, do some other operations, delete things, recombine things, etc. If we make the assumption that all residues in the target PDB are in the system, then this would simplify the search too. (I imagine we could just take the molecule with the closest number of residues, which would likely work in the majority of cases.) We could provide an option to pass the index, which always takes precedence. (For example, this is what I do with the funnel metadyamics code when detecting the ligand.)

With regards to the "RMSD residue range": Is this always contiguous, or are there cases were you might want multiple ranges, or the ability to pass specific indices rather than a range, or a combination of indices and ranges? This would complicate the atom matching a little, but it could be broken up into stages for each contiguous chunk.

I'll rework things on Monday.

AdeleHardie commented 3 years ago

In my experience the RMSD residues have always been continuous, but I'm not sure if that would always be the case. Most other libraries I have used let you give specific indices and/or have some sort of selection language. I think anywhere where the topology is easily accessible, passing indices can give a lot of freedom and makes your work easier. That would also allow for modifications such as using backbone or heavy atoms only, which is common in RMSD calculations.

lohedges commented 3 years ago

Sorry for the slow update on this. I realised the steered-MD is generic to any collective variable supported by PLUMED, with targeted MD (using RMSD as the target) being a special case. I'm just working on making things general so that we can run steered-MD simulations with any of the collective variables that we currently support (besides the funnel).

Just looking at your example files above (from plumed_files.zip) I see that all of the occupancy values are 1.00, but the beta values are either 0.00 or some other floating point number, i.e. not just 0.00 or 1.00. For example:

HETATM    6  O   ACE A   1      74.956  14.219  22.683  1.00  0.00           O
HETATM    5  C   ACE A   1      74.425  13.962  21.610  1.00  0.00           C
HETATM    2  CH3 ACE A   1      75.217  13.607  20.377  1.00  0.00           C
ATOM      7  N   MET A   2      73.102  13.935  21.373  1.00  0.00           N
ATOM      9  CA  MET A   2      72.154  14.306  22.433  1.00  0.00           C
ATOM     22  C   MET A   2      71.075  15.176  21.826  1.00  0.00           C
ATOM     23  O   MET A   2      71.031  15.395  20.626  1.00  0.00           O
ATOM     11  CB  MET A   2      71.579  13.006  23.035  1.00  0.00           C
ATOM     14  CG  MET A   2      72.692  12.144  23.654  1.00  0.00           C
ATOM     17  SD  MET A   2      71.976  10.681  24.447  1.00  0.00           S
ATOM     18  CE  MET A   2      72.318  11.053  26.186  1.00  0.00           C
ATOM     24  N   GLU A   3      70.209  15.654  22.735  1.00 55.62           N
ATOM     26  CA  GLU A   3      69.400  16.881  22.481  1.00 55.73           C
ATOM     37  C   GLU A   3      68.743  16.815  21.109  1.00 54.13           C
ATOM     38  O   GLU A   3      69.379  16.458  20.115  1.00 53.13           O
ATOM     28  CB  GLU A   3      70.285  18.127  22.559  1.00 57.90           C
ATOM     31  CG  GLU A   3      69.546  19.445  22.375  1.00 61.02           C
ATOM     34  CD  GLU A   3      68.487  19.673  23.436  1.00 63.99           C
ATOM     35  OE1 GLU A   3      68.782  19.448  24.628  1.00 65.95           O
ATOM     36  OE2 GLU A   3      67.361  20.084  23.080  1.00 65.20           O
ATOM     39  N   MET A   4      67.462  17.156  21.064  1.00 51.17           N
ATOM     41  CA  MET A   4      66.717  17.149  19.818  1.00 48.94           C
ATOM     54  C   MET A   4      67.270  18.240  18.907  1.00 47.90           C
ATOM     55  O   MET A   4      67.182  18.145  17.684  1.00 45.73           O
...

Has something gone wrong here, or am I missing something? (I know that the two columns just represent weights, so can be any number, but your code and the description above suggests that they will always be 0 or 1.)

AdeleHardie commented 3 years ago

You are looking at renumbered.pdb which only has adjustments made to the atom index. The actual file used by PLUMED (should be in the PLUMED example file) is reference.pdb, where the beta and occupancy columns are all 0.00 or 1.00. The numbers in renumbered.pdb are whatever the original pdb had there. I have this in two steps because I was playing around with which residues to use for RMSD and so I just renumbered it once.

lohedges commented 3 years ago

Ah, perfect. I actually thought I had opened the reference file but my terminal autocomplete must have filled in renumbered instead when I hit tab. Bah :facepalm:

lohedges commented 3 years ago

I've pushed some changes to the feature_steered_md branch. Here I've implemented an RMSD collective variable that can be used for metadynamics simulations as well as steered MD, although I've yet to write the driver code for steered MD. You can pass in a system and reference molecule to the RMSD constructor and it will do the work of generating the reference PDB file for you, e.g.

import BioSimSpace as BSS

# Load a (boring) example system.
system = BSS.IO.readMolecules("amber/ala/*")

# Extract a molecule and translate it for fun.
reference = system[0]
reference.translate(3*[BSS.Units.Length.angstrom])

# Create a RMSD collective variable. Since we don't pass an index for the reference molecule the code
# chooses the molecule in the system with the closest number of residues. Here we've not specified any
# indices for the atoms involved in the RMSD calculation, so all are used. If the user passes a list of
# 'rmsd_indices' then any matching atoms that aren't in this list are used for alignment. We require that
# all of the atoms in the reference are present in the system and that the ordering of residues is the
# same.
rmsd = BSS.Metadynamics.CollectiveVariable.RMSD(system, reference)

# Print the list of PDB strings. These are written to 'reference.pdb' in the working directory when setting up
# a process for metadynamics or steered MD.
print(rmsd.getReferencePDB())

I'm now working on implementing the required moving restraint code for PLUMED. For now I'll probably limit this to a subset of the collective variables that we currently support. (Probably distance and RMSD.)

lohedges commented 3 years ago

Let me know if the constraints are too stringent, e.g. if we want to be able to handle reference molecules that have atoms that aren't in the system. (This is the case for your original example.) It would just be nice to have a way of knowing that we've obtained a good match without the user needing to look at the PDB file, especially for a large molecule. Perhaps we could add a flag to allow a mismatch in the atom number if the user knows that this is the case, e.g. when modelling a truncated version of a molecule.

lohedges commented 3 years ago

I'm also not sure whether it's best to specify the atoms that are involved in the RMSD calculation by index, or to use the residue index as you have done. The atom index seems more flexible, but possibly more work for the user. The rmsd_indices list is in refers to the reference molecule, so the user doesn't need to work out the actual index within the matching molecule in the system.

lohedges commented 3 years ago

Also, do we need any special handling for the case were all atoms in the reference are used for the RMSD, e.g. should both the occupancy and beta values be set to 1? Your examples always match a subset of the atoms, where we have 0/1 for the RMSD atoms, and 1/0 for the others. (The example in the PLUMED docs just uses 1 for both sets of weights.)

AdeleHardie commented 3 years ago

About specifying atoms vs whole residues - I think atom indices are the way to go. In my case I use all heavy atoms and the reference PDB does not have Hs so it's not as apparent, but I believe it's common to not use all atoms in a residue to calculate RMSD (e.g. only backbone atoms).

To handle the cases where all atoms are used for RMSD (and possibly allow more flexibility) the user could pass two lists, one with occupancy and one with beta values? It is possible users may want to both use atoms for alignment and for RMSD calculation, such as when they're using a large chunk of the protein for RMSD and the leftover atoms don't give good alignment. I'm not sure how common that would be though, so there could of course be an option to use 'all' indices that handles this specific case. I doubt there would be use for whole protein RMSD during steered MD, but it might be useful for post run analysis (if it's possible to use CVs for that, I haven't looked closely?)

I don't think having to have all reference atoms in the system is that constraining or difficult, but I guess allowing 'unmatched' atoms to be ignored would give it more flexibility?

lohedges commented 3 years ago

I've now created a BioSimSpace.Protocol.Steering class that can be used to set up a schedule for performing steered MD. I just need to convert this to the correct PLUMED file to implement the moving restraint, as well as making sure that I don't break anything related to our existing PLUMED metadynamics functionality.

lohedges commented 3 years ago

For simplicity I've decided to leave the RMSD CV as is for now. It should work for the purposes of the steered MD (assuming that inputs are edited so that the reference doesn't contain any atoms that aren't in the system). I'll think about making it more flexible after the workshop, i.e. allowing the user to customise the RMSD and alignment weights on a per-atom basis.

lohedges commented 3 years ago

With regards to output from PLUMED when running steered MD: It looks like you can just print exactly the same information as you would when running metadynamics, i.e. the time series values of the collective variables and the associated bias. If so, I should be able to re-use the existing code for getting this information dynamically from a running process.

lohedges commented 3 years ago

I've now got steered MD working with GROMACS in the feature_steered_md branch. I'll test it with your example input (I'll tweak the reference file to only include the atoms in the system) then post an example that you can try locally once I've merged.

AdeleHardie commented 3 years ago

Does the RMSD CV require that the reference have the same number of residues and non H atoms as the system molecule it is being aligned to? When I run the following:

import BioSimSpace as BSS
system = BSS.IO.readMolecules(['data/system.prm7', 'data/system.rst7'])
reference = BSS.IO.readMolecules('data/reference.pdb').getMolecule(0)

#get loop indices for rmsd
rmsd_indices = []
for residue in reference.getResidues():
    if 178<=residue.index()<=184:
        for atom in residue.getAtoms():
            rmsd_indices.append(atom.index())

rmsd_cv = BSS.Metadynamics.CollectiveVariable.RMSD(system, reference, 0, rmsd_indices)

I get the following error: Exception 'SireMol::missing_residue' thrown by the thread 'master'. There is no residue with the number "284" in the layout "{bcad4c83-4b71-4557-b057-5afa1ba6a277}".

The difference between the reference PDB and the protein in the system is that reference PDB does not have H atoms and has 283 residues while the system has 284. However, when I add a line with a dummy atom to make up a 284th residue, and rerun the same code, I get: IncompatibleError: Didn't match all of the atoms in the reference molecule: Found 2307, expected 2308.

Then I added another dummy atom to the 284th residue of the reference, but get the following: IncompatibleError: Didn't match all of the atoms in the reference molecule: Found 2307, expected 2309.

Changing one of the atom names in residue 284 of reference PDB to match one of the atom names in residue 284 of the system changes the error to: IncompatibleError: Didn't match all of the atoms in the reference molecule: Found 2308, expected 2309.

So it looks like at the moment it is expected that the molecules match up perfectly. I thought the CV would be able to handle differences in molecules as long as all the atoms in the reference were present in the system (which was the case in the first scenario). Could you clarify what the requirements are for the reference/system match for the CV?

(for reference, I'm using a version of BSS cloned from the feature_steered_md branch around noon today)

lohedges commented 3 years ago

The expectation is:

1) The reference can be matched to a single molecule in the system. (If the user passes an index, then this is clearly unambiguous.) 2) All atoms in the reference occur within the matching molecule in the system and have the same name. 3) Residues in the reference have the same numbering as those in the matching molecule in the system.

The above handles cases where the molecule has more atoms than the reference, but not cases where the reference does. If we need to support that, then I can remove the check. I guess we can raise a warning, since there might be situations where this indicates that something has gone wrong, or the reference / system is incorrect.

As a quick test, can you remove lines 164-166 of _rmsd.py and see if the output looks sane?

Are these the same example files as previously, if so, I can test here too.

lohedges commented 3 years ago

Atoms are matched by residue number and name, so the ordering within a residue can be different, or atoms missing from within a residue, i.e. heavy atoms only.

AdeleHardie commented 3 years ago

The system is the same, but the target.pdb file has only residues 1-283 (simple deleted the rest), which are the ones present in the system.

This makes me think that something goes wrong when the residue number does not match between the two. I can see why I'm getting the IncompatibleError when I add atoms that aren't present in the system, but then the first case (where reference only had 283 residues) should have worked?

lohedges commented 3 years ago

I'll try reproducing the same PDB here.

AdeleHardie commented 3 years ago

Does it take into account when an entire residue is missing?

lohedges commented 3 years ago

I thought we had these constraints:

All the residues in the target PDB appear in the system All the residues indices in the target PDB are the same as in the system (i.e. residue 20 in target is the same as residue 20 in system, even though some atoms may be missing)

Currently I'm matching residues by their number not their index. Since we are using a PDB file for the reference in most cases I thought that people might delete sections so that the indices would be out of sync, but the numbers would match.

AdeleHardie commented 3 years ago

So what goes wrong in this case? As I see it, it should match reference residue 1 to system residue 1, reference residue 2 to system residue 2, ..., reference residue 283 to system residue 283, and then there is no reference residue 284 to match to system residue 284. Could it be that it trips up by trying to match system 284 to a non-existing reference residue 284? I see it uses Sire.Mol.ResNumAtomNameMatcher() so I can't check the source.

lohedges commented 3 years ago

Let me check, I think I might be matching the other way round for convenience, which would indeed break in this case. I can just swap the order, then invert the results if so.

lohedges commented 3 years ago

Yes, that was the case. I had matched the other way round since it made my life easier with indexing in the rest of the code. I've now reversed it and inverted the dictionary. Could you try again.

AdeleHardie commented 3 years ago

That's works perfectly!

lohedges commented 3 years ago

Fantastic, I'm pleased to hear :+1:

Sorry for all the confusion with this. I was speed coding this morning while a toddler was at nursery and now I'm trying to cram in as much as possible before he wakes from his nap!

Let me know if you catch anything else, or think of anything else that could be simplified. In order to set up a simulation you'll need to do something like the following:

import BioSimSpace as BSS

ns = BSS.Units.Time.nanosecond

# Create a time schedule for the steering.
schedule = [10*ns, 20*ns, ...]

# Create restraints for each stage of the schedule. These are harmonic, and have a "value"
# in units of the corresponding collective varaible.
r0 = BSS.Metadynamics.Restraint(0.5, force_constant=100)
r1 = BSS.Metadynamics.Restraint(0, force_constant=1000)
restraints = [r0, r1, ...]

# Create the protocol. Here CV is your CV from above. You can also pass in the PLUMED
# 'verse' for each CV to say from which direction the restraint acts. This is essentially the
# same as the regular production protocol, with the addition of the extra arguments below,
# i.e. you can tune things like timestep, runtime, temperature, pressure, etc., in the same way.
protocol = BSS.Protocol.Steering(cv, schedule, restraints)

# Now create a GROMACS process. (I could wrap this with BSS.SteeredMD.run if needed.)
process = BSS.Process.Gromacs(system, protocol)

I'll try to test this for your example too.

AdeleHardie commented 3 years ago

No worries! So far the steering protocol is giving me the expected PLUMED outputs and from the first look the reference seems renumbered correctly. Just to clarify, the original RMSD of the system to reference that we specify in the restraints, the CV doesn't calculate that? Are we expecting the user to come in having calculated that prior?

lohedges commented 3 years ago

Hmm, I see. I didn't quite realise that the schedule in your example (and some of the PLUMED examples) use an intitial vaiue at STEP0=0 that equals the value of the CV in the initial configuration. Is this always required, or can you simply use STEP0 at some non-zero step with a sane value of the CV (but not exactly the value at the start). I had just assumed that you could choose whatever schedule you want, although it wouldn't work well if you tried to move the restraint on the CV too abruptly.

If you do need the initial CV, then it wouldn't be too complicated to calculate it for the user and add it to the schedule by default, at least for simple things like distance and RMSD.

Toddler is now awake so I'll think more about this when I next get a chance.

AdeleHardie commented 3 years ago

I think as long as the initial CV value isn't crazy it doesn't matter much. Technically if we look at how the steering schedule is set up, the first value doesn't matter, because the steering force constant is 0 in the first step, so no matter what the whole bias will be 0. I calculate the initial RMSD via mdtraj, which must be slightly different than what PLUMED gets. I will play around with what's easiest for the tutorial (just give a close value or calculate it in the notebook).

lohedges commented 3 years ago

Great. This is super easy for me to do internally with Sire so I'll compute the initial value of the CV when setting it up, then add a cv.getInitialValue() function. I also need to fix the units of the CV (currently unitless). Will do that when I next get a chance.

lohedges commented 3 years ago

I managed to quickly add this. You can now do something like:

nm = BSS.Units.Length.nanometer
r0 = BSS.Metadynamics.Restraint(cv.getInitialValue(), force_constant=0)
r1 = BSS.Metadynamics.Restraint(0.5*nm, force_constant=1000)
lohedges commented 3 years ago

Let me know when you've had a chance to test things. If it mostly looks good then I'll merge across to devel so that I can finish any tweaks for the rest of the tutorials there. (I'm happy to edit mostly working functionality on devel, but don't want to merge in things that are completely broken or nonsense.)

In your tutorial it is probably important to highlight the specification for the target molecule, i.e. how a user should format a PDB (although it should work with any other format too) if they wanted to set something up for another system. I've written some information in the docstrings for the various objects, so it would be good to make sure that this is consistent.

AdeleHardie commented 3 years ago

I'm running into an issue with the initial RMSD evaluation. Running the same code as above:

import BioSimSpace as BSS
system = BSS.IO.readMolecules(['data/system.prm7', 'data/system.rst7'])
reference = BSS.IO.readMolecules('data/reference.pdb').getMolecule(0)

#get loop indices for rmsd
rmsd_indices = []
for residue in reference.getResidues():
    if 178<=residue.index()<=184:
        for atom in residue.getAtoms():
            rmsd_indices.append(atom.index())

rmsd_cv = BSS.Metadynamics.CollectiveVariable.RMSD(system, reference, 0, rmsd_indices)

I get the following error:

~/sire.app/lib/python3.7/site-packages/BioSimSpace-2020.1.0+310.g1ad1b9f-py3.7.egg/BioSimSpace/Metadynamics/CollectiveVariable/_rmsd.py in __init__(self, system, reference, reference_index, rmsd_indices, hill_width, lower_bound, upper_bound, grid, alignment_type, pbc)
    171 
    172         # Work out the value of the RMSD.
--> 173         rmsd = reference._sire_object.evaluate().rmsd(molecule._sire_object, _SireMol.AtomResultMatcher(matches))
    174         self._initial_value = _Length(rmsd.value(), "angstrom")
    175 

StopIteration: Exception 'SireError::invalid_index' thrown by the thread 'master'.
No item at index 3524. Index range is from -2307 to 2306.
Thrown from FILE: /home/adele/software/Sire/corelib/src/libs/SireID/index.cpp, LINE: 120, FUNCTION: void SireID::IndexBase::throwInvalidIndex(qint32) const

Looking at the results from Sire.Mol.AtomResultMatcher(matches) the indices look to be reasonable mapped: AtomResultMatcher( { AtomIdx(1789) : AtomIdx(3524), AtomIdx(1790) : AtomIdx(3526), AtomIdx(1791) : AtomIdx(3527), AtomIdx(1784) : AtomIdx(3515), AtomIdx(1785) : AtomIdx(3516), AtomIdx(1786) : AtomIdx(3517), AtomIdx(1787) : AtomIdx(3519), AtomIdx(1780) : AtomIdx(3507), AtomIdx(1781) : AtomIdx(3509), AtomIdx(1782) : AtomIdx(3511), AtomIdx(1783) : AtomIdx(3514), AtomIdx(1776) : AtomIdx(3502)...

lohedges commented 3 years ago

Could you post the files you are using and I'll try here.(I think these are your existing files with some lines deleted from the PDB.) It's possible that the order of the molecules needs to be reversed, i.e. the opposite problem to what we had before :-)

lohedges commented 3 years ago

I just tried here with some arbitrary molecules and the evaluator doesn't complain when I give indices that are out of range, i.e. it just returns an RMSD of 0. Strange.

AdeleHardie commented 3 years ago

Here are the files I've been using.

I also tried doing it the other way round:

from Sire import Mol
matcher = Mol.ResNumAtomNameMatcher()
matches = matcher.match(reference._sire_object, system.getMolecule(0)._sire_object)
system.getMolecule(0)._sire_object.evaluate().rmsd(reference._sire_object, Mol.AtomResultMatcher(matches))

but I get a similar error:

StopIteration: Exception 'SireError::invalid_index' thrown by the thread 'master'.
Invalid index: 8 - maximum index == 8.
Thrown from FILE: /home/adele/software/Sire/corelib/src/libs/SireVol/coordgroup.cpp, LINE: 1752, FUNCTION: void SireVol::CoordGroupBase::assertValidIndex(quint32) const
AdeleHardie commented 3 years ago

Also looking at the cv.getInitialValue() source made me wonder whether the RMSD calculated is of the whole protein or just the specified range? matches looks to be a dict of all atom mappings so it seems like the former?

lohedges commented 3 years ago

Yes, you're right! I'll put that later once I've worked out the partial molecule match. I'm still surprised that it's failing though, since the indies exist in the molecule.

I'll update the code and report back.