mosdef-hub / mbuild

A hierarchical, component based molecule builder
https://mbuild.mosdef.org
Other
173 stars 80 forks source link

Generate from SMILES + reproducibility #838

Closed rsdefever closed 3 years ago

rsdefever commented 3 years ago

As brought up by @chrisiacovella in a recent meeting, the SMILES --> 3d structure functionality does not generate the same 3d structure each time. I've dug around openbabel a little bit, but I haven't come across any way to specify a seed or anything else that would guarantee the same structure each time.

IMO this is actually a rather big problem. The load from SMILES functionality is very useful, but makes entire workflows entirely un-reproducible if we start from different coordinates every time.

Are there any other mbuild functions that might suffer similar problems? I'm wondering about the PACKMOL functionality -- can any speak to if that is deterministic?

daico007 commented 3 years ago

pretty sure packing function is also non-deterministic, but I am not sure if there is a seed variable somewhere that we can specify.

rsdefever commented 3 years ago

@daico007 I'm not so sure. A quick test, for the simple fill_box anyway, generates the same structure every time. My (extremely limited) understanding is that packmol uses a constrained optimization problem to identify the coordinates. This should result in the same solution every time AFAIK. But perhaps some of the more complicated features of packmol become non-deterministic. Here is a MWE, at least for the fill_box functionality:

import mbuild
ethane = mbuild.load("CC", smiles=True)
ethane.save("ethane.pdb")

ethane = mbuild.load("ethane.pdb")
box = mbuild.fill_box(ethane, n_compounds=500, box=[4, 4, 4])
box.save("box1.pdb")

ethane = mbuild.load("ethane.pdb")
box = mbuild.fill_box(ethane, n_compounds=500, box=[4, 4, 4])
box.save("box2.pdb")

ethane = mbuild.load("ethane.pdb")
box = mbuild.fill_box(ethane, n_compounds=500, box=[4, 4, 4])
box.save("box3.pdb")
daico007 commented 3 years ago

you're right, I just checked all the packing functions and we do specify a seed for them, so the resulting systems are the same every time. Then this is not something we have to worry about, at least for now.

daico007 commented 3 years ago

Also, this may be related to the SMILES issues https://github.com/rdkit/rdkit/issues/2012

rsdefever commented 3 years ago

Also, for anyone who is curious about the lack of reproducibility in the SMILES generation, here is an easy MWE:

import mbuild
ethane = mbuild.load("CC", smiles=True)
ethane.save("ethane1.pdb")

ethane = mbuild.load("CC", smiles=True)
ethane.save("ethane2.pdb")

ethane = mbuild.load("CC", smiles=True)
ethane.save("ethane3.pdb")

A quick diff on the file will show that they all have different coordinates.

rsdefever commented 3 years ago

Also, this may be related to the SMILES issues rdkit/rdkit#2012

Don't we use openbabel for SMILES --> structure?

daico007 commented 3 years ago

True, my mistake

ahy3nz commented 3 years ago

Yeah, it looks like the code openbabel/pybel uses to convert a 2D structure into 3D doesn't have anything immediately apparent for deterministic/reproducible construction unless you mess with the underlying OBBuilder (doesn't seem easily accessible).

It might be better to use rdkit and its random seed for embedding/creating 3d molecules (snippet)

from rdkit import Chem
from rdkit.Chem import AllChem

for i in range(3):
    m = Chem.MolFromSmiles("CC")
    m2 = Chem.AddHs(m)
    AllChem.EmbedMolecule(m2,randomSeed=0)
    print(m2.GetConformer(0).GetPositions())

Getting atoms/bonds from the rdkit mol object, and then getting coordinates from the conformer object seems also like a doable module for any rdkit-mbuild conversion

jennyfothergill commented 3 years ago

With openbabel the coordinates are different but iirc the order of the generated particles is the same. (It's hard to tell with ethane, but we can test with a nonsymmetrical molecule like ethanol.) If the order is the same, I think we could standardize the coordinates based on the first and second particles.

edit: but of course, rdkit may be a better solution!

rsdefever commented 3 years ago

If the order is the same, I think we could standardize the coordinates based on the first and second particles.

I'm actually less sure about this, because IIRC it uses some sort of conformer generation scheme. If that is the case, its not just a simple translation/rotation of the coordinates. But I may be wrong.

Getting atoms/bonds from the rdkit mol object, and then getting coordinates from the conformer object seems also like a doable module for any rdkit-mbuild conversion

Yes, it seems like rdkit may be the way to go. The only other thing we use openbabel for is the energy_minimize. Does it make sense to have rdkit and openbabel or should we try to go with one or the other only?

mikemhenry commented 3 years ago

I'm not sure about what backend we should use for energy_minimize but using rdkit for 3d structure generation sounds like a good idea. If openmm is going to stay a hard dependency for mbuild, then that can be used for 3D structure generation with the added bonus that the forcefiled used for the 3D structure can be user supplied.

umesh-timalsina commented 3 years ago

It looks like we might not need a switch of libraries but rather a smart axes transform such that the positions are uniform. All these coordinates given by openbabel are equidistant as shown in the snippet below.

>>> import mbuild as mb
>>> hydroxy_propylines = [mb.load("CC=CO", smiles=True) for i in range(10)]
>>> norms = []
>>> import numpy as np
>>> for hydroxy_propyline in hydroxy_propylines:
...     norms.append(np.linalg.norm(hydroxy_propyline.xyz - hydroxy_propyline.xyz[0], axis=1))
... 
>>> for norm in norms:
...     for norm_copy in norms:
...            assert np.allclose(norm, norm_copy)

Since, all these particle positions are equidistant from one another. we might need the following as suggested by @rsdefever.

1. Translate the entire molecule so the first atom is at (0, 0, 0)
2. Compute the rotation needed to place the second atom along the (1, 0, 0) vector
3. Compute the rotation needed to place the third atom in the (0, 1, 0) plane

For this to work, the atom indexes should also be the same, needs further investigation.

rsdefever commented 3 years ago

So I looked at this. Here is an example solution. The problem, however, is that although the molecules are "basically" perfectly aligned, there are still some small differences in the coordinates (use diff to see). Now of course these differences are "trivial", but it could prevent it from being exactly reproducible :(. I'm open to other suggestions...

import numpy as np
import mbuild
from scipy.spatial.transform import Rotation

for imol in range(5):
    mol = mbuild.load("C(F)(F)(F)C(F)F", smiles=True)
    mol.save(f"unaligned-{imol}.xyz", overwrite=True)
    coords = np.array(mol.xyz, dtype=np.float64)
    coords = coords - coords[0]
    normal = np.cross(coords[1]-coords[0], coords[2]-coords[0])
    a = np.array([[1, 0, 0], [0, 1, 0]], dtype=np.float64)
    b = np.array([coords[1], normal], dtype=np.float64)
    R, rmsd = Rotation.align_vectors(a, b)
    coords = R.apply(coords)
    mol.xyz = coords
    mol.save(f"aligned-{imol}.xyz", overwrite=True)

Edit: What this code does (1) place first atom at (0, 0, 0), (2) compute the normal vector normal to the plane containing atoms 0, 1, and 2, and (3) compute the rotation needed to make the atom0--atom1 vector point along (1, 0, 0) and the normal to the atom0--atom1--atom2 plane point along (0, 1, 0).

mikemhenry commented 3 years ago

I think maybe we first try and make things better, then we can work out how to improve reproducibility even more.

If rdkit is already a hard dependency then switching to it solves the problem.
If rdkit is going to create a new dependency, then normalizing portions before minimizing energy does help and we should do that.

Having a utility function that normalizes positions is pretty handy and I've seen it used in QCC packages, so it wouldn't hurt to have.

rsdefever commented 3 years ago

@mikemhenry rdkit is not currently a dependency.

I think maybe we first try and make things better, then we can work out how to improve reproducibility even more.

I'm torn on this; I agree in principle, but on the other hand, the problem we have here is that you don't get the exact same result every time you run it and, IMO, being closer to the same result still leaves you were you started -- not the same result and not reproducible. For example, if I start an MC or MD simulation from close to the same point the trajectories will quickly diverge anyway.

mattwthompson commented 3 years ago

SMILES --> 3d structure functionality does not generate the same 3d structure each time

What is the threshold for "same" here? I can think of several variables that would immediately invalidate a strict guarantee of identical coordinates (cheminformatics toolkit, toolkit version, conformer generation method, operating system) even with something stable like RDKit. I don't know if using the same conda environment, same operating system, and same random seeds would be enough. Conformer generation is a huge black box of complexity and this is probably just the tip of the iceberg.

My $0.02 is that ensuring identical initial coordinates is nearly unobtainable and that getting "close enough" (such that a QM or MM optimization will produce reasonably similar results) is a fine contract with the user w/r/t reproducibility. My general opinion is similar for molecular simulation in general; everything (toy systems aside) is going to diverge a few timesteps into a simulation and therefore reproducing the physical properties is left as the only attainable objective. Others may disagree; I just wanted to put my bias out there as it heavily informs this topic.

rsdefever commented 3 years ago

What is the threshold for "same" here?

I'd hope we could get the same exact coordinates given the same seed. This seems like a small ask.

My general opinion is similar for molecular simulation in general; everything (toy systems aside) is going to diverge a few timesteps.

Although I'm aware these systems are Lyapunov unstable, I'm not as pessimistic as you here. Cassandra will generate the exact same output over and over with the same input files and seeds. I just ran one of our (not toy) systems in LAMMPS and it generated identical outputs for over 100 ps of simulation time. If you have a deterministic system, do things in double precision, and run your simulations carefully, you can get the same output. Again, I'm aware that things like compiler options or parallelization schemes can have some impact.

reproducing the physical properties is left as the only attainable objective.

Despite my above commentary I do agree that this is the primary objective.

jpotoff commented 3 years ago

I have to agree with @rsdefever here. In our validation process for GOMC, we have a number of defined test systems that we use to evaluate specific functions in the code. We expect the code to produce exactly the same results at Monte Carlo step. It is a very quick and easy way to make sure you didn't break the code when you added a new feature. It's also very helpful for validating the OpenMP and GPU parts of the code. While predicting physical properties from statistical averages is also useful, it takes a long time, and potentially can hide subtle bugs. If mbuild cannot generate identical coordinates (from the smiles string) every time you run your workflow, it's simply not usable for doing the first pass of validation on a simulation engine. Instead, one would have to load a mol2 file, it seems. That's doable, but not as easy as a SMILES string. Or you do what we've been doing and generate initial configurations for whatever your test cases are and just use those forever.

jennyfothergill commented 3 years ago

https://github.com/openbabel/openbabel/issues/1934 oooh this issue is not new

mikemhenry commented 3 years ago

~In that issue it looks like setting an envar fixes it https://github.com/openbabel/openbabel/issues/1934#issuecomment-644515730~

Sorry missed that PR hasn't been merged yet!

rsdefever commented 3 years ago

@mikemhenry the PR isn't merged...

Also, clunky fix.