openforcefield / openff-qcsubmit

Automated tools for submitting molecules to QCFractal
https://openff-qcsubmit.readthedocs.io/en/latest/index.html
MIT License
26 stars 4 forks source link

ConformerRMSDFilter incorrectly orders atoms #223

Closed amcisaac closed 2 weeks ago

amcisaac commented 1 year ago

In the ConformerRMSDFilter, the filter consolidates separate records that all correspond to conformers of the same molecule into a single molecule instance using the following code:

    conformers = [
         molecule.canonical_order_atoms().conformers[0]
        for _, _, molecule, _ in entries
    ]

    [_, _, molecule, _] = entries[0]

    molecule = copy.deepcopy(molecule)
    molecule._conformers = conformers

However, it only canonicalizes the order of the individual conformers, not the underlying molecule, leading to erroneous energies in some molecules. I believe it should be:

    molecule = copy.deepcopy(molecule.canonical_order_atoms())

Which has fixed the erroneous energies in my case.

jthorton commented 1 year ago

Hi @amcisaac, good find. I think that with the default settings on the filter, the results should be the same (see my example below) as the filter only needs the RMSD matrix to be calculated with the conformers in a consistent ordering. This ordering is discarded after the filter is complete as we want the records' IDs that pass the filter only. It's probably safer to have everything in a consistent ordering however so Ill look into updating this!

"Check if missing the canonical ordering of atoms doesn't change the outcome of the filter"
from openff.toolkit.topology import Molecule
from openff.qcsubmit.results.filters import ConformerRMSDFilter
import copy
import numpy as np

# load random molecule
mol = Molecule.from_smiles("CCO/C=C/F", allow_undefined_stereo=True)
mol.generate_conformers(n_conformers=10)
# makes 2 unique conformers locally
print(mol.n_conformers)

# replicate the filter copy the molecule and add reordered conformers
mol2 = copy.deepcopy(mol)
mol2._conformers = mol.canonical_order_atoms().conformers

# make sure they are now different
assert np.allclose(mol.conformers[0].m, mol2.conformers[0].m) is False

filter_method = ConformerRMSDFilter()
# run the filter to get the rmsd matrix
rmsd_wrong_order = filter_method._compute_rmsd_matrix(molecule=mol2)
print(rmsd_wrong_order)

# get the matrix in the consistent ordering
rmsd_right_order = filter_method._compute_rmsd_matrix(molecule=mol)
print(rmsd_right_order)
# make sure they are the same
assert np.allclose(rmsd_wrong_order, rmsd_right_order) is True
amcisaac commented 1 year ago

Ah I see, that makes sense. I copied some of this code to make a filter that discarded high-energy conformers. There, the atom ordering mattered, but if you're only using the RMSD between the coordinates and not the identity of the atoms or bonding pattern, it makes sense that it wouldn't matter. Thanks for looking into it!

amcisaac commented 3 weeks ago

Hi, I wanted to revive this issue as I've run into this behavior again. It looks to me like you are definitely right that, if all the filter was doing was taking the straight RMSD between conformer geometries, the atom order of the overall molecule wouldn't matter. However, I've found that it does matter (sometimes a lot!) for the check_automorphs=True feature (which is the default).

I'm attaching a minimally reproducing example, this is a dataset of one molecule (from the industry benchmark dataset) where all the conformers are very similar (with RMSD < 0.4 A). Using the default behavior of the filter, which re-orders the atoms in the conformers but not the molecule, the filter only finds one conformer to be close to the others. However, when re-ordering the atoms in the molecule and conformers, the RMSD of all conformers is low and the filter correctly eliminates all but one. I'm attaching the dataset (9 conformers), environment, and a notebook showing the behavior.

I am not sure why this didn't matter for the example above. Perhaps just because the molecule is small, it was a lucky choice? Or, perhaps I just found a niche case where it does matter.

molid2002 qm_qm_match.json

# Generic imports
from openff.qcsubmit.results import OptimizationResultCollection
from openff.qcsubmit.results.filters import ConformerRMSDFilter
from openff.toolkit import Molecule, Topology, ForceField
import copy, itertools
import numpy as np

# This is just yoinked from ConformerRMSDFilter
def rmsd_rdkit(molecule,check_automorphs,heavy_atoms_only):

    from rdkit import Chem
    from rdkit.Chem import AllChem

    rdkit_molecule: Chem.RWMol = molecule.to_rdkit()

    if heavy_atoms_only:
        rdkit_molecule = Chem.RemoveHs(rdkit_molecule)

    n_conformers = len(molecule.conformers)
    conformer_ids = [conf.GetId() for conf in rdkit_molecule.GetConformers()]

    rmsd_matrix = np.zeros((n_conformers, n_conformers))

    for i, j in itertools.combinations(conformer_ids, 2):
        if check_automorphs:
            rmsd_matrix[i, j] = AllChem.GetBestRMS(
                rdkit_molecule,
                rdkit_molecule,
                conformer_ids[i],
                conformer_ids[j],
            )

        else:
            rmsd_matrix[i, j] = AllChem.GetConformerRMS(
                rdkit_molecule,
                conformer_ids[i],
                conformer_ids[j],
            )

    rmsd_matrix += rmsd_matrix.T
    return rmsd_matrix
small_test_ds = OptimizationResultCollection.parse_file('qm_qm_match.json')
small_test_ds.n_results # 9 results 

# This is the regular filter
filter_small = small_test_ds.filter(ConformerRMSDFilter(rmsd_tolerance=0.4))
filter_small.n_results # 8 results

# Filter where atoms in the molecule and the conformers were canonically ordered
filter_small_canonical = small_test_ds.filter(ConformerRMSDFilterCanonical(rmsd_tolerance=0.4))
filter_small_canonical.n_results # 1 result

# Canonically ordered atoms in molecule and conformer, but no automorph checking
filter_small_canonical_nosym = small_test_ds.filter(ConformerRMSDFilterCanonical(rmsd_tolerance=0.4,check_automorphs=False))
filter_small_canonical_nosym.n_results # 8 results

# Regular filter, no automorph checking
filter_small_noncanonical_nosym = small_test_ds.filter(ConformerRMSDFilter(rmsd_tolerance=0.4,check_automorphs=False))
filter_small_noncanonical_nosym.n_results # 8 results
small_records = small_test_ds.to_records()

# Manually examining RMSDs
conformers = [
    molecule.canonical_order_atoms().conformers[0]
    for _, molecule in small_records
]

# No re-ordering molecule
mol1_off_no_reorder = copy.deepcopy(small_records[0][1])
mol1_off_no_reorder._conformers = conformers
rmsd_rdkit(mol1_off_no_reorder,True,True)
# array([[0.        , 1.20015246, 0.00510467, 0.76532656, 0.7916278 ,
#        1.35209557, 1.41521657, 1.01059229, 1.35047835],
#       [1.20015246, 0.        , 1.20126336, 1.17971699, 1.18125432,
#        1.01050942, 0.94675621, 1.35049057, 1.02755311],
#       [0.00510467, 1.20126336, 0.        , 0.76575377, 0.79288019,
#        1.35242334, 1.41540002, 1.01078876, 1.3508073 ],
#       [0.76532656, 1.17971699, 0.76575377, 0.        , 0.76627147,
#        1.34621859, 1.42404674, 0.66107904, 1.34775795],
#       [0.7916278 , 1.18125432, 0.79288019, 0.76627147, 0.        ,
#        1.34622026, 1.19592974, 1.01182332, 1.34621856],
#       [1.35209557, 1.01050942, 1.35242334, 1.34621859, 1.34622026,
#        0.        , 0.6436281 , 1.17272522, 0.76631334],
#       [1.41521657, 0.94675621, 1.41540002, 1.42404674, 1.19592974,
#        0.6436281 , 0.        , 1.26531108, 0.64403115],
#       [1.01059229, 1.35049057, 1.01078876, 0.66107904, 1.01182332,
#        1.17272522, 1.26531108, 0.        , 1.17449039],
#       [1.35047835, 1.02755311, 1.3508073 , 1.34775795, 1.34621856,
#        0.76631334, 0.64403115, 1.17449039, 0.        ]])

# Re-ordering both conformers and molecule
mol1_off_canonical_order = copy.deepcopy(small_records[0][1].canonical_order_atoms())
mol1_off_canonical_order._conformers = conformers

rmsd_rdkit(mol1_off_canonical_order,True,True)
# array([[0.00000000e+00, 2.80454897e-01, 5.10467129e-03, 1.41353389e-01,
#        1.41814635e-01, 1.41888909e-01, 2.37335377e-03, 1.41755408e-01,
#        1.41788429e-01],
#       [2.80454897e-01, 0.00000000e+00, 2.85235140e-01, 1.39833201e-01,
#        1.39370589e-01, 1.39297556e-01, 2.82618299e-01, 1.39432974e-01,
#        1.39397939e-01],
#       [5.10467129e-03, 2.85235140e-01, 0.00000000e+00, 1.46184175e-01,
#        1.46641011e-01, 1.46720239e-01, 2.74533358e-03, 1.46587668e-01,
#        1.46620409e-01],
#       [1.41353389e-01, 1.39833201e-01, 1.46184175e-01, 0.00000000e+00,
#        6.41699284e-04, 5.98137050e-04, 1.43539944e-01, 6.31194565e-04,
#        4.92121821e-04],
#       [1.41814635e-01, 1.39370589e-01, 1.46641011e-01, 6.41699284e-04,
#        0.00000000e+00, 4.88418446e-04, 1.43998633e-01, 6.49898551e-04,
#        5.42868861e-04],
#       [1.41888909e-01, 1.39297556e-01, 1.46720239e-01, 5.98137050e-04,
#        4.88418446e-04, 0.00000000e+00, 1.44075764e-01, 2.87133583e-04,
#        2.03034851e-04],
#       [2.37335377e-03, 2.82618299e-01, 2.74533358e-03, 1.43539944e-01,
#        1.43998633e-01, 1.44075764e-01, 0.00000000e+00, 1.43942843e-01,
#        1.43975681e-01],
#       [1.41755408e-01, 1.39432974e-01, 1.46587668e-01, 6.31194565e-04,
#        6.49898551e-04, 2.87133583e-04, 1.43942843e-01, 0.00000000e+00,
#        3.78306306e-04],
#       [1.41788429e-01, 1.39397939e-01, 1.46620409e-01, 4.92121821e-04,
#        5.42868861e-04, 2.03034851e-04, 1.43975681e-01, 3.78306306e-04,
#        0.00000000e+00]])

# Comparing both, without checking automorphs
rmsd_rdkit(mol1_off_no_reorder,False,True)
rmsd_rdkit(mol1_off_canonical_order,False,True)
# Both output:
# array([[0.        , 1.20015246, 0.00510467, 0.76532656, 0.7916278 ,
#        1.35209557, 1.41521657, 1.01059229, 1.35047835],
#       [1.20015246, 0.        , 1.20126336, 1.17971699, 1.18125432,
#        1.01050942, 0.94675621, 1.35049057, 1.02755311],
#       [0.00510467, 1.20126336, 0.        , 0.76575377, 0.79288019,
#        1.35242334, 1.41540002, 1.01078876, 1.3508073 ],
#       [0.76532656, 1.17971699, 0.76575377, 0.        , 0.76627147,
#        1.34621859, 1.42404674, 0.66107904, 1.34775795],
#       [0.7916278 , 1.18125432, 0.79288019, 0.76627147, 0.        ,
#        1.34622026, 1.19592974, 1.01182332, 1.34621856],
#       [1.35209557, 1.01050942, 1.35242334, 1.34621859, 1.34622026,
#        0.        , 0.6436281 , 1.17272522, 0.76631334],
#       [1.41521657, 0.94675621, 1.41540002, 1.42404674, 1.19592974,
#        0.6436281 , 0.        , 1.26531108, 0.64403115],
#       [1.01059229, 1.35049057, 1.01078876, 0.66107904, 1.01182332,
#        1.17272522, 1.26531108, 0.        , 1.17449039],
#       [1.35047835, 1.02755311, 1.3508073 , 1.34775795, 1.34621856,
#        0.76631334, 0.64403115, 1.17449039, 0.        ]])

python environment:

name: debug-rmsd-filter
channels:
  - conda-forge
dependencies:
  - python

  # MM calculations
  - openff-qcsubmit
  - openff-toolkit
  - jupyter
  - nglview