openforcefield / cmiles

Generate canonical molecule identifiers for quantum chemistry database
https://cmiles.readthedocs.io
MIT License
23 stars 7 forks source link

[WIP] Fix some problems with atom map indices #15

Closed ChayaSt closed 5 years ago

ChayaSt commented 5 years ago

Description

The PR addresses several issues with atom map indices.

  1. Get an atom map for a molecule to the map indices on the mapped SMILES {MapIdx:Idx} There are 2 ways to get an atom map {MapIdx:atomIdx}:

    • A. Substructure search Problem: For symmetrical molecules, the map indices can flip.
      • Substructure searches should only be used as a way to recover order of molecules that for some reason cannot be reordered to canonical order.
    • B. From map indices on molecule Problem: How do we guarantee that the map indices are the canonical map indices? Solution: Have one function to canonicalize atom indices. Then check that map indices are +1 of atom indices. If not - the map indices are arbitrary and should not be used to generate qcschema geometry This will only work if the initial mapped SMILES was generated with the same toolkit as the canonical order since the canonicalization order algorithms are different. This cannot work for rdkit because there is no way to reorder the atoms. RDKit's canonical order ranking of atoms are the canonical map indices.
  2. Conformations generated from molecules that have map indices are different than ones generated without map indices. From openeye support:

    The coordinates are generally reproducible (i.e the same) because OMEGA usually starts from pre-built fragments from the frag lib. This is done via a text match using the isomeric SMILES. However, molecules with map indices will fail the isomeric SMILES lookup because of the indices themselves in the SMILES string. So therefore, the fragments are regenerated instead via an optimized DG (distance geometry) algorithm, which can lead to differences in the 3D coordinates. To remedy this, you would need to blow away the map indices and then run OMEGA. You can do this by iterating through the atoms and setting the map index for each atom to 0. If you need to keep the map index information, you can simply set this as generic data on each atom before blowing away the map index. Then, you can run omega and post process the results by going back in and resetting the map indices from the generic data.

  3. has_atom_map returned False even if only one atom did not have a map index.

Todos

Notable points that this PR has either accomplished or will accomplish.

Status

codecov-io commented 5 years ago

Codecov Report

Merging #15 into master will decrease coverage by 0.43%. The diff coverage is 94.11%.

ChayaSt commented 5 years ago

@dgasmith, thanks for the review but the request was my mistake. I meant to ask @j-wags since he's taking over maintenance of this repo.

ChayaSt commented 5 years ago

We decided not to worry about self symmetries when generating atom maps as long as the molecules are chemically equivalent.

In get_atom_map, instead of using the mapped SMILES as the SMARTS pattern for substructure search, I create a molecule from the mapped SMILES so the atoms not have the atom map, then canonicalize the order with canonical_order_atoms and used this molecule as the pattern for the substructure search. This seems to give the same order without flipping symmetrical atoms for the molecules I tested, but there is no guarantee that this is exhaustive.

Edit: Actually, it still flips the atoms but that's OK.

ChayaSt commented 5 years ago

The following changes were made in this PR to address the issue in https://github.com/openforcefield/fragmenter/issues/27

  1. When strict is True, to_molecule_ids does not return a mapped SMILE unless the input molecule has geometry
  2. We decided not to worry about symmetric flips that happen with the substructure search. Therefore, get_atom_map will use a substructure search the geometry will be ordered with the atom map.