openforcefield / cmiles

Generate canonical molecule identifiers for quantum chemistry database
https://cmiles.readthedocs.io
MIT License

Restrict input to QCSchema and isomeric, explicit H SMILES to avoid ambiguity #13

Closed ChayaSt closed 5 years ago

ChayaSt commented 5 years ago

Description

To ensure that ionization, protonation, and tautomeric states, as well as stereochemistry, are unambiguous, inputs are restricted to:

  1. QCSchema molecules that have symbols, geometry and connectivity fields
  2. Isomeric, explicit hydrogen SMILES
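As a rough illustration of the first restriction, the check below tests whether a QCSchema-style dictionary carries the three required fields. It is only a sketch, not the code in this PR: the function name `check_qcschema_molecule` is hypothetical, and it assumes `geometry` is a flat list of 3N floats and `connectivity` is a list of `[i, j, bond_order]` entries, as in the QCSchema molecule layout.

```python
from typing import Any, Mapping

# Fields this PR requires on a QCSchema molecule.
REQUIRED_FIELDS = ("symbols", "geometry", "connectivity")


def check_qcschema_molecule(mol: Mapping[str, Any]) -> None:
    """Raise ValueError if `mol` lacks the required fields or they disagree.

    Hypothetical helper for illustration; not part of the cmiles API.
    """
    missing = [field for field in REQUIRED_FIELDS if field not in mol]
    if missing:
        raise ValueError(f"QCSchema molecule is missing fields: {missing}")

    n_atoms = len(mol["symbols"])

    # Assumes a flat geometry array: x0, y0, z0, x1, y1, z1, ...
    if len(mol["geometry"]) != 3 * n_atoms:
        raise ValueError("geometry must contain 3 coordinates per atom")

    # Each connectivity entry is assumed to be [atom_i, atom_j, bond_order].
    for bond in mol["connectivity"]:
        i, j = bond[0], bond[1]
        if not (0 <= i < n_atoms and 0 <= j < n_atoms):
            raise ValueError(f"bond {bond} refers to an atom index out of range")
```

A molecule that passes this check still needs real 3D coordinates for stereochemistry to be perceived; the sketch only verifies that the fields exist and are mutually consistent.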

Notable points that this PR has either accomplished or will accomplish.

Questions

codecov-io commented 5 years ago

Codecov Report

Merging #13 into master will increase coverage by 7.5%. The diff coverage is 93.25%.

ChayaSt commented 5 years ago

When indexing molecules already in QCArchive, the atom map should ideally correspond to the input geometry. In all other cases, we should permute the geometry to match the map derived from the canonical atom order. This makes the output inconsistent: the mapped SMILES differs depending on which option is used. @dgasmith, do you think it makes sense to also permute the QCSchema geometry so that all mapped SMILES are consistent? That would make it easier to compare xyz coordinates for future inputs.

to_molecule_id() now takes a flag, permute_xyz. If set to True, it returns the QCSchema molecule with permuted geometry along with the corresponding identifiers.
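To make the permutation concrete, here is a minimal sketch of reordering a QCSchema-style molecule to follow a given atom order. It is not the implementation behind permute_xyz: the function `permute_molecule` and its `new_order` argument are hypothetical, and the sketch assumes a flat `geometry` list and `[i, j, bond_order]` connectivity entries.

```python
from typing import Any, Dict, List


def permute_molecule(mol: Dict[str, Any], new_order: List[int]) -> Dict[str, Any]:
    """Return a copy of `mol` with atoms reordered.

    `new_order[k]` is the input index of the atom placed at position k.
    Hypothetical helper for illustration; not part of the cmiles API.
    """
    n_atoms = len(mol["symbols"])
    if sorted(new_order) != list(range(n_atoms)):
        raise ValueError("new_order must be a permutation of the atom indices")

    # Lookup from old atom index to new atom index, used to relabel bonds.
    old_to_new = {old: new for new, old in enumerate(new_order)}

    symbols = [mol["symbols"][old] for old in new_order]

    # Move each atom's (x, y, z) triple to its new position.
    geometry: List[float] = []
    for old in new_order:
        geometry.extend(mol["geometry"][3 * old : 3 * old + 3])

    # Bond indices must follow the atoms, otherwise connectivity is wrong.
    connectivity = [
        [old_to_new[i], old_to_new[j], order]
        for i, j, order in mol["connectivity"]
    ]

    return {**mol, "symbols": symbols, "geometry": geometry, "connectivity": connectivity}
```

The point of the sketch is that symbols, coordinates, and bond indices all have to move together; permuting only the coordinates would leave the connectivity pointing at the wrong atoms.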

Currently the input is restricted to QCSchema serialized molecules and isomeric, explicit H SMILES. Should we also allow input files as long as they provide 3D geometry and connectivity (mol2, pdb with connectivity, sdf)?

Because coordinate files can contain 2D geometries, inputs remain restricted to QCSchema serialized molecules and isomeric, explicit H SMILES. When strict=False is used, the input SMILES does not need to have explicit hydrogens.