openforcefield / openff-toolkit

The Open Forcefield Toolkit provides implementations of the SMIRNOFF format, parameterization engine, and other tools. Documentation available at http://open-forcefield-toolkit.readthedocs.io
http://openforcefield.org
MIT License
319 stars 93 forks source link

Zero-width spaces in SMILES cause unexpected behaviour. #1807

Open Yoshanuikabundi opened 10 months ago

Yoshanuikabundi commented 10 months ago

Is your feature request related to a problem? Please describe. A zero-width space (\u200b) in a SMILES string causes RDKit to truncate the molecule without an error:

>>> from openff.toolkit import Molecule, RDKitToolkitWrapper
>>> 
>>> #      ZWSs here ‾‾‾‾\/‾‾‾‾‾‾‾‾‾‾\/‾‾‾‾‾‾‾‾‾‾‾‾\/‾‾\/‾‾\/‾‾\/
>>> smiles = "COc1ccc(cc1)​C(=O)/C(C#N)​=N/Nc2ccc(cc2)​[N+]​(=O)​[O-]​"
>>> 
>>> Molecule.from_smiles(
...     smiles, 
...     toolkit_registry=RDKitToolkitWrapper()
... ).n_atoms
16
>>> Molecule.from_smiles(
...     smiles.replace('\u200b', ''), 
...     toolkit_registry=RDKitToolkitWrapper()
... ).n_atoms
36

I ran into this very confusing behavior while copying SMILES strings from specs.net.

Describe the solution you'd like

Ideally, RDKit would raise a parse error rather than just truncate, but we could solve this issue on our end instead.

  1. We could strip zero-width spaces (and other invisible, SMILES-irrelevant unicode symbols) from SMILES strings before passing them on to RDKit, but this may not solve similar problems with codepoints that look like ASCII symbols but aren't.
  2. We could normalize all unicode to the closest ascii-compatible character, dropping anything that is not ascii:
    >>> import unicodedata
    >>>
    >>> normalized = unicodedata.normalize('NFKD', smiles).encode('ascii','ignore').decode()
    >>> normalized
    'COc1ccc(cc1)C(=O)/C(C#N)=N/Nc2ccc(cc2)[N+](=O)[O-]'
    >>> Molecule.from_smiles(
    ...     normalized, 
    ...     toolkit_registry=RDKitToolkitWrapper()
    ... ).n_atoms
    36
  3. We could also just raise an error if any non-ASCII characters are detected. Are non-ASCII characters ever used in SMILES?
j-wags commented 10 months ago

If we choose to do anything on our end, I'm kinda in favor of 3. But since this has taken 5 years to emerge I think the best course would be to raise this on the RDKit issue tracker and see if @greglandrum invites us to submit a PR to fix upstream.

greglandrum commented 10 months ago

I'd be happy to see a PR or bug report with examples to raise an error on stuff like this on the RDKit side.

Given that the parsing code is working purely with 8 bit chars and ASCII, there's no other sensible alternative