Potential issue with representation of aromatic Ns with Hs

baoilleach commented 1 year ago

Attachments: CHEMBL6509.from_cd.cdxml.txt, chembl6509.from_pycdxml.cdxml.txt, CHEMBL6509.mol.txt

pycdxml version: e2fefc82a44bda17b5c91de208947185f36ecaad

While round-trip testing against the RDKit reader (mol -> cdxml -> cansmi), I found that the canonical SMILES mismatched for molecules containing aromatic nitrogens that carry a hydrogen.

Here's a specific example: CHEMBL6509, a MOL file provided by ChEMBL. I've attached the original MOL file, along with the CDXML file generated by PyCDXML. Opened in ChemDraw, that CDXML file looks fine. When RDKit reads it, however, the H on the aromatic nitrogen is missing:

NCC1=C[N]c2ccccc21 # RDKit reading PyCDXML CDXML

At this point, I thought it could be an error in the RDKit reader. However, if I read the original MOL file in ChemDraw and save as CDXML (attached), then RDKit converts it as expected:

NCc1c[nH]c2ccccc12 # RDKit reading ChemDraw CDXML

baoilleach commented 1 year ago

Presumably <n id="4" p="67.57 66.77" Z="24" Element="7" NumHydrogens="0" AS="N"> should have NumHydrogens="1" instead?
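To check the attribute without opening ChemDraw, the node can be parsed with the Python standard library. This is only a sketch: `NODE` is a self-closed copy of the tag quoted above (the real file has a closing tag and surrounding elements), and `describe_node` is a hypothetical helper, not part of PyCDXML.

```python
import xml.etree.ElementTree as ET

# Self-closed copy of the node quoted in this comment (an assumption made so
# the fragment parses on its own; the real CDXML file nests it in a fragment).
NODE = '<n id="4" p="67.57 66.77" Z="24" Element="7" NumHydrogens="0" AS="N"/>'


def describe_node(xml_fragment: str) -> dict:
    """Return the id, atomic number and NumHydrogens value of a CDXML <n> node."""
    node = ET.fromstring(xml_fragment)
    num_h = node.get("NumHydrogens")  # None if the attribute is absent
    return {
        "id": node.get("id"),
        "element": int(node.get("Element")),
        "num_hydrogens": None if num_h is None else int(num_h),
    }


info = describe_node(NODE)
print(info)
```

For the node above this reports element 7 (nitrogen) with a hydrogen count of 0, which is the value in question.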

baoilleach commented 1 year ago

Here's the fix - I'll submit a PR:

diff --git a/pycdxml/cdxml_converter/rdkit_chemdraw.py b/pycdxml/cdxml_converter/rdkit_chemdraw.py
index d38e99c..5e48a1b 100644
--- a/pycdxml/cdxml_converter/rdkit_chemdraw.py
+++ b/pycdxml/cdxml_converter/rdkit_chemdraw.py
@@ -98,7 +98,7 @@ def mol_to_document(mol: Chem.Mol, chemdraw_style: dict = None, conformer_id: in
         props = {"p": p, "Z": str(20 + object_id), "Element": str(atom.GetAtomicNum())}

         if atom.GetAtomicNum() != 6:
-            props["NumHydrogens"] = str(atom.GetNumImplicitHs())
+            props["NumHydrogens"] = str(atom.GetTotalNumHs())

         if atom.HasProp('_CIPCode'):
             props["AS"] = atom.GetProp('_CIPCode')
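The difference between the two RDKit calls can be seen in a small standalone check. This is a sketch under assumptions: it starts from the SMILES that RDKit produced from the ChemDraw CDXML rather than from the attached MOL file, and `hydrogen_counts` is a hypothetical helper written for illustration.

```python
from rdkit import Chem


def hydrogen_counts(smiles: str) -> list:
    """Return (symbol, implicit H count, total H count) for each non-carbon atom."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return [
        (atom.GetSymbol(), atom.GetNumImplicitHs(), atom.GetTotalNumHs())
        for atom in mol.GetAtoms()
        if atom.GetAtomicNum() != 6  # same filter as in mol_to_document
    ]


# SMILES taken from the "RDKit reading ChemDraw CDXML" line earlier in the thread.
counts = hydrogen_counts("NCc1c[nH]c2ccccc12")
for symbol, implicit_hs, total_hs in counts:
    print(symbol, "implicit:", implicit_hs, "total:", total_hs)
```

For the bracketed `[nH]` atom RDKit stores the hydrogen as an explicit count, so `GetNumImplicitHs()` returns 0 while `GetTotalNumHs()` returns 1. The primary amine nitrogen gives 2 from both calls, which is why only the aromatic NH case was affected.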

kienerj / pycdxml

Potential issue with representation of aromatic Ns with Hs #20