plinder-org / plinder

Protein Ligand INteraction Dataset and Evaluation Resource
https://plinder.sh
Apache License 2.0

Fix eval v3000 dative bond issue #69

Open yusuf1759 opened 1 month ago

yusuf1759 commented 1 month ago

Context: RDKit automatically saves any molecule with a dative bond (e.g. HEM) as a V3000 SDF. However, OST can't load V3000 files that contain a DATIVE bond. It throws: Exception: Bad bond line 100: Bond type number '9' not within accepted range (1-8).

Fix: Change DATIVE bond to UNSPECIFIED on the fly
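To check whether a given file would hit this error, a plain-text scan of the V3000 bond block is enough. The sketch below is not the code from this PR: `find_rejected_v3000_bonds` and the example mol block are hypothetical, the 1-8 range is taken from the error message above, and it assumes every bond entry sits on a single `M  V30` line (no continuation lines).

```python
# Accepted bond type range, as quoted in the OST error message (1-8).
ACCEPTED_BOND_TYPES = range(1, 9)


def find_rejected_v3000_bonds(molblock):
    """Return (line_number, bond_index, bond_type) for bonds outside 1-8.

    Hypothetical helper: assumes one bond per 'M  V30' line.
    """
    rejected = []
    in_bond_block = False
    for line_number, line in enumerate(molblock.splitlines(), start=1):
        if not line.startswith("M  V30 "):
            continue
        fields = line.split()[2:]  # drop the leading "M" and "V30"
        if fields[:2] == ["BEGIN", "BOND"]:
            in_bond_block = True
        elif fields[:2] == ["END", "BOND"]:
            in_bond_block = False
        elif in_bond_block and len(fields) >= 4:
            # V3000 bond entry: index, type, atom1, atom2, [options]
            bond_index, bond_type = int(fields[0]), int(fields[1])
            if bond_type not in ACCEPTED_BOND_TYPES:
                rejected.append((line_number, bond_index, bond_type))
    return rejected


# Hand-written, hypothetical V3000 block with one type-9 bond.
example = "\n".join([
    "hypothetical",
    "  sketch",
    "",
    "  0  0  0  0  0  0  0  0  0  0999 V3000",
    "M  V30 BEGIN CTAB",
    "M  V30 COUNTS 3 2 0 0 0",
    "M  V30 BEGIN ATOM",
    "M  V30 1 C 0 0 0 0",
    "M  V30 2 N 1 0 0 0",
    "M  V30 3 Fe 2 0 0 0",
    "M  V30 END ATOM",
    "M  V30 BEGIN BOND",
    "M  V30 1 1 1 2",
    "M  V30 2 9 2 3",
    "M  V30 END BOND",
    "M  V30 END CTAB",
    "M  END",
])

print(find_rejected_v3000_bonds(example))  # [(14, 2, 9)]
```

The reported line number is relative to the start of the mol block passed in, so for a multi-record SDF it would need to be offset by the record's position in the file.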

github-actions[bot] commented 1 month ago

Coverage report

File                                            Lines missing
src/plinder/core/index/system.py                362-365
src/plinder/core/loader/__init__.py
src/plinder/core/loader/dataset.py
src/plinder/core/loader/featurizer.py
src/plinder/core/loader/transforms.py
src/plinder/core/loader/utils.py
src/plinder/core/scores/index.py
src/plinder/core/split/plot.py
src/plinder/core/split/utils.py
src/plinder/core/structure/atoms.py             397-401
src/plinder/core/structure/diffdock_utils.py
src/plinder/core/structure/structure.py
src/plinder/core/structure/vendored.py
src/plinder/core/utils/dataclass.py
src/plinder/eval/docking/utils.py               97-100
src/plinder/eval/docking/write_scores.py

This report was generated by python-coverage-comment-action

xrobin commented 1 month ago

Do you have the SDF file so that we could potentially fix this directly in OpenStructure?

xrobin commented 1 month ago

I was able to reproduce the behavior by changing a bond type to 9 in an arbitrary SDF file manually.

There's a fix in OST now (upcoming 2.9.0 release branch) where you can set fault_tolerant=True (on the call to LoadSDF directly, or on the IO profile for LoadEntity) to force OST to read the file with the invalid bond type.

However I'm not sure exactly in which context this came up. RDKit itself doesn't like SDF files with a bond type 9, and if I read it with Chem.SDMolSupplier I get a very similar warning:

[13:20:16] unrecognized query bond type, 9, found on line 16. Using an "any" query.

The bond type is then marked as unspecified in the resulting mol, not as dative:

>>> mol.GetBonds()[5].GetBondType()
rdkit.Chem.rdchem.BondType.UNSPECIFIED

I was also not able to trigger RDKit to save a V3000 file with a dative bond:

>>> mol.GetBonds()[5].SetBondType(Chem.rdchem.BondType.DATIVE)
>>> mol.GetBonds()[5].GetBondType()
rdkit.Chem.rdchem.BondType.DATIVE
>>> print(Chem.MolToMolBlock(mol))
Simple Ligand
     RDKit          2D

  6  6  0  0  1  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    1.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    1.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.0000    1.0000    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    2.0000    2.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.0000   -1.0000    0.0000 Cl  0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0
  1  3  1  0
  1  6  1  0
  2  4  1  0
  3  4  1  0
  4  5  8  0
M  CHG  1   1   1
M  END

It looks like dative bonds (whatever they are) should not end up in SDF files in the first place. Bond type 9 is not part of the SDF standard. I don't know how the invalid file was created.

Regarding the fix: I'm not sure what bond type number results from setting it to unspecified in RDKit. Bond types 4-8 should also not be in SDF files (they are reserved for queries). OpenStructure 2.9.0 will complain about them but read the file anyway. This is unlikely to affect any algorithm in OpenStructure, as we ignore bond order throughout. It might affect other external tools you are using in Plinder, though.
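The V2000 mol block printed earlier in this comment contains such a case: its sixth bond line reads `4  5  8  0`. As a rough way to spot bond types 4-8 in a V2000 block, the sketch below reads the fixed-width counts line and bond lines directly. The helper name `find_query_bonds_v2000` is hypothetical, and it assumes a well-formed block with the usual three header lines before the counts line.

```python
# Bond types 4-8, described above as reserved for queries.
QUERY_BOND_TYPES = range(4, 9)


def find_query_bonds_v2000(molblock):
    """Return (atom1, atom2, bond_type) for V2000 bonds with type 4-8.

    Hypothetical helper: assumes three header lines, then the counts line,
    then the atom block, then the bond block, all in fixed-width columns.
    """
    lines = molblock.splitlines()
    counts = lines[3]
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    first_bond = 4 + n_atoms
    flagged = []
    for line in lines[first_bond:first_bond + n_bonds]:
        # Each bond line starts with three 3-character fields.
        atom1, atom2, bond_type = int(line[0:3]), int(line[3:6]), int(line[6:9])
        if bond_type in QUERY_BOND_TYPES:
            flagged.append((atom1, atom2, bond_type))
    return flagged


# The mol block from the RDKit session above, reproduced line by line.
molblock = "\n".join([
    "Simple Ligand",
    "     RDKit          2D",
    "",
    "  6  6  0  0  1  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.0000    1.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.0000    1.0000    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.0000    2.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -1.0000   -1.0000    0.0000 Cl  0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  2  0",
    "  1  3  1  0",
    "  1  6  1  0",
    "  2  4  1  0",
    "  3  4  1  0",
    "  4  5  8  0",
    "M  CHG  1   1   1",
    "M  END",
])

print(find_query_bonds_v2000(molblock))  # [(4, 5, 8)]
```

A check like this only reports what is in the file. It says nothing about how a downstream tool will interpret those bond types.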