Generating mol files from crystallographic output ("ready for MD")

phraenquex commented 1 year ago

@matteoferla knows something about this - link up with Tim and Jenke

matteoferla commented 1 year ago

Janke's username is @jenkescheen

@phraenquex, to clarify, is the issue "what and how is the best way to store bond orders in the structure file for follow analysis?"

To get bond order, I currently match by substructure and atom count the compound without bound order against a table of all the compounds in all the XChem libraries but modified to lack silly stuff like salts (e.g. Cl.CCO(=O)), which is not great. So what is needed is as a minimum would be, IMO

compound SMILES as appears with ideal protonation and regardless of missing atoms —so a covalent would be *CC... as opposed to [Cl]CC...
library origin
libary id
compound as listed in the library Na.CC...

I understand that chirality becomes an issue if one occupancy is on one enantiomer and the other is on an another and that's been addressed, but I don't know how. The library ID and library name aren't needed really, but are useful for queries of the sort "What other compounds like this were present but didnt bind.

However, mmCIF was mentioned so you aren't after some PDB hack like:

REMARK 400 COMPOUND
REMARK 400 SMILES CCC=O
REMARK 400 LIBRARY York3D

Background info dump

My understanding of my involvement in this is to give mmCIF dictionary support, right? So here are some links to relevant mmCIF documentation for discussion. Apologies if any of this is stating the obvious. The verbiage is needed for everyone to be on the same page. So buckle up, these controlled dictionaries make EU bureaucratese seem like a free-flowing bongo circle...

A residue in the CompBiochem sense is a chemical component. A ligand is a non-polymer and non-ion. HETATM means (nearly) nothing in mmCIF.

Protein structures have a mmCIF, but so do chemical component references.

The documentation of the dictionaries has the following structure:

Stuff within a loop_ directive table is a "table". The first _xxx is a "category", while the .xxx part is an "item". More definitions: link.
Categories are clustered in "groups": these do not appear in mmCIF files but super helpful for navigation. Two noteworthy groups are chem comp group and the "pdbx" exchange PDB (which is for metadata basically).
A lot of items xlink with others, see "Child Data Items" at the bottom of the page.
Frequency in the PDB is written at the top (pictured)
Each category/item in green is either optional or compulsory in PDB structures (a 100% or 0.001% frequency basically)
Each category/item in blue is present in the chem compo reference datasets, but no PDB structures
Each category/item in white are okay during deposition, but get stripped anyway? :shrug:
Each item in red are compusory in their category table (loop_), yellow is something.
The (i) on the top right of the card might give extra info, but it might not (i.e. never does). The tabs at the top go to the home page 😵‍💫

In a protein mmCIF there are several useful dictionary fields for ligands. The _chem_comp category has:

.id (3ish-letter code),
.type (controlled yet verbose vocab of stuff like isopeptide crosslinking residue),
.name (anything basically, not IUPAC)

and a few interesting optionals (absent in PDBs):

.pdbx_casnum —pdbx = xlink to other stuff,
.one_letter_code — I assume all tools just assume it is Z/X and ignore this
.pdbx_smiles...

The pdbx_👾👾 means the are xlinks but they might not have a child allocated and you have to guess. But there is also the category _pdbx_chem_comp_descriptor is where all the details are stored for the crosslink. This is the give example:

loop_
    _pdbx_chem_comp_descriptor.comp_id
    _pdbx_chem_comp_descriptor.descriptor
    _pdbx_chem_comp_descriptor.type
    _pdbx_chem_comp_descriptor.program
    _pdbx_chem_comp_descriptor.program_version
     ATP       c1nc(c2c(n1)n(cn2)C3C(C(C(O3)COP(=O)(O)OP(=O)(O)OP(=O)(O)O)O)O)N
           SMILES           OPENEYE  1.5.0

Oddly only Descriptor is compulsory and .type is not restricted. So one could have SMILES, canonical_SMILES, SMILES_wo_ions etc. Inchi etc. are also seen.

In addition to the _pdbx_chem_comp_descriptor category, there is the _pdbx_chem_comp_model_descriptor category: this not in the sense modelling, but in the PDB sense MODEL, i.e. a state.

The docs mention _chem_comp_atom definitions, which seems like a pound-shop version of the _atom_site definitions. But this is for the idealised version of the chemical component and present in reference chem compos files and not really protein ones.

To make everything more confusing the The _chem_mod_atom & _chem_mod_bond categories are utterly non-standard but used within a few CCP4 tools, such as Phoenix.

Lastly, the PDB-traditional, but officially frowned upon _database_PDB_remark.text, aka REMARK in PDBs, is where folk dump notes in a scattered manner. Which would be parsed by tools and not ignored as say # comments.

(Sorry that this turned out rather long!)

phraenquex commented 1 year ago

@matteoferla @tdudgeon some relevant background is here: 2017 Steiner and Tucker (Can't find the more recent reference now.)

@matteoferla I couldn't work out if you're just brain-dumping or were in fact providing guidance on how to attack the problem.

The ACEDRG documentation is maybe most useful? It does mol > CIF; I don't think it does CIF > mol.

I'm trying to find more useful info from the crystallographic community.

phraenquex commented 1 year ago

Conor suggested this solution - I think maybe the last point should have been "write that mol to mol".

Load the cif into python using gemmi.cif.read_file (https://gemmi.readthedocs.io/en/latest/cif.html#reading-a-file)
Use the loop _chem_comp_atom.type_symbol in block comp_LIG to add atoms to a mol (https://openbabel.org/docs/dev/UseTheLibrary/Python_Pybel.html#combining-pybel-with-openbabel-py)
Use the loops _chem_comp_bond.atom_id_1, _chem_comp_bond.atom_id_2 and _chem_comp_bond.type in the comp_LIG to add bonds to the mol
Write that mol to smiles (https://openbabel.org/docs/dev/UseTheLibrary/Python_Pybel.html#id2)

phraenquex commented 1 year ago

Here's an example of a CIF restraint file. ASAP-0000001-001.cif.txt

phraenquex commented 1 year ago

After discussing with @jenkescheen, he points out that the SMILES is typically enough for software packages to make sense of what's in the PDB file. (Though having bond orders explictly defined may be useful anyway... I think?)

But I'm suspecting the problem is not the MOL file itself, but instead the format of the downloaded data overall: what files show up where, the schema of the metadata file, what files/info is missing, etc.

@jenkecheen and @apayne97 are going to discuss this on Thursday, at the ASAP CompChem Core meeting, and update this ticket.

And then next week, we'll probably end up changing the title and scope of this ticket.

@jenkescheen, a VERY VERY important question: WHICH software packages do you use that don't need the MOL file to be complete? If they're under license (e.g. OpenEye), then that's not enough, then we still need to fix this problem.

JenkeScheen commented 1 year ago

The vast majority of our pipeline uses the OpenEye API (see https://github.com/choderalab/asapdiscovery/blob/main/asapdiscovery-data/asapdiscovery/data/openeye.py). If there are any major roadblocks on your side we can make things easier by just deferring to RDKit (or other more open packages) where needed.

tdudgeon commented 1 year ago

There are 2 real problems here.

how to generate SMILES or Molfile. It doesn't really matter which as once you generate a Molfile then you can generate the SMILES with either RDKit or OBabel (and probably lots of other tools too), and we should aim to generate both. The key here is to be able to generate the molecular graph (with bond orders). The CIF referenced in soakdb does contain the graph and bond order info. But neither OBabel (fails to read this type of CIF) nor RDKit (doesn't handle CIF at all) can read this file. It seems that only gemmi can read it. But gemmi cannot write to Molfile or SMILES. It can write to PDB, and, if so, I think (hope) it will write out CONECT records (the ligand PDB files where they exist in the XChem data do contain CONECT records, unlike the ligands present in the protein PDB files). Without CONECT records the PDB of the ligand is pretty useless. This means we should be able to use gemmi to do CIF -> PDB and then RDKit to do PDB -> Mol -> SMILES. A bit hacky, but it should work. Alternative is to use low level parsing of the CIF do build up the molecular graph atom by atom and bond by bond (e.g. Conor's suggestion) using RDKit or Obabel. I typically try to avoid these low level approaches as they can turn out to be buggy and need maintenance, but, in principle it should work.
It's far from clear whether there is always a valid CIF file to use. SoakDB has a property (the RefinementCIF column) that refers to this file, and it seems to exist in all cases where it would be expected to exist. But what happens in the case of there being multiple ligands (e.g. the case of two different ligand molecules, or the same ligand being bound in two places)? Is the SoakDB reference to only one of the files? Is the CIF a multi-molecule CIF (if so how to deconvolute it). According to Conor the RefinementCIF column as the definitive reference to THE CIF file(S) is probably non-sensical.

phraenquex commented 1 year ago

"a bit hacky" - agree, but only in the sense of "inelegant", and not in the sense of "risky". So go ahead with that. (We might be able to convince Gemmi author to immplement cif>mol, but we don't want to have to wait on that.
CIF is definitely a multi-molecule format, and each block in CIF declares which ligand (3-letter code) in the PDB it refers to.
Action: test whether gemmi handles this case correctly. Get files from Daren or Conor.

If the RefinementCIF column is indeed unreliable, that would be for buggy-XCE reasons. But there will be a bullet-proof heuristic of how to find the CIF file using other SoakDB columns (e.g. refinementDir or similar) - it always accompanies the output of refinement. (Unless somebody really implemented some bad shit, which I doubt.)

Get a multi-ligand example from Daren (there are some in the Mac1 project somewhere)

phraenquex commented 1 year ago

@tdudgeon and Conor believe there are no open source tools to convert CIF restraints to PDB with CONECT records.

@tdudgeon says it's "a few hours, max days", to parse the CIF, convert to MOL and SMILES etc.

Root of problem:

the ligand actually modelled in PDB (written out by refinement) does not necessarily have a SMILES string available
the PDB file written by refinement does not necessarily have CONECT records written in for the ligand.
the only place the bonds exist reliably is in the restraints CIF file.

apayne97 commented 1 year ago

So just a few thoughts from the MD / cheminformatics side, honestly a lot of this thread has gone a bit past me and I'm curious to learn more.

1) Somehow in the past the PDB files we've gotten from Fragalysis have correct bond orders specified through duplicated CONECT records. Not ideal but if it worked in the past it can work again?

2) Can someone explain how this is normally done? I can't fathom how you would normally "parse the CIF" if not with a tool that can take in whatever you consider to be the ground truth format and convert it to one of the standard formats?

the ligand actually modelled in PDB (written out by refinement) does not necessarily have a SMILES string available

SMILES can be computed from a sufficiently accurate model using RDKit, etc. What do you mean there isn't one "available"? Also shouldn't whatever source SDF / SMILES that was originally used to in the process of preparing the crystals (i.e. ordering the molecules) be sufficiently accurate? it might have wrong chirality but it would at least have correct bond orders?

phraenquex commented 1 year ago

@apayne97 you ask good questions, but the bit you're not aware of is: I'm anxious that this upload process does NOT break if upstream data or files are missing or incomplete (e.g. no original SMILES, original SMILES not what was modelled, PDB CONECT records missing...). Because that's really super hard to troubleshoot, and anyway irrelevant for the requirement, which is: just produce an accurate smiles and mol.

But it looks like Tim has this under control now.

phraenquex commented 1 year ago

@tdudgeon I imagine this ticket is more relevant for #1076 than the new data model?

In which case, it would be one of the few green-release tickets that should go to the old stack.

m2ms / fragalysis-frontend

Generating mol files from crystallographic output ("ready for MD") #1064

Background info dump