Open mwinokan opened 7 months ago
I'm not sure this is something the frontend can solve. We just pass NGL View the URL (obtained from the backend) of the file we want to display. For ligands we pass the contents of an .sdf file, which also come from the backend.
@phraenquex says the solution is to copy the protein-covalent-linking atom into the ligand .sdf file and display. @tdudgeon says this is an XCA issue. How does XCA identify covalent ligands - Daren can possibly help with what can be extracted from SoakDB/PDB (is there a quick win?).
@Waztom to follow up re adding columns to SoakDB => covalents/multiple ligands/what was soaked and what came out. Check re CONECT records in the PDB - can confirm linkage in .cif file.
Definitely an XChemAlign ticket; may have ramifications in loader, b/e, API and f/e.
Two aspects:
Update: NOT @Waztom's action - soakDB columns cannot be touched (for all practical purposes).
Solution: define a convention for how crystallographers must record, in the soakDB compoundSMILES column, that the modelled SMILES is different from what was originally soaked.
Let's do: <modelledSMILES> original:<originalSMILES>
e.g. CCCNNCCC=O original:CCCNNCCC
(Note spaces.)
@mwinokan check with Daren whether this will have any impact on XCE / soakDB end. I predict there won't be.
Metadata does need to have two columns, and what needs to be served to the LHS is the original smiles. The Loader will need to handle the parsing, and the API will need to serve the original (if it exists).
For covalents: original empty.
@mwinokan ask Daren if higher or lower priority than #1432
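A minimal sketch of how the Loader might parse the `<modelledSMILES> original:<originalSMILES>` convention proposed above. The names `SmilesEntry`, `parse_compound_smiles` and `smiles_to_serve` are hypothetical and not part of the existing Loader or API; the only things taken from the thread are the `original:` tag, the space separator, and the rule that the original SMILES is served when it exists.

```python
from typing import NamedTuple, Optional

ORIGINAL_TAG = "original:"  # tag taken from the proposed convention


class SmilesEntry(NamedTuple):
    """Hypothetical container for one compoundSMILES cell."""
    modelled: str
    original: Optional[str]


def parse_compound_smiles(cell: str) -> SmilesEntry:
    """Split a cell into the modelled SMILES and the optional original SMILES."""
    tokens = cell.split()
    if not tokens:
        raise ValueError("empty compoundSMILES value")
    modelled = tokens[0]
    original = None
    for token in tokens[1:]:
        if not token.startswith(ORIGINAL_TAG):
            raise ValueError(f"unexpected token {token!r} in compoundSMILES")
        # An empty value after the tag is treated as "no original given".
        original = token[len(ORIGINAL_TAG):] or None
    return SmilesEntry(modelled, original)


def smiles_to_serve(entry: SmilesEntry) -> str:
    """Return the original SMILES if it exists, otherwise the modelled one."""
    return entry.original if entry.original is not None else entry.modelled
```

With the example from the thread, `parse_compound_smiles("CCCNNCCC=O original:CCCNNCCC")` yields a modelled SMILES of `CCCNNCCC=O` and an original of `CCCNNCCC`, and `smiles_to_serve` returns the latter.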
I looked at the x1775 data.
The ligand CIF file contains no S atom.
The PDB file contains no CONECT records, but does contain this:
LINK SG CYS A 110 C7 LIG A 201 1555 1555 2.22
The SMILES that XCA currently generates for this structure is CCOc1ccc(C(=O)CC)cc1
If this is typical of the data that is generated, then it seems that we can look for the LINK record and assume it corresponds to a covalent bond. It probably also means we don't need to handle the modelled vs original SMILES in order to graft the covalent bond onto the extracted ligand.
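As a rough illustration of "look for the LINK record", the sketch below reads LINK lines using the fixed column positions from the PDB file format (atom name, residue name, chain and residue number for each partner, then the link distance). It is not XCA's actual code: the type and function names are hypothetical, and it assumes the files on disk keep the standard fixed-width layout (the lines quoted in this thread have had their whitespace collapsed).

```python
from typing import List, NamedTuple, Optional


class LinkAtom(NamedTuple):
    name: str
    res_name: str
    chain: str
    res_seq: int


class LinkRecord(NamedTuple):
    atom1: LinkAtom
    atom2: LinkAtom
    length: Optional[float]


def _read_atom(line: str, offset: int) -> LinkAtom:
    # offset is 0 for the first partner and 30 for the second:
    # the two partners use the same field layout, 30 columns apart.
    return LinkAtom(
        name=line[12 + offset:16 + offset].strip(),
        res_name=line[17 + offset:20 + offset].strip(),
        chain=line[21 + offset:22 + offset].strip(),
        res_seq=int(line[22 + offset:26 + offset]),
    )


def parse_link_records(pdb_text: str) -> List[LinkRecord]:
    """Collect every LINK record found in the text of a PDB file."""
    links = []
    for line in pdb_text.splitlines():
        if not line.startswith("LINK  "):
            continue
        line = line.ljust(80)  # pad short lines so slicing is safe
        length_field = line[73:78].strip()
        length = float(length_field) if length_field else None
        links.append(LinkRecord(_read_atom(line, 0), _read_atom(line, 30), length))
    return links
```

Each returned record still lists the two partners in file order; deciding which one belongs to the ligand is a separate step.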
Briefly caught up with Daren and he is concerned that putting multiple SMILES strings in the CompoundSmiles column is likely to cause problems with XCE, which already parses it for delimiters. We did discover an unused column, compoundSMILESproduct, which we could potentially use, but that could cause issues.
@tdudgeon looking at your above comment, does this mean that the soaked smiles is enough?
Having the original soaked SMILES might be good for other purposes, but I don't think it's needed to address the covalent ligand issue as the LINK record gives us all we need.
@tdudgeon Spoke to Ryan and the LINK record can be assumed to be there
A complication will be when there are multiple LINK records, e.g. when the structure is not a monomer, or when there are multiple ligands. Could we try to dig out examples of these?
Just had a lengthy brainstorm with Daren to try and iron out the soakDB spec for both combi-soaks and covalent compounds.
Some small tinkering with XCE will be needed, but Daren and I are fairly confident that the following specification will be feasible:
CompoundCode column will be a semi-colon delimited list of compound code strings e.g. PDB0201;PDB0202
CompoundSMILES column will be a semi-colon delimited list of SMILES strings of the modelled ligands e.g. CN1C=NC2=C1C(=O)N(C(=O)N2C)C;c1ccccc1-c1ccccc1
<modelled SMILES> <soaked SMILES>
(note space)
For covalent combi soaks they would be combined as follows:
<modelled SMILES 1>;<modelled SMILES 2> <soaked SMILES 2>
Be aware that using semi-colon as separator will preclude the use of ChemAxon's cxsmiles extensions, which can include all sorts of characters, including semi-colon. Tab is probably the safest separator to use if that is a concern. See https://docs.chemaxon.com/display/docs/formats_chemaxon-extended-smiles-and-smarts-cxsmiles-and-cxsmarts.md
Thanks for the heads up @tdudgeon. cxsmiles are already known to break the XCE pipeline, and several parts of the codebase have already been written to support the semi-colons, so that's the easiest route for now
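For reference, the semi-colon specification above could be parsed roughly as follows. This is only a sketch under the assumptions stated in the spec (semi-colon between ligands, a single space between modelled and soaked SMILES, no cxsmiles); `LigandSmiles`, `parse_compound_smiles_column` and `parse_compound_codes` are hypothetical names, not existing XCE or XCA functions.

```python
from typing import List, NamedTuple, Optional


class LigandSmiles(NamedTuple):
    modelled: str
    soaked: Optional[str]  # None when only the modelled SMILES is given


def parse_compound_smiles_column(cell: str) -> List[LigandSmiles]:
    """Split a CompoundSMILES cell into one entry per ligand."""
    entries = []
    for part in cell.split(";"):
        tokens = part.split()
        if len(tokens) == 1:
            entries.append(LigandSmiles(tokens[0], None))
        elif len(tokens) == 2:
            entries.append(LigandSmiles(tokens[0], tokens[1]))
        else:
            raise ValueError(f"cannot interpret CompoundSMILES entry {part!r}")
    return entries


def parse_compound_codes(cell: str) -> List[str]:
    """Split a CompoundCode cell into individual compound codes."""
    return [code.strip() for code in cell.split(";")]
```

For the examples in the spec, `parse_compound_codes("PDB0201;PDB0202")` gives two codes, and the two-ligand CompoundSMILES example gives two entries whose `soaked` field is `None`.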
@mwinokan I've mostly got the changes working to add the covalent pseudo-bond, but hit an annoying inconsistency. In entry x1775 the link record looks like this:
LINK SG CYS A 110 C7 LIG A 201 1555 1555 2.21
But in x1776 it looks like this:
LINK C1 LIG A 201 SG CYS A 110 1555 1555 1.87
Yes, the atoms are listed in the other order! There's nothing in the PDB file spec that says anything about the order (nor could there be really). But before I change things to handle this (which will be more complex than it might seem as we cannot assume anything about the residue names) I just wanted to check that we can't enforce some consistency here in XCE to avoid the inconsistency in the first place.
@tdudgeon, just spoke to @phraenquex and he says there is no way to ensure the consistency so we'll have to support both
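One way to support both orderings without relying on residue names is to decide which end is the ligand from residue identity known elsewhere. The sketch below assumes the chain and residue number of each extracted ligand are already available as a set (that input, and the names `LinkEnd` and `orient_link`, are hypothetical and not taken from XCA).

```python
from typing import NamedTuple, Optional, Set, Tuple


class LinkEnd(NamedTuple):
    atom: str
    res_name: str
    chain: str
    res_seq: int


ResidueKey = Tuple[str, int]  # (chain, residue number)


def orient_link(
    end1: LinkEnd, end2: LinkEnd, ligand_residues: Set[ResidueKey]
) -> Optional[Tuple[LinkEnd, LinkEnd]]:
    """Return (ligand_end, protein_end) regardless of the order in the file.

    Returns None when the link does not join exactly one ligand residue
    to one non-ligand residue.
    """
    first_is_ligand = (end1.chain, end1.res_seq) in ligand_residues
    second_is_ligand = (end2.chain, end2.res_seq) in ligand_residues
    if first_is_ligand and not second_is_ligand:
        return end1, end2
    if second_is_ligand and not first_is_ligand:
        return end2, end1
    return None
```

Applied to the x1775 and x1776 records above with `ligand_residues = {("A", 201)}`, both orderings come back with the LIG atom first and the CYS SG atom second.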
@tdudgeon has implemented this but there are edge cases to discuss, e.g. link records between chains and multiple links
@phraenquex says to add all linked protein atoms to the ligand. Links between chains are not a small-molecule problem and we don't need to support them
@tdudgeon please parse all valid link records in the PDB
Currently @tdudgeon is generating separate mol files for the original and modified ligand, @phraenquex says we should not serve the original/unmodified ligand (because it does not correspond to the experiment)
@mwinokan is working with Daren to test having both the modelled and soaked smiles in SoakDB so there is no need to generate it at this point
I'm almost ready to go on the covalent ligands.
Just one question - do we want the PDBs to always be inspected for LINK records, or do we want this to only happen when turned on in the config.yaml?
@tdudgeon I'd have said "always".
If you know of a strong use-case for turning it off, then sure, include an option in config.yaml for doing so, but by default it should be done.
I've left it so that XCA always looks for LINK records. This means the covalent ligand issue is resolved, and it did not need the soaked vs. observed ligand SMILES issue to be addressed. If we do need to address this then I would suggest we open a new ticket. It would need to address the following:
@phraenquex says:
I'm implementing the logic for this and have come across an issue that needs a policy decision.
What I'm trying to do is to write out modeled_smiles and/or soaked_smiles properties, but only if these are different to the SMILES that is generated from the CIF file, e.g. like this:
ligands:
  LIG:
    smiles: Nc1ccc(-c2ncno2)cc1
    compound_code: Z760031264
    modeled_smiles: Nc1ccc(cc1)c2ncno2
However, the SMILES defined in soakDB are not canonical SMILES as generated by RDKit. For instance, in the above case the SMILES from the CIF is Nc1ccc(-c2ncno2)cc1 whilst that from soakDB is Nc1ccc(cc1)c2ncno2. These are the same molecule. It is simple to generate the canonical RDKit form of the soakDB molecule and compare that, but is that what is required, or is there a reason to have what is defined in soakDB present in the data? And if they are different, do we write out the SMILES from soakDB or the RDKit canonical version of that molecule?
In addition to this I find a case where the SMILES in soakDB (CC(CS(=O)(=O)N)c1ccccc1) is not chiral but the one in the CIF (C[C@H](CS(N)(=O)=O)c1ccccc1) is, though other than chirality they are the same molecule. I think it's clear that these should be considered different (subject to the question above) and the soaked/modeled smiles should be included in the output.
@tdudgeon says we do need to use the rdkit canonical smiles, but asks which version needs to be stored in the b/e and surfaced in the f/e?
@phraenquex says that by default the canonical one needs to be served, but if needed we should keep the original string around (b/e and metadata download only for now, and eventually f/e). Call the original SMILES strings something like SoakDB_SMILES_modelled and SoakDB_SMILES_soaked
@tdudgeon please keep all these versions around in the output, even if they are identical on string comparison
Modeled and soaked SMILES from the CompoundSMILES column in soakDB are now included in the output in both original and RDKit canonical forms.
The number of semi-colon separated values must be the same as the number of molecules in the CIF file (and the order is assumed to be same), otherwise an error is thrown.
The same is the case for the compound codes read from the CompoundCode column.
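The count check described above might look something like this. It is an illustrative sketch, not the code that was merged: `match_to_ligands` is a hypothetical helper, and the number of molecules in the CIF file is assumed to be supplied by the caller.

```python
from typing import List


def match_to_ligands(column_name: str, cell: str, n_ligands: int) -> List[str]:
    """Split a semi-colon separated soakDB cell and check it against the CIF.

    The i-th value is assumed to describe the i-th molecule in the CIF file,
    so a mismatch in count is treated as an error rather than guessed at.
    """
    values = [value.strip() for value in cell.split(";")]
    if len(values) != n_ligands:
        raise ValueError(
            f"{column_name}: expected {n_ligands} semi-colon separated "
            f"value(s) to match the CIF file, found {len(values)}"
        )
    return values
```

The same helper would serve both the CompoundSMILES and CompoundCode columns, since the rule is identical for each.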
@tdudgeon has given @kaliif some test data; the change will require a migration of the database
Changes ready to be merged to staging
@kaliif is still busy with resolving the migration conflicts. This needs to be complete before further testing.
In the meantime, we can prepare the test data (Jasmin has test data that needs semicolon separated SMILES and compound codes added to SoakDB).
The loader changes are in staging and need testing
Ryan's latest A71EV2A upload (July 10th 2024 on staging) includes four covalent fragments: x1775, x1776, x1778, x1779 that are not showing as covalently bound to the Cys110 protein residue.
@boriskovar-m2ms do you have what you need to make these changes as part of mint?