Handle soaked vs modelled SMILES from soakdb (e.g. Covalents).

mwinokan commented 7 months ago

Ryan's latest A71EV2A upload (July 10th 2024 on staging) includes four covalent fragments: x1775, x1776, x1778, x1779 that are not showing as covalently bound to the Cys110 protein residue.

@boriskovar-m2ms do you have what you need to make these changes as part of mint?

boriskovar-m2ms commented 4 months ago

Not sure if this is something that frontend can solve. We are just telling NGL View that here is the URL (which we have from backend) to a file we want to display. In case of ligands we have contents of a .sdf file which we also have from backend.

Waztom commented 4 months ago

@phraenquex says the solution is to copy the protein-covalent-linking atom into the ligand .sdf file and display. @tdudgeon says this is an XCA issue. How does XCA identify covalent ligands - Daren can possibly help with what can be extracted from SoakDB/PDB (is there a quick win?).

@Waztom to follow up re adding columns to SoakDB => covalents/multiple ligands/what was soaked and what came out. Check re CONECT records in PDB - can confirm linkage in .cif file.

phraenquex commented 2 months ago

Definitely an XChemAlign ticket; may have ramifications in loader, b/e, API and f/e.

phraenquex commented 2 months ago

Two aspects:

Handle SMILES for ligand that changed between soaking and modelling.

Update: NOT @Waztom's action - soakDB columns cannot be touched (for all practical purposes).

Solution: define a convention for how crystallographers must record in the soakDB compoundSMILES column, that the modelled SMILES is different from what was originally soaked.

Let's do: <modelledSMILES> original:<originalSMILES> e.g CCCNNCCC=O original:CCCNNCCC (Note spaces.)

@mwinokan check with Daren whether this will have any impact on XCE / soakDB end. I predict there won't be.

Metadata does need to have two columns, and what needs to be served to the LHS is the original smiles. The Loader will need to handle the parsing, and the API will need to serve the original (if it exists).

Handle covalents.

For covalents:

[ ] To see how covalent linkage is encoded, look at files at top of ticket. (Also documented somewhere in CCP4 world.)
[ ] For the SDF file, copy the covalently linked atom from the protein PDB into the SDF - that should be all that's needed for NGL to make it look like they're covalently linked.
[ ] If the original is not available (e.g. for PDB downloads), then just leave original empty.

phraenquex commented 2 months ago

@mwinokan ask Daren if higher or lower priority than #1432

tdudgeon commented 2 months ago

I looked at the x1775 data.

The ligand CIF file contains no S atom. The PDB file contains no CONECT records, but does contain this:

LINK         SG  CYS A 110                 C7  LIG A 201     1555   1555  2.22

The SMILES that XCA currently generates for this structure is CCOc1ccc(C(=O)CC)cc1

If this is typical of the data that is generated then it seems that we can look for the LINK record, and assume it corresponds to a covalent bond. And it probably means we don't need to handle the modeled vs original SMILES to be able to graft on the covalent bond to the extracted ligand.
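As a rough illustration of picking up the LINK record above, here is a minimal sketch that reads LINK lines using the fixed column positions from the PDB format description. The names `LinkAtom`, `Link` and `parse_link_records` are hypothetical and are not taken from XCA; alternate locations, insertion codes and symmetry operators are ignored here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LinkAtom:
    atom: str
    resname: str
    chain: str
    resseq: int


@dataclass(frozen=True)
class Link:
    first: LinkAtom
    second: LinkAtom
    length: Optional[float]


def _read_atom(line: str, start: int) -> LinkAtom:
    # start is the 0-based offset of the atom-name field:
    # 12 for the first atom (columns 13-16), 42 for the second (columns 43-46).
    return LinkAtom(
        atom=line[start:start + 4].strip(),
        resname=line[start + 5:start + 8].strip(),
        chain=line[start + 9:start + 10].strip(),
        resseq=int(line[start + 10:start + 14]),
    )


def parse_link_records(pdb_text: str) -> List[Link]:
    """Return one Link per LINK record found in the PDB text."""
    links = []
    for raw in pdb_text.splitlines():
        # The record name occupies columns 1-6, so this skips e.g. LINKR lines.
        if raw[:6].strip() != "LINK":
            continue
        line = raw.ljust(80)
        length_field = line[73:78].strip()  # columns 74-78, may be blank
        links.append(
            Link(
                first=_read_atom(line, 12),
                second=_read_atom(line, 42),
                length=float(length_field) if length_field else None,
            )
        )
    return links


# The x1775 record quoted above.
EXAMPLE_X1775 = (
    "LINK         SG  CYS A 110                 C7  LIG A 201     1555   1555  2.22"
)
```

Calling `parse_link_records(EXAMPLE_X1775)` gives a single `Link` whose first atom is SG of CYS A 110 and whose second atom is C7 of LIG A 201.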

mwinokan commented 2 months ago

Briefly caught up with Daren and he is concerned that putting multiple SMILES strings in the CompoundSmiles column is likely to cause problems with XCE which does already parse it for delimiters. We did discover an unused column compoundSMILESproduct which we could potentially use but that could cause issues.

@tdudgeon looking at your above comment, does this mean that the soaked smiles is enough?

tdudgeon commented 2 months ago

Having the original soaked SMILES might be good for other purposes, but I don't think it's needed to address the covalent ligand issue as the LINK record gives us all we need.

mwinokan commented 2 months ago

@tdudgeon Spoke to Ryan and the LINK record can be assumed to be there

tdudgeon commented 2 months ago

A complication will be when there are multiple LINK records e.g. when the structure is not a monomer, or when there are multiple ligands. Could we try to dig out examples of these?

mwinokan commented 2 months ago

Just had a lengthy brainstorm with Daren to try and iron out the soakDB spec for both combi-soaks and covalent compounds.

Some small tinkering with XCE will be needed, but Daren and I are fairly confident that the following specification will be feasible:

Combi soaks

The CompoundCode column will be a semi-colon delimited list of compound code strings e.g. PDB0201;PDB0202
The CompoundSMILES column will be a semi-colon delimited list of SMILES strings of the modelled ligands e.g. CN1C=NC2=C1C(=O)N(C(=O)N2C)C;c1ccccc1-c1ccccc1

Covalent (or other example of soaked SMILES not being the same as modelled SMILES)

Each of the semi-colon delimited SMILES sub-strings above can also be space-delimited to communicate a difference between the soaked and modelled smiles i.e. <modelled SMILES> <soaked SMILES> (note space)
If the number of space-delimited elements is just one then the soaked and modelled SMILES are the same

For covalent combi soaks they would be combined as follows:

<modelled SMILES 1>;<modelled SMILES 2> <soaked SMILES 2>
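To make the proposed convention concrete, here is a minimal sketch of how a CompoundSMILES value could be split under this spec. The function name `parse_compound_smiles` is hypothetical, and the sketch assumes plain SMILES with no semi-colons or spaces inside a single SMILES string.

```python
from typing import List, Tuple


def parse_compound_smiles(field: str) -> List[Tuple[str, str]]:
    """Split a CompoundSMILES value into (modelled, soaked) pairs.

    Ligands are separated by ';'. Within one ligand, an optional second
    space-delimited token is the soaked SMILES. With a single token the
    soaked and modelled SMILES are taken to be the same.
    """
    pairs = []
    for part in field.split(";"):
        tokens = part.split()
        if not tokens or len(tokens) > 2:
            raise ValueError(f"cannot interpret CompoundSMILES entry: {part!r}")
        modelled = tokens[0]
        soaked = tokens[1] if len(tokens) == 2 else modelled
        pairs.append((modelled, soaked))
    return pairs


# Combi soak from the spec above: two ligands, soaked == modelled for both.
combi = parse_compound_smiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C;c1ccccc1-c1ccccc1")

# Covalent combi soak: only the second ligand changed after soaking.
covalent_combi = parse_compound_smiles("c1ccccc1;CCCNNCCC=O CCCNNCCC")
```

Here `covalent_combi[1]` is `("CCCNNCCC=O", "CCCNNCCC")`, i.e. modelled first and soaked second, matching the order in the spec.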

tdudgeon commented 2 months ago

Be aware that using semi-colon as separator will preclude the use of ChemAxon's cxsmiles extensions, which can include all sorts of characters, including semi-colon. Tab is probably the safest separator to use if that is a concern. See https://docs.chemaxon.com/display/docs/formats_chemaxon-extended-smiles-and-smarts-cxsmiles-and-cxsmarts.md

mwinokan commented 2 months ago

Thanks for the heads up @tdudgeon. cxsmiles are already known to break the XCE pipeline, and several parts of the codebase have already been written to support the semi-colons so that's the easiest route for now

tdudgeon commented 2 months ago

@mwinokan I've mostly got the changes working to add the covalent pseudo-bond, but hit an annoying inconsistency. In entry x1775 the link record looks like this:

LINK         SG  CYS A 110                 C7  LIG A 201     1555   1555  2.21

But in x1776 it looks like this:

LINK         C1  LIG A 201                 SG  CYS A 110     1555   1555  1.87

Yes, the atoms are listed in the other order! There's nothing in the PDB file spec that says anything about the order (nor could there be really). But before I change things to handle this (which will be more complex than it might seem as we cannot assume anything about the residue names) I just wanted to check that we can't enforce some consistency here in XCE to avoid the inconsistency in the first place.

mwinokan commented 2 months ago

@tdudgeon, just spoke to @phraenquex and he says there is no way to ensure the consistency so we'll have to support both
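One way to support both orders without relying on residue names is to decide which end is the ligand from residues that are already known to be ligands. The sketch below assumes those are available as a set of (chain, residue number) pairs, for example from the ligand CIF; that assumption and the names `End` and `orient_link` are illustrative, not how XCA necessarily does it.

```python
from typing import NamedTuple, Optional, Set, Tuple


class End(NamedTuple):
    atom: str
    resname: str
    chain: str
    resseq: int


def orient_link(
    a: End, b: End, ligand_residues: Set[Tuple[str, int]]
) -> Optional[Tuple[End, End]]:
    """Return (protein_end, ligand_end), whichever order the record used.

    Returns None when the link is not protein-to-ligand, i.e. when
    neither end or both ends belong to a known ligand residue.
    """
    a_is_ligand = (a.chain, a.resseq) in ligand_residues
    b_is_ligand = (b.chain, b.resseq) in ligand_residues
    if a_is_ligand == b_is_ligand:
        return None
    return (b, a) if a_is_ligand else (a, b)


ligands = {("A", 201)}

# x1775 lists the protein atom first, x1776 lists the ligand atom first.
x1775 = orient_link(End("SG", "CYS", "A", 110), End("C7", "LIG", "A", 201), ligands)
x1776 = orient_link(End("C1", "LIG", "A", 201), End("SG", "CYS", "A", 110), ligands)
```

Both calls return the CYS SG atom as the protein end, so downstream code only ever sees one order.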

mwinokan commented 2 months ago

@tdudgeon has implemented this but there are edge cases to discuss, e.g. link records between chains and multiple links

@phraenquex says to add all linked protein atoms to the ligand. Links between chains are not a small-molecule problem and we don't need to support them

@tdudgeon please parse all valid link records in the PDB

Currently @tdudgeon is generating separate mol files for the original and modified ligand, @phraenquex says we should not serve the original/unmodified ligand (because it does not correspond to the experiment)

@mwinokan is working with Daren to test having both the modelled and soaked smiles in SoakDB so there is no need to generate it at this point
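The policy above (add all linked protein atoms to the ligand, ignore links between chains) could be sketched roughly as follows. This reads "links between chains" as a LINK record whose two ends carry different chain IDs, which is only one possible interpretation. The input is assumed to be (protein end, ligand end) pairs that have already been put in that order, and `protein_atoms_by_ligand` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class End(NamedTuple):
    atom: str
    resname: str
    chain: str
    resseq: int


def protein_atoms_by_ligand(
    oriented_links: Iterable[Tuple[End, End]],
) -> Dict[Tuple[str, int], List[End]]:
    """Group linked protein atoms by ligand residue (chain, resseq).

    Links whose ends are in different chains are skipped.
    """
    grouped: Dict[Tuple[str, int], List[End]] = defaultdict(list)
    for protein, ligand in oriented_links:
        if protein.chain != ligand.chain:
            continue  # inter-chain link: not supported
        grouped[(ligand.chain, ligand.resseq)].append(protein)
    return dict(grouped)


# Made-up example: one ligand with two links, plus one inter-chain link.
example_links = [
    (End("SG", "CYS", "A", 110), End("C7", "LIG", "A", 201)),
    (End("NZ", "LYS", "A", 45), End("C2", "LIG", "A", 201)),
    (End("SG", "CYS", "B", 110), End("C9", "LIG", "A", 201)),
]
atoms = protein_atoms_by_ligand(example_links)
```

With this input the ligand at A 201 ends up with two protein atoms to graft on (CYS 110 SG and LYS 45 NZ), and the chain B link is dropped.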

tdudgeon commented 2 months ago

I'm almost ready to go on the covalent ligands. Just one question - do we want the PDBs to be always inspected for LINK records, or do we want this to only happen when turned on in the config.yaml?

phraenquex commented 2 months ago

@tdudgeon I'd have said "always".

If you know of a strong use-case for turning it off, then sure, include an option in config.yaml for doing so, but by default it should be done.

tdudgeon commented 1 month ago

I've left it so that XCA always looks for LINK records. This means the covalent ligand issue is resolved, and it did not need the soaked vs. observed ligand SMILES issue to be addressed. If we do need to address this then I would suggest we open a new ticket. It would need to address the following:

How to specify this in soakDB (we already have a proposed spec for this)
In what way this soaked vs. observed ligand SMILES issue needs to be handled in the front end
How XCA, target loader and API need to change to allow for this

mwinokan commented 1 month ago

@phraenquex says:

use the proposed spec
The API needs to serve both smiles, but the soaked smiles is the one that is shown in the hit navigator (2d drawing). The NGL view will show the modelled molecule. Further spec will be needed to include both smiles in the UI

tdudgeon commented 1 month ago

I'm implementing the logic for this and have come across an issue that needs a policy decision. What I'm trying to do is to write out modeled_smiles and/or soaked_smiles properties but only if these are different to the SMILES that is generated from the CIF file. e.g. like this:

        ligands:
          LIG:
            smiles: Nc1ccc(-c2ncno2)cc1
            compound_code: Z760031264
            modeled_smiles: Nc1ccc(cc1)c2ncno2

However, the SMILES defined in soakDB are not canonical SMILES as generated by RDKit. For instance, in the above case the SMILES from the CIF is Nc1ccc(-c2ncno2)cc1 whilst that from soakDB is Nc1ccc(cc1)c2ncno2. These are the same molecule. It is simple to generate the canonical RDKit form of the soakDB molecule and compare that, but is that what is required, or is there a reason to have what is defined in soakDB present in the data? And if they are different, do we write out the SMILES from soakDB or the RDKit canonical version of that molecule?

In addition to this I find a case where the SMILES in soakDB (CC(CS(=O)(=O)N)c1ccccc1) is not chiral but the one in the CIF (C[C@H](CS(N)(=O)=O)c1ccccc1) is, though other than chirality they are the same molecule. I think it's clear that these should be considered different (subject to the question above) and the soaked/modeled smiles should be included in the output.

mwinokan commented 1 month ago

@tdudgeon says we do need to use the rdkit canonical smiles, but asks which version needs to be stored in the b/e and surfaced in the f/e?

@phraenquex says that by default the canonical one needs to be served, but if needed we should keep the original string around (b/e and metadata download only for now, and eventually f/e). Call the original SMILES string something like SoakDB_SMILES_modelled and SoakDB_SMILES_soaked

modelled SMILES (soakDB)
modelled SMILES (canonical) equivalent to SMILES in CIF
soaked SMILES (soakDB)
soaked SMILES (canonical) might not be necessary

@tdudgeon please keep all these versions around in the output, even if they are identical on string comparison

tdudgeon commented 1 month ago

Modeled and soaked SMILES from the CompoundSMILES column in soakDB are now included in the output in both original and RDKit canonical forms. The number of semi-colon separated values must be the same as the number of molecules in the CIF file (and the order is assumed to be same), otherwise an error is thrown. The same is the case for the compound codes read from the CompoundCode column.
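The count check described above might look roughly like this. It is a sketch only: `pair_with_cif_ligands` and the dictionary keys are hypothetical names, and the ligand names are assumed to have been read from the CIF file already, in order.

```python
from typing import Dict, List


def pair_with_cif_ligands(
    compound_smiles: str, compound_codes: str, cif_ligand_names: List[str]
) -> List[Dict[str, str]]:
    """Pair semi-colon separated soakDB values with CIF molecules by position.

    Raises ValueError when either column has a different number of
    entries than there are molecules in the CIF file.
    """
    smiles = compound_smiles.split(";")
    codes = compound_codes.split(";")
    expected = len(cif_ligand_names)
    if len(smiles) != expected:
        raise ValueError(
            f"CompoundSMILES has {len(smiles)} entries, CIF has {expected} molecules"
        )
    if len(codes) != expected:
        raise ValueError(
            f"CompoundCode has {len(codes)} entries, CIF has {expected} molecules"
        )
    # Order is assumed to match the order of molecules in the CIF file.
    return [
        {"name": name, "compound_code": code, "soakdb_smiles": smi}
        for name, code, smi in zip(cif_ligand_names, codes, smiles)
    ]


paired = pair_with_cif_ligands(
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C;c1ccccc1-c1ccccc1",
    "PDB0201;PDB0202",
    ["LIG1", "LIG2"],
)
```

Passing the same two columns with only one CIF ligand name raises a `ValueError` instead of silently mispairing codes and SMILES.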

mwinokan commented 1 month ago

@tdudgeon has given @kaliif some test data; the changes will require a migration of the database

kaliif commented 1 month ago

Changes ready to be merged to staging

Waztom commented 4 weeks ago

@kaliif is still busy with resolving the migration conflicts. This needs to be complete before further testing.

In the meantime, we can prepare the test data (Jasmin has test data that needs semicolon-separated SMILES and compound codes added to SoakDB).

mwinokan commented 3 weeks ago

The loader changes are in staging and need testing
