Graceful handling of files from RCSB PDB

phraenquex commented 9 months ago

XCA should be doing the downloading of files from the PDB. Including extracting event maps, if they were uploaded.

This won't be hard, just needs working out a few arcane details of where to find the relevant information.

And needs YAML keywords.

mwinokan commented 8 months ago

The loader B/E will also need to support these non-XChem PDB's (called manual in the XCA config.yaml).

We need a spec for the observation short-codes, maybe the following:

code_prefix: A71 and XChem experiment: x0310 will result in observations with shortcode such as A71x0310a
code_prefix: A71 and protein data bank (RCSB PDB) code: 8R4I will result in observations with shortcode such as A71_8R4Ia. The _ is a placeholder delimiter.

@phraenquex what do you think?

phraenquex commented 8 months ago

It will almost always be sufficient to use the PDBID directly, e.g. 8r4i or 3de7 or whatnot.

But it makes me realise that what's needed is a general mechanism to map experiment names to short codes ==> new ticket, #1383.

mwinokan commented 7 months ago

The observation shortcodes for data from the RCSB PDB should be formatted as follows:

No prefix needed
Include the PDB shortcode
Dash delimiter between PDB code and observation letter
same observation lowercase letter

E.g. 8R4I-a

@mwinokan to check with Daren

mwinokan commented 7 months ago

Confirmed with Daren: it may be desirable to have a prefix defined, in which case I would suggest using whatever is in the config.yaml.

So the spec for observation shortcodes for structures from the RCSB PDB is:

{prefix}-{pdb_code}-{observation_letter} e.g. : A71-3PJR-a

mwinokan commented 7 months ago

@phraenquex and @mwinokan to resolve this and #1383

mwinokan commented 7 months ago

@phraenquex and I spoke about this yesterday, until we resolve #1383 we believe the prefix is not needed, the user can just change the directory name to include the prefix for now.

@tdudgeon does this mean that an alignment of files solely from RCSB PDBs is easier to achieve?

tdudgeon commented 7 months ago

I think this means that there is no real impact on XCA. It's a matter for how the target loader would handle these. @kaliif can you assess this?

Where I say _no real impact on XCA, that's not quite correct because the code_prefix and code_prefix_tooltip are mandatory values for all inputs (including manual) and would be written into the outputs, but the target loader would need to ignore these for manual inputs. We could change this in XCA so that code_prefix and code_prefix_tooltip are not handled for manual inputs, but still there would likely need to be changes in target loader.

mwinokan commented 7 months ago

The user will need to specify a code_prefix but this can be empty. Continue to ignore code_prefix's for manual inputs

The name is taken from the name of the subdirectory for each manual input. All manual inputs should be in one directory and the filename specifies the name as uploaded (not subdirectories).

@tdudgeon please implement the above two in

Target loader will need to accept files without the xchem syntax (e.g. x????) for manual inputs (not model_building). @kaliif check the type flag and generate a shortcode using: {pdb_code}-{letter} where letter is used to specify the observation as with XChem data.

How do we expose metadata about the origin of the PDB (i.e. "this is from the RCSB PDB" or "someone modified this file and sent it to me"?) See #1383

phraenquex commented 7 months ago

Adding to this: needs changes in XCA (@tdudgeon) and loader (@kaliif).

@kaliif: we also need support for descriptions.csv, which should be:

fileroot, free text that describes file If it's present in the subdirectory, then ingest it to B/E.
If not obvious where to store: presumably (for now) as extra column in the observations table.

mwinokan commented 7 months ago

@tdudgeon says XCA work is done:

the PDB files now need to be in a single directory
the user environment has been updated

@kaliif has added changes into staging that will load all 'manual' files.

@mwinokan to work with Jasmin to check the new documentation and try uploading some files to staging.

In theory it is also possible to align a bunch of files from the protein data bank and upload them as a new target. @mwinokan to get Andre to test with Zika helicase.

tdudgeon commented 7 months ago

User docs for this: https://github.com/xchem/xchem-align/blob/master/USER-GUIDE.md#non-diamond-datasets

mwinokan commented 7 months ago

Tooltip to be handled in #1402

mwinokan commented 7 months ago

@tdudgeon says he needs to talk to @ConorFWild because there are a few problems

Alignments are not being made unless the PDB's are labelled as reference structures
Alignments that are generated are not correct w.r.t. what the backend is expecting
Extracting the ligands relies on a CIF for the ligand which is not present in data from the PDB (needs a new discussion). We must demand that the users provide a CIF/.mol (@ConorFWild & @tdudgeon agree). Multiple free tools exist to convert smiles to CIF. To clarify, the CIF is used for bond orders and to generate the molecular graph, the coordinates are taken from the PDB and subsequent alignment.
The ligand residue may not be named LIG in files from the PDB

@phraenquex says that #1332 may help cover 3. and 4.

mwinokan commented 7 months ago

@tdudgeon says this is closer to working than anticipated. Tim has it working for basic RCSB PDB's but the requisite is that there is a ligand .CIF file.

Can the CIF files be reliable downloaded from the PDB? @mwinokan to find out.

@tdudgeon ask Jasmin to share her PDB's and CIF's for RCSB data in the #xchem-align channel

@ConorFWild says that substructure matching between PDB's and CIF's can be complex (e.g. symmetry edge cases and inconsistency in CIF formatting)

mwinokan commented 6 months ago

Re: obtaining .cif files for ligands in the PDB

@tdudgeon

For a given PDB entry: e.g. 6VUQ for CHIKV_Mac (pdb link) under 'small molecules' there is the button to "download ideal coordinates CDD file" which requests the .cif of the ligand. By using the ligand name "APR" the .cif can also be requested from the Download: ligands page. The user will then have to rename the .cif to match the PDB for alignment / upload.

mwinokan commented 6 months ago

@tdudgeon says that the .cif from the PDB download described above is not immediately compatible. Tim suggests: Treat the CONECT records in the PDB as the 'truth' unless overridden in the config.yaml by a smiles string or .cif specified by the user.

@ConorFWild says the curator needs clear instructions as to which ligand file inputs are required so we don't need to support edge cases.

@mwinokan: The .sdf and .mol2 ligand 'instances' from the pdb entry can be downloaded with bond order. If the user provides these, is that enough? The .mol2 files does not have the kekulization but the .sdf does @mwinokan to check with Jenke/Ed whether they want kekulisation

@phraenquex the ligand coordinates and bond orders should be taken from the .sdf obtained from the PDB download. This way the curator can specify exactly which observation they want

mwinokan commented 6 months ago

@tdudgeon I have confirmed with Jenke. Aromatic is fine, no need for kekulisation

tdudgeon commented 6 months ago

I have the handling of PDBs almost ready.

I looked at using the MOL2 or SDF files from PDB but those aren't actually much use. Both, like the CIF, are just in a random bit of space, and don't correspond to the real coordinates. The SDF does not include atom labels so makes it difficult to match the atoms from the PDB with the coords to the atoms in the molecular graph. The MOL2 does include the atom labels, but RDKit's support for MOL2 format is not great so it's probably best to stay away from that. So I've stuck with using the CIF so that the logic can be shared with what's used for the XChem data. But the PDB CIF's are slightly different to the XChem ones and I needed to make a few assumptions that may not hold true for all CIFs. A key one is to be agnostic to the ligand name which for PDB data is not necessarily LIG. Currently the .pdb file needs to be edited to make the ligand called LIG, but that's not needed for the .cif file, which makes the CIF inconsistent with the PDB (that will change once we make the ligand name configurable). You can see the gory details here: https://github.com/xchem/xchem-align/blob/handle_pdbs/src/xchemalign/utils.py#L335-L399

Some remaining issues or improvements

Currently this code is still on a branch. Need to decide whether to roll it out to the users now.
Needs testing with a wider range of data from PDB
Need to implement generating the ligand directly from the PDB where CONECT records are present, and probably use that instead of using the CIF.
Need to consider how to generate the ligand PDB. Up to now we've used RDKit to write the file from the generated ligand, butt that file is not very compatible. Probably better to just generate it directly from the PDB, especially where CONECT records are present (and maybe they can be autogenerated based on the bonds from the generated ligand)
Implement allowing the ligand name to be specified, so as to avoid the need to edit the PDBs

mwinokan commented 6 months ago

@tdudgeon says that there are issues with PDB's that have ligands but no PanDDa event maps, where unnecessary files are written. Should be a quick fix, and then can promote branch.

mwinokan commented 6 months ago

@tdudgeon is waiting for @ConorFWild's (superseding) changes, but the dev for the external PDB's is finished.

The superseding #1311 changes will require target loader changes (@kaliif).

mwinokan commented 6 months ago

@tdudgeon says that the CIF's from RCSB may have inconsistencies w.r.t. the XChem CIF's. I.e. some molecule properties may have different field names, etc. @tdudgeon has tested against all of Jasmin's CHIKV PDB entries, but there may be others that cause issues.

The atom labels in the CIF and PDB file are matches, all observations in the PDB are aligned by XCA and will need review by the data curator

Currently the ligand in the PDB file must be renamed by the data curator to LIG. The spec required is that the data curator must only supply the CIF and the ligand name can be inferred from the data therein. @tdudgeon and @ConorFWild to resolve this please

Once all the fixes have been implemented @mwinokan to coordinate either Jasmin or Ryan to test this and #1311

tdudgeon commented 6 months ago

The first part of handling ligands that are not called LIG has been done. The ligand CIF file is inspected for the ligand name, and the name is written to the metadata as the crystals.<crystal>.crystallographic_files.ligand_cif.ligand_name property e.g. ligand_name: LIG. The next stage is for @ConorFWild to update XCA/LNA to use this to identify the ligand. Then I need to update the molecule generation to use that property to extract the ligand from the PDB.

Until that is done the PDB files need to be manually edited to change the ligand name to LIG (but the CIF files should NOT be edited).

Once all these changes are done the PDB files will not need to be edited, BUT the PDB and the CIF MUST use the same ligand name.

mwinokan commented 6 months ago

@tdudgeon is expecting to roll this out today, but superseding has not been tested against these changes.

mwinokan commented 6 months ago

There are issues with:

Superseding
Missing waters

ConorFWild commented 6 months ago

Specifically:

Missing Waters: Currently chains without binding residues are removed, however waters may be in a water chain of arbitrary name. As such they are removed. Possibly the easiest thing to do is append all waters of any chain within some radius (symmetric) to the ligand chain?

Superseding: it is actually an extremely rare issue that only occurs for ligand neighbourhoods that not only cannot be aligned to any other neighbourhood, but have insufficiently many CAs nearby to be aligned to themselves as well - hence version 2 cannot be aligned to version 1 which leads conformer site generation to believe it is geometrically distinct

mwinokan commented 6 months ago

Missing waters

Waters to be selected by their distance to the ligand chain.

@ConorFWild says that there are workarounds to include waters in non-binding chains and add them to them to the aligned PDB. Decisions need to be made whether:

The waters are added to the ligand chain or kept as their own chain e.g. `W`?

@phraenquex says keep the chains. @ConorFWild says that keeping the chains makes the logic more complicated, but seems the logical choice. If the waters are associated to some far-away chain then some symmetry operations may be unexpectedly applied to them.

@phraenquex asks if expanding the space groups and solving the translational periodic boundary conditions first, and then selecting waters will solve the transformational problems? @ConorFWild's concern is tracing the atoms in the aligned PDB back to their experimental origin. @phraenquex says that knowing exactly which chain the water originated from (in the aligned files) is not crucial (the crystallographic files can be interrogated if needed).

Inconsistencies in chain names should not occur (i.e. able to trace water back to the crystallographic atom index and chain) and but no record of symmetry-image transformation operations for the waters are needed.

Superseding

Error with A71EV2A, if you have a ConfSite that only has a single member because it cannot be aligned to anything else, or even itself (not enough C-alphas). If you supersede that site, it cannot be solved because there is no superposition transform.

Possible fixes:

Increase the radius of the c-alpha selection
Decrease the number of required c-alpha atoms required

Both of these require changes to the most important hyper parameters of the algorithm.

If the new upload has no exemplars of the singleton ConformerSite, you will end up with two ConformerSites, one for each version. If the backend only serves ConformerSites that contain the latest data then the should not be a problem.

@ConorFWild says this is an edge case, so we will deal with the ConformerSite duplication in the backend. See this comment

mwinokan commented 6 months ago

@tdudgeon has found a new (hopefully just technical) bug in the superseding logic and will investigate.

mwinokan commented 6 months ago

Re: superseding for A71EV2A singleton ConformerSites

The discussion between @phraenquex and @ConorFWild concluded that XCA should not have any new logic to deal with these edge cases.

@kaliif could you please change the backend to only serve ConformerSites that contain "live" observation? @ConorFWild please fill @kaliif in on extra details

mwinokan commented 6 months ago

@mwinokan and @tdudgeon to fill in @kaliif on the changes needed for superseding.

@ConorFWild is working on the missing waters, current estimate is it will be ready on Tuesday

mwinokan commented 6 months ago

@kaliif Backend to only serve (through the API) ConformerSites that contain datasets that have not been superseded. The old version should still be stored in the database (for future use in snapshots/downloads referencing old data).

Example YAML from @tdudgeon:

conformer_sites:
  A71EV2A-x0202+A+201+1:
    members: [A71EV2A-x0202/A/201/1]
    reference_ligand_id: A71EV2A-x0202/A/201/1
    residues: [A/72/TYR, A/69/ARG, A/65/CYS, A/66/SER, A/71/HIS, A/64/TYR, A/67/SER,
      A/70/LYS, A/68/ARG]
  A71EV2A-x0202+A+201+2:
    members: [A71EV2A-x0202/A/201/2]
    reference_ligand_id: A71EV2A-x0202/A/201/2
    residues: [A/66/SER, A/64/TYR, A/65/CYS, A/68/ARG, A/72/TYR, A/71/HIS, A/67/SER,
      A/70/LYS, A/69/ARG]

From the above the first ConformerSite should be ignored.

mwinokan commented 6 months ago

@ConorFWild's changes for the waters have been implemented (likely some edge cases missing but done for now)

@kaliif's b/e changes for superseding are in staging

Jasmin will test upload_2 for CHIKV on staging to asses if the superseding (#1311) functionality is complete

m2ms / fragalysis-frontend