Open phraenquex opened 9 months ago
The loader B/E will also need to support these non-XChem PDB's (called manual
in the XCA config.yaml).
We need a spec for the observation short-codes, maybe the following:
A71
and XChem experiment: x0310
will result in observations with shortcode such as A71x0310a
A71
and protein data bank (RCSB PDB) code: 8R4I
will result in observations with shortcode such as A71_8R4Ia
. The _
is a placeholder delimiter.@phraenquex what do you think?
It will almost always be sufficient to use the PDBID directly, e.g. 8r4i
or 3de7
or whatnot.
But it makes me realise that what's needed is a general mechanism to map experiment names to short codes ==> new ticket, #1383.
The observation shortcodes for data from the RCSB PDB should be formatted as follows:
E.g. 8R4I-a
@mwinokan to check with Daren
Confirmed with Daren: it may be desirable to have a prefix defined, in which case I would suggest using whatever is in the config.yaml
.
So the spec for observation shortcodes for structures from the RCSB PDB is:
{prefix}-{pdb_code}-{observation_letter}
e.g. : A71-3PJR-a
@phraenquex and @mwinokan to resolve this and #1383
@phraenquex and I spoke about this yesterday, until we resolve #1383 we believe the prefix is not needed, the user can just change the directory name to include the prefix for now.
@tdudgeon does this mean that an alignment of files solely from RCSB PDBs is easier to achieve?
I think this means that there is no real impact on XCA. It's a matter for how the target loader would handle these. @kaliif can you assess this?
Where I say _no real impact on XCA, that's not quite correct because the code_prefix
and code_prefix_tooltip
are mandatory values for all inputs (including manual
) and would be written into the outputs, but the target loader would need to ignore these for manual
inputs.
We could change this in XCA so that code_prefix
and code_prefix_tooltip
are not handled for manual
inputs, but still there would likely need to be changes in target loader.
The user will need to specify a code_prefix
but this can be empty. Continue to ignore code_prefix
's for manual inputs
The name is taken from the name of the subdirectory for each manual input. All manual inputs should be in one directory and the filename specifies the name as uploaded (not subdirectories).
@tdudgeon please implement the above two in
Target loader will need to accept files without the xchem syntax (e.g. x????
) for manual inputs (not model_building). @kaliif check the type flag and generate a shortcode using: {pdb_code}-{letter}
where letter is used to specify the observation as with XChem data.
How do we expose metadata about the origin of the PDB (i.e. "this is from the RCSB PDB" or "someone modified this file and sent it to me"?) See #1383
Adding to this: needs changes in XCA (@tdudgeon) and loader (@kaliif).
@kaliif: we also need support for descriptions.csv
, which should be:
fileroot
,free text that describes file
If it's present in the subdirectory, then ingest it to B/E.
If not obvious where to store: presumably (for now) as extra column in theobservations
table.
@tdudgeon says XCA work is done:
@kaliif has added changes into staging that will load all 'manual' files.
@mwinokan to work with Jasmin to check the new documentation and try uploading some files to staging.
In theory it is also possible to align a bunch of files from the protein data bank and upload them as a new target. @mwinokan to get Andre to test with Zika helicase.
Tooltip to be handled in #1402
@tdudgeon says he needs to talk to @ConorFWild because there are a few problems
LIG
in files from the PDB@phraenquex says that #1332 may help cover 3. and 4.
@tdudgeon says this is closer to working than anticipated. Tim has it working for basic RCSB PDB's but the requisite is that there is a ligand .CIF file.
Can the CIF files be reliable downloaded from the PDB? @mwinokan to find out.
@tdudgeon ask Jasmin to share her PDB's and CIF's for RCSB data in the #xchem-align channel
@ConorFWild says that substructure matching between PDB's and CIF's can be complex (e.g. symmetry edge cases and inconsistency in CIF formatting)
@tdudgeon
For a given PDB entry: e.g. 6VUQ for CHIKV_Mac (pdb link) under 'small molecules' there is the button to "download ideal coordinates CDD file" which requests the .cif of the ligand. By using the ligand name "APR" the .cif can also be requested from the Download: ligands page. The user will then have to rename the .cif to match the PDB for alignment / upload.
@tdudgeon says that the .cif from the PDB download described above is not immediately compatible. Tim suggests: Treat the CONECT records in the PDB as the 'truth' unless overridden in the config.yaml by a smiles string or .cif specified by the user.
@ConorFWild says the curator needs clear instructions as to which ligand file inputs are required so we don't need to support edge cases.
@mwinokan: The .sdf and .mol2 ligand 'instances' from the pdb entry can be downloaded with bond order. If the user provides these, is that enough? The .mol2 files does not have the kekulization but the .sdf does @mwinokan to check with Jenke/Ed whether they want kekulisation
@phraenquex the ligand coordinates and bond orders should be taken from the .sdf obtained from the PDB download. This way the curator can specify exactly which observation they want
@tdudgeon I have confirmed with Jenke. Aromatic is fine, no need for kekulisation
I have the handling of PDBs almost ready.
I looked at using the MOL2 or SDF files from PDB but those aren't actually much use. Both, like the CIF, are just in a random bit of space, and don't correspond to the real coordinates. The SDF does not include atom labels so makes it difficult to match the atoms from the PDB with the coords to the atoms in the molecular graph. The MOL2 does include the atom labels, but RDKit's support for MOL2 format is not great so it's probably best to stay away from that. So I've stuck with using the CIF so that the logic can be shared with what's used for the XChem data. But the PDB CIF's are slightly different to the XChem ones and I needed to make a few assumptions that may not hold true for all CIFs. A key one is to be agnostic to the ligand name which for PDB data is not necessarily LIG. Currently the .pdb file needs to be edited to make the ligand called LIG, but that's not needed for the .cif file, which makes the CIF inconsistent with the PDB (that will change once we make the ligand name configurable). You can see the gory details here: https://github.com/xchem/xchem-align/blob/handle_pdbs/src/xchemalign/utils.py#L335-L399
Some remaining issues or improvements
@tdudgeon says that there are issues with PDB's that have ligands but no PanDDa event maps, where unnecessary files are written. Should be a quick fix, and then can promote branch.
@tdudgeon is waiting for @ConorFWild's (superseding) changes, but the dev for the external PDB's is finished.
The superseding #1311 changes will require target loader changes (@kaliif).
@tdudgeon says that the CIF's from RCSB may have inconsistencies w.r.t. the XChem CIF's. I.e. some molecule properties may have different field names, etc. @tdudgeon has tested against all of Jasmin's CHIKV PDB entries, but there may be others that cause issues.
The atom labels in the CIF and PDB file are matches, all observations in the PDB are aligned by XCA and will need review by the data curator
Currently the ligand in the PDB file must be renamed by the data curator to LIG
. The spec required is that the data curator must only supply the CIF and the ligand name can be inferred from the data therein. @tdudgeon and @ConorFWild to resolve this please
Once all the fixes have been implemented @mwinokan to coordinate either Jasmin or Ryan to test this and #1311
The first part of handling ligands that are not called LIG has been done. The ligand CIF file is inspected for the ligand name, and the name is written to the metadata as the crystals.<crystal>.crystallographic_files.ligand_cif.ligand_name
property e.g. ligand_name: LIG
.
The next stage is for @ConorFWild to update XCA/LNA to use this to identify the ligand.
Then I need to update the molecule generation to use that property to extract the ligand from the PDB.
Until that is done the PDB files need to be manually edited to change the ligand name to LIG (but the CIF files should NOT be edited).
Once all these changes are done the PDB files will not need to be edited, BUT the PDB and the CIF MUST use the same ligand name.
@tdudgeon is expecting to roll this out today, but superseding has not been tested against these changes.
There are issues with:
Specifically:
Missing Waters: Currently chains without binding residues are removed, however waters may be in a water chain of arbitrary name. As such they are removed. Possibly the easiest thing to do is append all waters of any chain within some radius (symmetric) to the ligand chain?
Superseding: it is actually an extremely rare issue that only occurs for ligand neighbourhoods that not only cannot be aligned to any other neighbourhood, but have insufficiently many CAs nearby to be aligned to themselves as well - hence version 2 cannot be aligned to version 1 which leads conformer site generation to believe it is geometrically distinct
Waters to be selected by their distance to the ligand chain.
@ConorFWild says that there are workarounds to include waters in non-binding chains and add them to them to the aligned PDB. Decisions need to be made whether:
W
?@phraenquex says keep the chains. @ConorFWild says that keeping the chains makes the logic more complicated, but seems the logical choice. If the waters are associated to some far-away chain then some symmetry operations may be unexpectedly applied to them.
@phraenquex asks if expanding the space groups and solving the translational periodic boundary conditions first, and then selecting waters will solve the transformational problems? @ConorFWild's concern is tracing the atoms in the aligned PDB back to their experimental origin. @phraenquex says that knowing exactly which chain the water originated from (in the aligned files) is not crucial (the crystallographic files can be interrogated if needed).
Inconsistencies in chain names should not occur (i.e. able to trace water back to the crystallographic atom index and chain) and but no record of symmetry-image transformation operations for the waters are needed.
Error with A71EV2A, if you have a ConfSite that only has a single member because it cannot be aligned to anything else, or even itself (not enough C-alphas). If you supersede that site, it cannot be solved because there is no superposition transform.
Possible fixes:
Both of these require changes to the most important hyper parameters of the algorithm.
If the new upload has no exemplars of the singleton ConformerSite, you will end up with two ConformerSites, one for each version. If the backend only serves ConformerSites that contain the latest data then the should not be a problem.
@ConorFWild says this is an edge case, so we will deal with the ConformerSite duplication in the backend. See this comment
@tdudgeon has found a new (hopefully just technical) bug in the superseding logic and will investigate.
Re: superseding for A71EV2A singleton ConformerSites
The discussion between @phraenquex and @ConorFWild concluded that XCA should not have any new logic to deal with these edge cases.
@kaliif could you please change the backend to only serve ConformerSites that contain "live" observation? @ConorFWild please fill @kaliif in on extra details
@mwinokan and @tdudgeon to fill in @kaliif on the changes needed for superseding.
@ConorFWild is working on the missing waters, current estimate is it will be ready on Tuesday
@kaliif Backend to only serve (through the API) ConformerSites that contain datasets that have not been superseded. The old version should still be stored in the database (for future use in snapshots/downloads referencing old data).
Example YAML from @tdudgeon:
conformer_sites:
A71EV2A-x0202+A+201+1:
members: [A71EV2A-x0202/A/201/1]
reference_ligand_id: A71EV2A-x0202/A/201/1
residues: [A/72/TYR, A/69/ARG, A/65/CYS, A/66/SER, A/71/HIS, A/64/TYR, A/67/SER,
A/70/LYS, A/68/ARG]
A71EV2A-x0202+A+201+2:
members: [A71EV2A-x0202/A/201/2]
reference_ligand_id: A71EV2A-x0202/A/201/2
residues: [A/66/SER, A/64/TYR, A/65/CYS, A/68/ARG, A/72/TYR, A/71/HIS, A/67/SER,
A/70/LYS, A/69/ARG]
From the above the first ConformerSite should be ignored.
@ConorFWild's changes for the waters have been implemented (likely some edge cases missing but done for now)
@kaliif's b/e changes for superseding are in staging
Jasmin will test upload_2 for CHIKV on staging to asses if the superseding (#1311) functionality is complete
XCA should be doing the downloading of files from the PDB. Including extracting event maps, if they were uploaded.
This won't be hard, just needs working out a few arcane details of where to find the relevant information.
And needs YAML keywords.