Closed Ruibin-Liu closed 1 year ago
Thanks! It should have been detecting this substitution already based on the MODRES record. Was that not working? If so, we should figure out why.
The reason might be that I was using the pdbx format (.cif file) of its bio-assembly structure. I checked the code for reading the modified residue information: https://github.com/openmm/pdbfixer/blob/92f73cccf6871e406e3c18bf2416afd7e7638951/pdbfixer/pdbfixer.py#L336
That block appears in a raw pdbx file but not in the bio-assembly pdbx file. For the latter, the only way I can see to match non-standard residues to standard ones is to look up the '_entity_poly' block. The good news is that '_entity_poly' also appears in the raw pdbx files. So could we change the algorithm to read the '_entity_poly' block instead?
Since I am using the bio-assembly files for many PDB structures that contain modified residues, I, for now, have to add them manually if they are not in the list already.
Also, the 'MODRES' record is not in the PDB format bio-assembly file and I don't see a way to find out the information in such a file.
That's annoying. I wonder why they don't include the information?
Anyway, PDBx/mmCIF is a problematic file format. (PDB is too, just with different problems.) It's redundant, with multiple places the same information can be stored. And the documentation tends to be ambiguous and sometimes contradictory about where information should be stored.
I implemented it based on the PDB to PDBx/mmCIF Data Item Correspondences documentation, which says you should use _pdbx_struct_mod_residue as the equivalent for MODRES. It looks like you could potentially get the same information from _entity_poly.pdbx_seq_one_letter_code_can. It would be a lot more challenging and ambiguous, though. _pdbx_struct_mod_residue gives you an exact identifier for the modified residue. _entity_poly gives you a sequence for the biological polymer, which might or might not be easy to match up to the atom data. Assuming it's present, which the documentation doesn't guarantee.
It looks like you could potentially get the same information from _entity_poly.pdbx_seq_one_letter_code_can
From the documentation of _entity_poly.pdbx_seq_one_letter_code_can, it's the canonical sequence "...corresponding to the sequence in _entity_poly.pdbx_seq_one_letter_code. Non-standard amino acids/nucleotides are represented by the codes of their parents if parent is specified in...". A straightforward implementation is to build a dict by matching the two sequences position by position and then read off the corresponding modified:parent pairs.
Right, but that sequence won't necessarily (i.e. usually won't) match the sequence of residues found in the atom_site records. It also isn't guaranteed to match the sequence found in entity_poly_seq. So you have all the usual problems of trying to match up sequences.
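The dict-matching idea proposed above can be sketched in a few lines. This is only an illustration, not pdbfixer code: the function name modified_to_parent is hypothetical, and it assumes that non-standard residues appear in pdbx_seq_one_letter_code as parenthesized component IDs such as (MSE) and that each one corresponds to exactly one letter in the canonical sequence. That second assumption may not hold for residues derived from more than one parent, in which case the sketch raises an error. It only compares the two _entity_poly strings and does nothing about matching them to atom_site.

```python
import re

# One-letter to three-letter codes for the 20 standard amino acids.
ONE_TO_THREE = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}

# A token is either a parenthesized component ID or a single letter.
_TOKEN = re.compile(r"\(([^()]+)\)|([A-Za-z])")


def modified_to_parent(seq, seq_can):
    """Return {modified residue name: parent residue name} (hypothetical helper).

    seq     -- value of _entity_poly.pdbx_seq_one_letter_code
    seq_can -- value of _entity_poly.pdbx_seq_one_letter_code_can
    """
    # Multi-line CIF values contain line breaks; drop all whitespace.
    seq = "".join(seq.split())
    seq_can = "".join(seq_can.split())

    tokens = [m.group(1) or m.group(2) for m in _TOKEN.finditer(seq)]
    # Assumption: one canonical letter per token.
    if len(tokens) != len(seq_can):
        raise ValueError("sequences do not align one-to-one")

    pairs = {}
    for token, parent in zip(tokens, seq_can):
        if len(token) == 1:
            continue  # standard residue, nothing to record
        parent_name = ONE_TO_THREE.get(parent.upper())
        if parent_name is not None:  # skip unknown parents such as 'X'
            pairs[token] = parent_name
    return pairs
```

For example, modified_to_parent("A(MSE)GK", "AMGK") pairs MSE with MET, while a non-standard residue whose canonical letter is X is left out of the result.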
I am thinking about hooking up my imperfect script for finding the parent residue of a nonstandard one to pdbfixer. From my reading, fixer.nonstandardResidues should be the place where I can add, delete, or change the default residue replacements. fixer.nonstandardResidues only contains residues found in the MODRES (or _pdbx_struct_mod_residue) records and in the substitutions dict. Each element is a tuple of (Residue, Name), where Residue is an object defined in OpenMM, which I think is verbose to construct manually, and Name is the 3-letter name of a standard amino acid, which is certainly easy to manipulate. So there are two solutions: either changing the substitutions dict, or constructing the Residue object manually for the failed residues using try/except. I think the former is much easier. We can just expose the substitutions dict in the __init__.py file or move it to the PDBFixer class.
What do you think?
The Residue comes from the Topology. That is, it should be a member of fixer.topology.residues(). Constructing a new Residue object would not work correctly. You have a Topology that has been loaded from a file, and you specify which of the Residues in that Topology should be replaced.
If I understand your reply correctly, you also agree we should not construct a Residue object manually. So the 'only' solution, if we have a non-standard residue that's not in the MODRES record in the PDB file but still want to replace it with a desired standard one in Python, is to change the substitutions dict directly. What I proposed is to expose substitutions to the user so that we don't need to change the pdbfixer.py code every time there is a new one. It can be done either by adding it to __init__.py or by moving it to the 'PDBFixer' class.
There are a couple of supported ways of substituting residues. One is to modify nonstandardResidues. Here is what the manual says about it:
findNonstandardResidues() stores the results into the nonstandardResidues field, which is a list. Each entry is a tuple whose first element is a Residue object and whose second element is the name of the suggested replacement residue. Before calling replaceNonstandardResidues() you can modify the contents of this list. For example, you can remove an entry to prevent a particular residue from being replaced, or you can change what it will be replaced with. You can even add new entries if you want to mutate other residues.
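As a rough sketch of what the manual describes, new entries can be appended using Residue objects taken from the existing Topology rather than constructed by hand. The helper name add_replacements and the extra mapping are assumptions made for illustration and are not part of pdbfixer. The commented usage relies only on the calls named in this thread, and the file name in it is hypothetical.

```python
def add_replacements(residues, nonstandard, extra):
    """Append (residue, replacement) entries for residues named in `extra`.

    residues    -- iterable of Residue objects, e.g. fixer.topology.residues()
    nonstandard -- the list to extend, e.g. fixer.nonstandardResidues
    extra       -- dict of residue name -> standard residue name
                   (hypothetical user-supplied mapping, e.g. {"5OW": "LYS"})
    """
    # Track residues that already have an entry so they are not added twice.
    already_listed = {id(res) for res, _ in nonstandard}
    for res in residues:
        if res.name in extra and id(res) not in already_listed:
            nonstandard.append((res, extra[res.name]))
    return nonstandard


# Intended use with pdbfixer (not executed here; file name is hypothetical):
#
#   from pdbfixer import PDBFixer
#   fixer = PDBFixer(filename="assembly.cif")
#   fixer.findNonstandardResidues()
#   add_replacements(fixer.topology.residues(),
#                    fixer.nonstandardResidues,
#                    {"5OW": "LYS"})
#   fixer.replaceNonstandardResidues()
```

Because the residues come straight from fixer.topology.residues(), nothing has to be built manually, and the name mapping lives in user code instead of pdbfixer.py.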
Another way of doing it is to call applyMutations():
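A minimal sketch of that route, assuming applyMutations() takes a list of strings of the form OLD-number-NEW plus a chain ID. The helper mutation_strings is hypothetical and only builds those strings from the residues already present in a chain, so no residue numbers have to be typed by hand.

```python
def mutation_strings(chain, replacements):
    """Build 'OLD-number-NEW' strings for residues named in `replacements`.

    chain        -- an object with a residues() method, e.g. an OpenMM Chain
    replacements -- dict of residue name -> standard residue name
                    (hypothetical user-supplied mapping, e.g. {"5OW": "LYS"})
    """
    return [
        f"{res.name}-{res.id}-{replacements[res.name]}"
        for res in chain.residues()
        if res.name in replacements
    ]


# Intended use with pdbfixer (not executed here):
#
#   for chain in list(fixer.topology.chains()):
#       mutations = mutation_strings(chain, {"5OW": "LYS"})
#       if mutations:
#           fixer.applyMutations(mutations, chain.id)
```

Whether this behaves well for bio-assembly files with repeated chain IDs has not been checked here, so treat the loop as a starting point.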
https://www.rcsb.org/ligand/5OW is based on LYS and appears in PDB 5EIG.