openmm / pdbfixer

PDBFixer fixes problems in PDB files

Add nonstandard residue '5OW' to LYS #258

Closed Ruibin-Liu closed 1 year ago

Ruibin-Liu commented 1 year ago

https://www.rcsb.org/ligand/5OW is based on LYS and appears in PDB 5EIG.

peastman commented 1 year ago

Thanks! It should have been detecting this substitution already based on the MODRES record. Was that not working? If so, we should figure out why.

Ruibin-Liu commented 1 year ago

> Thanks! It should have been detecting this substitution already based on the MODRES record. Was that not working? If so, we should figure out why.

The reason might be that I was using the pdbx format (.cif file) of its bio-assembly structure. I checked the code for reading the modified residue information: https://github.com/openmm/pdbfixer/blob/92f73cccf6871e406e3c18bf2416afd7e7638951/pdbfixer/pdbfixer.py#L336

That block appears in a raw pdbx file but not in the bio-assembly pdbx file. For the latter, it seems the only possible way to match non-standard residues to standard ones is to look up the '_entity_poly' block. The good news is that '_entity_poly' also appears in the raw pdbx files. So could we change the algorithm to read the '_entity_poly' block instead?

Since I am using the bio-assembly files for many PDB structures that contain modified residues, I, for now, have to add them manually if they are not in the list already.

Ruibin-Liu commented 1 year ago

Also, the 'MODRES' record is not in the PDB format bio-assembly file, and I don't see a way to recover that information from such a file.

peastman commented 1 year ago

That's annoying. I wonder why they don't include the information?

Anyway, PDBx/mmCIF is a problematic file format. (PDB is too, just with different problems.) It's redundant, with multiple places the same information can be stored. And the documentation tends to be ambiguous and sometimes contradictory about where information should be stored.

I implemented it based on the PDB to PDBx/mmCIF Data Item Correspondences documentation, which says you should use _pdbx_struct_mod_residue as the equivalent for MODRES. It looks like you could potentially get the same information from _entity_poly.pdbx_seq_one_letter_code_can. It would be a lot more challenging and ambiguous, though. _pdbx_struct_mod_residue gives you an exact identifier for the modified residue. _entity_poly gives you a sequence for the biological polymer, which might or might not be easy to match up to the atom data. Assuming it's present, which the documentation doesn't guarantee.

Ruibin-Liu commented 1 year ago

> It looks like you could potentially get the same information from _entity_poly.pdbx_seq_one_letter_code_can

From the documentation of _entity_poly.pdbx_seq_one_letter_code_can, it's the canonical sequence "...corresponding to the sequence in _entity_poly.pdbx_seq_one_letter_code. Non-standard amino acids/nucleotides are represented by the codes of their parents if parent is specified in...". A straightforward implementation is to pair the two sequences position by position and build a dict of the resulting modified:parent pairs.
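
The position-by-position pairing described above could be sketched as follows. This is a minimal illustration, not PDBFixer code: the function name parent_residues and the toy sequences are made up, and it assumes that pdbx_seq_one_letter_code writes nonstandard residues as parenthesized component IDs such as (5OW) and that the canonical sequence has exactly one letter per residue. It only pairs the two _entity_poly strings and does not try to match them to the atom_site records.

```python
import re

# One-letter to three-letter names for the 20 standard amino acids.
ONE_TO_THREE = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}


def parent_residues(seq, seq_can):
    """Return {nonstandard name: parent name} from the two _entity_poly strings.

    seq     -- value of pdbx_seq_one_letter_code, e.g. "MA(5OW)G" (assumed format)
    seq_can -- value of pdbx_seq_one_letter_code_can, e.g. "MAKG"
    """
    # Line breaks inside multi-line mmCIF values are layout only.
    seq = "".join(seq.split())
    seq_can = "".join(seq_can.split())

    # "(5OW)" becomes one token; every other character is its own token.
    tokens = re.findall(r"\(([^)]+)\)|(.)", seq)
    if len(tokens) != len(seq_can):
        # Happens if one component maps to several canonical letters.
        raise ValueError("sequences cannot be paired position by position")

    parents = {}
    for (nonstandard, _), canonical in zip(tokens, seq_can):
        # Skip standard residues and parents that are unknown (e.g. "X").
        if nonstandard and canonical in ONE_TO_THREE:
            parents[nonstandard] = ONE_TO_THREE[canonical]
    return parents


# Toy sequences for illustration only, not taken from any PDB entry.
print(parent_residues("MA(5OW)G", "MAKG"))
```

The length check makes the sketch fail loudly instead of guessing when the two strings do not line up.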

peastman commented 1 year ago

Right, but that sequence won't necessarily (i.e. usually won't) match the sequence of residues found in the atom_site records. It also isn't guaranteed to match the sequence found in entity_poly_seq. So you have all the usual problems of trying to match up sequences.

Ruibin-Liu commented 1 year ago

> Right, but that sequence won't necessarily (i.e. usually won't) match the sequence of residues found in the atom_site records. It also isn't guaranteed to match the sequence found in entity_poly_seq. So you have all the usual problems of trying to match up sequences.

I am thinking about hooking up my imperfect script for finding the parent residue of a nonstandard one to pdbfixer. From my reading, fixer.nonstandardResidues should be the place where I can add, delete, or change the default residue replacements. fixer.nonstandardResidues only contains residues found through MODRES (or _pdbx_struct_mod_residue) and the substitutions dict. Each element is a tuple of (Residue, Name), where Residue is an object defined in OpenMM, which I think is verbose to construct manually, and Name is the 3-letter name of a standard amino acid, which is certainly easy to manipulate. So there are two solutions: either change the substitutions dict, or construct the Residue object manually for the failed residues and catch the errors with try/except. I think the former is much easier. We can just expose the substitutions dict in the __init__.py file or move it to the PDBFixer class.

What do you think?

peastman commented 1 year ago

The Residue comes from the Topology. That is, it should be a member of fixer.topology.residues(). Constructing a new Residue object would not work correctly. You have a Topology that has been loaded from a file, and you specify which of the Residues in that Topology should be replaced.

Ruibin-Liu commented 1 year ago

If I understand your reply correctly, you also agree that we should not construct a Residue object manually. So the 'only' solution, if we have a non-standard residue that is not in the MODRES record but still want to replace it with a desired standard one in Python, is to change the substitutions dict directly. What I proposed is to expose the substitutions dict to the user so that we don't need to change the pdbfixer.py code every time there is a new one. It can be done either by adding it to __init__.py or by moving it to the 'PDBFixer' class.

peastman commented 1 year ago

There are a couple of supported ways of substituting residues. One is to modify nonstandardResidues. Here is what the manual says about it:

> findNonstandardResidues() stores the results into the nonstandardResidues field, which is a list. Each entry is a tuple whose first element is a Residue object and whose second element is the name of the suggested replacement residue. Before calling replaceNonstandardResidues() you can modify the contents of this list. For example, you can remove an entry to prevent a particular residue from being replaced, or you can change what it will be replaced with. You can even add new entries if you want to mutate other residues.
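
As a rough sketch of that workflow, the helper below appends entries for residues that findNonstandardResidues() did not pick up. The helper name add_nonstandard_replacements, the parents mapping, and the file name in the commented usage are hypothetical; the Residue objects are taken from fixer.topology.residues() rather than constructed by hand.

```python
def add_nonstandard_replacements(fixer, parents):
    """Append (Residue, parent name) entries to fixer.nonstandardResidues.

    fixer   -- a PDBFixer on which findNonstandardResidues() was already called
    parents -- dict of residue name to replacement name, e.g. {"5OW": "LYS"}
    Returns the number of entries added.
    """
    # Track residues already listed (by identity) so they are not added twice.
    listed = {id(residue) for residue, _ in fixer.nonstandardResidues}
    added = 0
    for residue in fixer.topology.residues():
        if residue.name in parents and id(residue) not in listed:
            fixer.nonstandardResidues.append((residue, parents[residue.name]))
            added += 1
    return added


# Possible usage, assuming pdbfixer is installed and "assembly.cif" exists:
#
#   from pdbfixer import PDBFixer
#   fixer = PDBFixer(filename="assembly.cif")
#   fixer.findNonstandardResidues()
#   add_nonstandard_replacements(fixer, {"5OW": "LYS"})
#   fixer.replaceNonstandardResidues()
```

Keeping the mapping in user code means pdbfixer.py does not have to change each time a new modified residue turns up.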

Another way of doing it is to call applyMutations():

https://github.com/openmm/pdbfixer/blob/db2886903fe835919695c465fd20a9ae3b2a03cd/pdbfixer/pdbfixer.py#L729-L761
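
A sketch of how the second route might be driven from the same kind of mapping. It assumes applyMutations(mutations, chain_id) takes strings of the form OLD-NUMBER-NEW (for example ALA-133-GLY) for one chain at a time, with NUMBER being the residue number from the file, and that a nonstandard name is accepted as the OLD part; check both against the linked source before relying on it. The helper name mutation_strings is made up.

```python
from collections import defaultdict


def mutation_strings(topology, parents):
    """Group 'OLD-NUMBER-NEW' strings by chain ID for residues named in parents.

    topology -- an OpenMM Topology, e.g. fixer.topology
    parents  -- dict of residue name to replacement name, e.g. {"5OW": "LYS"}
    """
    by_chain = defaultdict(list)
    for chain in topology.chains():
        for residue in chain.residues():
            if residue.name in parents:
                # residue.id is the residue number as written in the file.
                by_chain[chain.id].append(
                    f"{residue.name}-{residue.id}-{parents[residue.name]}"
                )
    return dict(by_chain)


# Possible usage, assuming `fixer` is a PDBFixer loaded from a structure file:
#
#   for chain_id, mutations in mutation_strings(fixer.topology, {"5OW": "LYS"}).items():
#       fixer.applyMutations(mutations, chain_id)
```

Bio-assembly files may repeat chain IDs across copies, in which case grouping by chain ID alone would not be enough and the nonstandardResidues route is likely the safer one.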