openmm / pdbfixer

PDBFixer fixes problems in PDB files

What to do about residues without computer-readable molecular definitions? #31

Open jchodera opened 10 years ago

jchodera commented 10 years ago

I ran into the case of 1AO9, which contains a DOP residue that is not resolved in the ATOM records. This residue does not appear in a machine-readable form in the file or elsewhere at the RCSB, it seems, though I have emailed the RCSB to confirm this.

Many of the HETATM ligands appear in the nicely curated Ligand Expo, where even SDF files can be downloaded, but no such resource appears to exist for protein residues.

The only clue in 1AO9 is the COMPND header:

COMPND   3 D(*GP*AP*GP*AP*GP*AP*DOP*TP*CP*TP*CP*TP*C)-3');                      
COMPND   4 CHAIN: A;                                                            
COMPND   5 ENGINEERED: YES;                                                     
COMPND   6 OTHER_DETAILS: DI-(OCTYLPHOSPHATE) LINKER BETWEEN PURINE             
COMPND   7 AND PYRIMIDINE STRANDS                                               

which states that this DOP residue is a "di-(octylphosphate) linker between purine and pyrimidine strands", but this would be immensely difficult for a machine to parse.

I guess this means we simply cannot hope to treat these residues in any sensible way. But what should the default behavior be here? Simply omit them, causing a chain break? Make a random substitution?
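Whatever the default ends up being, the first step is detecting such residues. A minimal sketch, assuming the file carries SEQRES records: list every residue that is declared in SEQRES, is not in a set of standard names, and never shows up in the ATOM/HETATM records. The function name, the abbreviated `STANDARD_RESIDUES` set, and the sample lines are all hypothetical; the sample is hand-written to mirror the sequence in the COMPND header above and is not copied from 1AO9.

```python
# Abbreviated set of standard residue names (assumption: a real tool would
# use a fuller list, including legacy nucleotide names).
STANDARD_RESIDUES = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
    "DA", "DC", "DG", "DT", "A", "C", "G", "U",
}


def unresolved_nonstandard(pdb_lines, standard=STANDARD_RESIDUES):
    """Return sorted (chain, resname) pairs that are declared in SEQRES,
    are not standard residues, and have no coordinates in the file."""
    declared = []
    observed = set()
    for line in pdb_lines:
        record = line[:6].strip()
        if record == "SEQRES":
            # Whitespace splitting assumes a non-blank chain ID.
            fields = line.split()
            chain = fields[2]
            declared.extend((chain, name) for name in fields[4:])
        elif record in ("ATOM", "HETATM"):
            # Fixed columns: residue name in 18-20, chain ID in 22.
            observed.add((line[21], line[17:20].strip()))
    return sorted(
        {pair for pair in declared
         if pair[1] not in standard and pair not in observed}
    )


# Hand-written sample records (illustrative only, not real 1AO9 data).
_ATOM = "ATOM  %5d %-4s %3s %s%4d"
sample = [
    "SEQRES   1 A   13   DG  DA  DG  DA  DG  DA DOP  DT  DC  DT  DC  DT  DC",
    _ATOM % (1, " P", "DG", "A", 1),
    _ATOM % (2, " P", "DA", "A", 2),
]
print(unresolved_nonstandard(sample))
```

This only flags the residues; it does not decide between omitting them and substituting something else.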

peastman commented 10 years ago

The full chemical component dictionary can be found at ftp://ftp.wwpdb.org/pub/pdb/data/monomers. (Be prepared for that to bring your web browser to a crawl!) DOP (ftp://ftp.wwpdb.org/pub/pdb/data/monomers/DOP) is dioctylphosphate.

As for how to handle it, I really don't know. I'm not sure there's any universal rule.

jchodera commented 10 years ago

Here's what the PDB said:

Begin forwarded message:

From: Rachel Kramer Green <kramer@rcsb.rutgers.edu>
Subject: Re: Data-related : Master index of chemical definitions of residue names? (help-5469)
Date: April 14, 2014 at 8:50:44 AM EDT
To: <info@rcsb.org>, <choderaj@mskcc.org>
Reply-To: info <info@rcsb.org>

Thank you for your email message.

Please take a look at the Chemical Component Dictionary at:
http://www.wwpdb.org/ccd.html

These residues have not been experimentally determined (they do not appear in the coordinates) and thus it is only by the authors' statement that they appear in the file, and are thus not defined further within the file.

You may also find the following article on missing residues to be helpful:
http://www.rcsb.org/pdb/101/static101.do?p=education_discussion/Looking-at-Structures/missing.html

Sincerely,
Rachel Green

Rachel Kramer Green, Ph.D.
RCSB PDB
kramer@rcsb.rutgers.edu

jchodera commented 10 years ago

> As for how to handle it, I really don't know. I'm not sure there's any universal rule.

I think we want a few main modes for pdbfixer:

In general mode, we can ask it to:

In the forcefield-aware mode, we can ask it to:

For unusual residues or ligands, we can fetch their definitions from the components library (remote or local) and build a minimal forcefield based on that.

peastman commented 10 years ago

One issue that could be challenging to deal with: for each of the standard residues, we have a template giving a reasonable starting structure for it. We won't have that for nonstandard ones. In principle you can work one out from the force field, but it will be nontrivial.

jchodera commented 10 years ago

> One issue that could be challenging to deal with: for each of the standard residues, we have a template giving a reasonable starting structure for it. We won't have that for nonstandard ones.

But the components.cif file (the Chemical Component Dictionary mentioned above) does have a template for a reasonable starting structure for every residue appearing in the PDB.
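As a rough illustration of how little is needed to get at that template, here is a sketch that reads the `_chem_comp_atom` loop of a single component and returns the ideal coordinates per atom. The function name is hypothetical, and the component `XXX` with its coordinates is made up for the example. It assumes the loop form of the table and numeric values in the ideal-coordinate columns; components stored in key-value form or with `?` placeholders would need extra handling.

```python
import shlex


def ideal_coordinates(cif_lines):
    """Return {atom_id: (x, y, z)} from the _chem_comp_atom loop of one
    chemical component, using the pdbx_model_Cartn_*_ideal columns."""
    prefix = "_chem_comp_atom."
    columns = []
    coords = {}
    for raw in cif_lines:
        line = raw.strip()
        if line.startswith(prefix):
            # Header line of the loop: remember the column name.
            columns.append(line[len(prefix):])
        elif columns:
            # Past the header: data rows run until the table ends.
            if (not line or line[0] in "#_"
                    or line.startswith(("loop_", "data_"))):
                break
            # shlex handles quoted atom names such as "O5'".
            row = dict(zip(columns, shlex.split(line)))
            coords[row["atom_id"]] = tuple(
                float(row["pdbx_model_Cartn_%s_ideal" % axis])
                for axis in "xyz"
            )
    return coords


# Made-up component for illustration; not taken from components.cif.
sample_cif = """data_XXX
#
loop_
_chem_comp_atom.comp_id
_chem_comp_atom.atom_id
_chem_comp_atom.type_symbol
_chem_comp_atom.pdbx_model_Cartn_x_ideal
_chem_comp_atom.pdbx_model_Cartn_y_ideal
_chem_comp_atom.pdbx_model_Cartn_z_ideal
XXX P     P 0.000 0.000 0.000
XXX "O5'" O 1.500 0.000 0.000
#
"""
template = ideal_coordinates(sample_cif.splitlines())
print(template)
```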

jchodera commented 10 years ago

When compressed, components.cif is 36 MB, but we would only need the coordinates and a bit of metadata for each residue, so we could conceivably ship a whole copy of this along with pdbfixer. Alternatively, there may be a way to grab just the residues one needs from the PDB in a just-in-time manner.
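If we did ship the compressed file, pulling out one residue would not require loading all of it. A minimal sketch, assuming each component starts with its own `data_<ID>` line: stream the file and keep only the requested block. Both function names and the `components.cif.gz` path are assumptions for illustration.

```python
import gzip


def extract_component(lines, comp_id):
    """Return the lines of the data_<comp_id> block, or [] if absent."""
    wanted = "data_" + comp_id
    block = []
    for line in lines:
        if line.startswith("data_"):
            if block:
                break  # next component begins: the wanted block is complete
            if line.strip() == wanted:
                block.append(line)
        elif block:
            block.append(line)
    return block


def load_component(path, comp_id):
    """Stream a gzip-compressed dictionary file and extract one component."""
    with gzip.open(path, "rt") as handle:
        return extract_component(handle, comp_id)


# In-memory stand-in for the file contents (illustrative only).
demo = ["data_AAA\n", "#\n", "data_DOP\n", "_chem_comp.id DOP\n", "#\n",
        "data_ZZZ\n", "#\n"]
print(extract_component(demo, "DOP"))
```

With a local copy, `load_component("components.cif.gz", "DOP")` would return the same kind of list, stopping as soon as the block after DOP begins.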

There aren't any Python mmCIF reader libraries that I know about, but the data we would need from this file is minimal, and could probably be easily transformed into a usable form.