openmm / pdbfixer

PDBFixer fixes problems in PDB files

Add option to retain residue numbering and chain ids; should be default #54

Closed jchodera closed 9 years ago

jchodera commented 10 years ago

Currently, pdbfixer spits out PDB files renumbered to start from 1 and chain id A. This is probably not what we want, since pdbfixer is just supposed to make it possible to simulate the system while retaining as much information as the user desires, rather than changing information like residue numbers and chain ids.

aparente commented 10 years ago

Another problem I've had with a couple different pdbs. A lot of the structures I've encountered are missing the initiator methionine so they start at residue 2.

For example, it renumbered the residues of another B subunit structure (1EI1 http://www.rcsb.org/pdb/explore/explore.do?structureId=1ei1) but left the chain IDs alone.

PDBFixer definitely renumbers my residues but won't always change my chain IDs. That may just be due to the way they were ordered in the PDB file. I'd have to go back and check.

For example, I'm building an A2B2 heterotetramer (2 A and 2 B subunits), where the structure is composed of two co-crystallographic AB fragments (PDB: 4PLB http://www.rcsb.org/pdb/explore/explore.do?structureId=4PLB) that are aligned to a cryo-EM map. This contains 4 chains, where they are named:

A1 = A B1 = B A2 = C B2 = D

It re-ordered the residues to start at 1 and switched the A and B chains, but left C and D alone. Didn't realize it until I tried using it to model a missing loop between this structure and its CTD fragment (PDB 3LV6) and it tried telling me the seqres files didn't match the coordinates. This is a pretty modified pdb so I'm not sure it'd make the best test case, but can send it if you like.


Angelica C. Parente PhD Candidate, Biophysics Program Bryant and Pande Labs Stanford University http://pande.stanford.edu http://web.stanford.edu/group/bryant/

jchodera commented 10 years ago

Thanks for this problem case example, @aparente!

I think the proper behavior should be to retain the original chain names and residue numbering unless API methods such as renumberResidues() or reorderChains() (or equivalent GUI options) are used.

@peastman: Would you agree? If so, I can try to work out a PR. We need this capability ASAP for a project.

jchodera commented 10 years ago

Also, @peastman, I'd like to augment the Topology object with additional metadata for chain names, since this seems to be the main impediment to making this PR easy right now.

peastman commented 10 years ago

Sounds fine on both counts.

jchodera commented 10 years ago

Thanks!

aparente commented 10 years ago

Confirmed that PDBFixer re-orders chains based on the order in which the chains are originally listed.

jchodera commented 10 years ago

@aparente : Do you mean that the original chain order is preserved (even if non-alphabetical) when calling reorderChains()?

aparente commented 10 years ago

This was using the browser-based app. I was just using it to replace missing heavy atoms and wasn't using the API. The original order of chains was "B A C D" and it renamed my A and B chains such that the order was "A B C D". So it keeps the order in which the chains are listed in the PDB file, but renames them so they are in alphabetical order. I was just confirming something I pointed out in my original comment.
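The renaming described here can be reproduced with a few lines of Python. This is only a sketch of the observed behavior, assuming the writer assigns letters purely by position; `rename_chains_by_position` is a hypothetical helper, not a PDBFixer or OpenMM function.

```python
import string

def rename_chains_by_position(chain_ids):
    """Assign 'A', 'B', 'C', ... by position, ignoring the input chain IDs."""
    if len(chain_ids) > len(string.ascii_uppercase):
        raise ValueError("more chains than single-letter IDs available")
    return [string.ascii_uppercase[i] for i in range(len(chain_ids))]

# Chain order as listed in the input file from the example above.
original = ["B", "A", "C", "D"]
renamed = rename_chains_by_position(original)
print(list(zip(original, renamed)))
# [('B', 'A'), ('A', 'B'), ('C', 'C'), ('D', 'D')]
```

The file order is kept, but the original B chain comes out labelled A and vice versa, while C and D appear untouched only because they already sat in positions three and four.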

jchodera commented 10 years ago

Thanks for confirming, @aparente! I'll take a stab at fixing this, but it will probably have to wait until after @peastman's OpenMM 6.1 release feature freeze to get merged in.

jchodera commented 9 years ago

This would be a good feature to include in a first release.

jchodera commented 9 years ago

Did @peastman already make changes to have PDBFixer retain residue numbering? I'm noting issues with this...

peastman commented 9 years ago

No, I haven't done this.

jchodera commented 9 years ago

What would it take to do this?

jchodera commented 9 years ago

Any thoughts here? We could really use this.

peastman commented 9 years ago

This is more challenging than it seems at first. Somehow we need to guarantee that we're outputting valid, consistent identifiers. Atom and residue ids need to be numeric, positive, in order, no more than 4 (for residues) or 5 (for atoms) digits, and unique. Chain ids need to be single upper case letters, and again must be unique. Usually the initial ids you load from the input file will satisfy that, but as soon as you start doing any editing of the topology it becomes a lot harder. And even the input ids won't necessarily be valid. For example, if you load a PDBx file, the chain ids will be arbitrary strings. (I know, I haven't implemented PDBx input yet, but it's one of the next things on my list.)

Right now the way it handles this is for the PDB writer to just ignore existing ids and generate new, guaranteed valid ones while writing the output file.
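The validity rules listed above could be checked before deciding whether the original ids can be written out unchanged. The following is a minimal sketch covering chain and residue ids only; `find_id_problems` and its input layout are hypothetical, and it assumes residue ids only need to be unique and increasing within a chain.

```python
import string

def find_id_problems(chain_ids, residue_ids_by_chain):
    """Return a list of problems; an empty list means the ids pass the checks."""
    problems = []
    # Chain ids: single upper-case letters, unique.
    seen = set()
    for cid in chain_ids:
        if not (len(cid) == 1 and cid in string.ascii_uppercase):
            problems.append(f"chain id {cid!r} is not a single upper-case letter")
        if cid in seen:
            problems.append(f"chain id {cid!r} is duplicated")
        seen.add(cid)
    # Residue ids: positive integers, at most 4 digits, strictly increasing
    # (strictly increasing also implies unique within the chain).
    for cid, res_ids in residue_ids_by_chain.items():
        previous = 0
        for rid in res_ids:
            if not isinstance(rid, int) or rid <= 0:
                problems.append(f"residue id {rid!r} in chain {cid!r} is not a positive integer")
                continue
            if rid > 9999:
                problems.append(f"residue id {rid} in chain {cid!r} has more than 4 digits")
            if rid <= previous:
                problems.append(f"residue id {rid} in chain {cid!r} is out of order or duplicated")
            previous = rid
    return problems

# A structure that starts at residue 2 still passes.
print(find_id_problems(["A", "B"], {"A": [2, 3, 4], "B": [1, 2]}))  # []
# An arbitrary-string chain id, a repeated chain id, and a misordered residue do not.
print(len(find_id_problems(["A", "chainX", "A"], {"A": [5, 3]})))  # 3
```

A writer could fall back to regenerating ids only when this list is non-empty, which is one possible way to keep the original numbering in the common case.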

jchodera commented 9 years ago

> Right now the way it handles this is for the PDB writer to just ignore existing ids and generate new, guaranteed valid ones while writing the output file.

This comes at the price of a huge amount of information loss and potentially making the output unable to be analyzed in a useful way.

I agree that there are challenges, but it seems like adding a mode that does the following would be at least conceptually straightforward:

Our immediate use case is that we want to generate a bunch of mutations starting from the same structure, but the initial PDB file (from the RCSB) is missing a lot of residues. If we use PDBFixer to model the mutations and build in the missing residues for each mutation, each instance ends up with the loops modeled in a different random way. If we instead first build in the loops, we destroy the information about residue numbering and would have to figure out how things got re-mapped to make the mutations. In either case, the resulting file becomes much harder to analyze because the canonical numbering is lost.
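To illustrate the remapping burden, here is a sketch of the bookkeeping needed once loops are built and the output is renumbered from 1. It assumes a single chain and that the original number of each output residue was somehow recorded, with `None` marking modeled residues; `map_original_to_output` and the example data are hypothetical.

```python
def map_original_to_output(original_numbers):
    """Map original residue numbers to their 1-based position in the output.

    original_numbers lists, in output order, the original number of each
    residue, or None for residues that were modeled in.
    """
    mapping = {}
    for new_number, original in enumerate(original_numbers, start=1):
        if original is not None:
            mapping[original] = new_number
    return mapping

# Hypothetical chain: residues 2-4 and 8-9 were observed, 5-7 were built as a loop.
tracked = [2, 3, 4, None, None, None, 8, 9]
mapping = map_original_to_output(tracked)
print(mapping[8])  # 7
```

A mutation specified at canonical residue 8 would have to be applied at output residue 7 here, and every downstream analysis would need the same lookup.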