Closed jchodera closed 9 years ago
Another problem I've had with a couple different pdbs. A lot of the structures I've encountered are missing the initiator methionine so they start at residue 2.
For example: It re-ordered the residue numbers for another B subunit structure (1EI1 http://www.rcsb.org/pdb/explore/explore.do?structureId=1ei1) but left the chain IDs alone.
PDBfixer definitely re-numbers my residues but won't always change my chain IDs. That may just be due to the way they were ordered in the PDB. I'd have to go back and check.
For example, I'm building an A2B2 heterotetramer (2 A and 2 B subunits), where the structure is composed of two co-crystallographic AB fragments (PDB: 4PLB http://www.rcsb.org/pdb/explore/explore.do?structureId=4PLB) that are aligned to a cryo-EM map. This contains 4 chains, where they are named:
A1 = A, B1 = B, A2 = C, B2 = D
It re-ordered the residues to start at 1 and switched the A and B chains, but left C and D alone. Didn't realize it until I tried using it to model a missing loop between this structure and its CTD fragment (PDB 3LV6) and it tried telling me the seqres files didn't match the coordinates. This is a pretty modified pdb so I'm not sure it'd make the best test case, but can send it if you like.
On 9 Aug 2014, at 19:20, John Chodera wrote:
Currently, pdbfixer spits out PDB files renumbered to start from 1 and chain id A. This is probably not what we want, since pdbfixer is just supposed to make it possible to simulate the system while retaining as much information as the user desires, rather than changing information like residue numbers and chain ids.
Reply to this email directly or view it on GitHub: https://github.com/SimTk/pdbfixer/issues/54
Angelica C. Parente PhD Candidate, Biophysics Program Bryant and Pande Labs Stanford University http://pande.stanford.edu http://web.stanford.edu/group/bryant/
Thanks for this problem case example, @aparente!
I think the proper behavior should be to retain the original chain names and residue numberings, unless an API method renumberResidues() and reorderChains() (or equivalent GUI options) are used.
@peastman: Would you agree? If so, I can try to work out a PR. We need this capability ASAP for a project.
Also, @peastman, I'd like to augment the Topology object with additional metadata for chain names, since this seems to be the main impediment to making this PR easy right now.
Sounds fine on both counts.
Thanks!
Confirmed that PDBfixer re-orders chains based on what order the chains are originally listed.
@aparente: Do you mean that the original chain order is preserved (even if non-alphabetical) when calling reorderChains()?
This was using the browser-based app. I was just using it to replace missing heavy atoms and wasn't using the API. The original order of chains was "B A C D" and it renamed my A and B chains such that the order was "A B C D". So it keeps the order in which the chains are listed in the PDB, but renames them so they are in alphabetical order. I was just confirming something I pointed out in my original comment.
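To make the reported behavior concrete, here is a minimal sketch of a writer that ignores the input chain ids and hands out new letters in file order. This is not PDBFixer's actual code; the function name relabel_chains is hypothetical, and the sketch assumes at most 26 chains.

```python
import string

def relabel_chains(chain_ids_in_file_order):
    """Map each original chain id to a new letter assigned by position.

    Hypothetical illustration only: the first chain listed becomes 'A',
    the second 'B', and so on, regardless of what it was called before.
    """
    return {
        old_id: string.ascii_uppercase[index]
        for index, old_id in enumerate(chain_ids_in_file_order)
    }

# Chains listed as B, A, C, D in the input file.
mapping = relabel_chains(["B", "A", "C", "D"])
print(mapping)
```

With the chains listed as "B A C D", the original B becomes A and the original A becomes B, while C and D keep their names, which matches what was observed for the 4PLB-based model.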
Thanks for confirming, @aparente! I'll take a stab at fixing this, but it will probably have to wait until after @peastman's OpenMM 6.1 release feature freeze to get merged in.
This would be a good feature to include in a first release.
Did @peastman already make changes to have PDBFixer retain residue numbering? I'm noting issues with this...
No, I haven't done this.
What would it take to do this?
Any thoughts here? We could really use this.
This is more challenging than it seems at first. Somehow we need to guarantee that we're outputting valid, consistent identifiers. Atom and residue ids need to be numeric, positive, in order, no more than 4 (for residues) or 5 (for atoms) digits, and unique. Chain ids need to be single upper case letters, and again must be unique. Usually the initial ids you load from the input file will satisfy that, but as soon as you start doing any editing of the topology it becomes a lot harder. And even the input ids won't necessarily be valid. For example, if you load a PDBx file, the chain ids will be arbitrary strings. (I know, I haven't implemented PDBx input yet, but it's one of the next things on my list.)
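As a rough illustration, the constraints listed above could be checked before deciding whether the original ids are safe to write out. This is only a sketch: validate_ids and its argument layout are hypothetical, it encodes the rules exactly as stated in this comment rather than the full PDB format specification, and it leaves out the analogous check for atom ids.

```python
import string

def validate_ids(chain_ids, residue_ids_by_chain):
    """Return a list of problems; an empty list means the ids pass.

    chain_ids: chain id strings in file order.
    residue_ids_by_chain: dict mapping chain id -> residue ids in file order.
    """
    problems = []
    seen_chains = set()
    for chain_id in chain_ids:
        if len(chain_id) != 1 or chain_id not in string.ascii_uppercase:
            problems.append(f"chain id {chain_id!r} is not a single upper case letter")
        if chain_id in seen_chains:
            problems.append(f"chain id {chain_id!r} is not unique")
        seen_chains.add(chain_id)

    for chain_id, residue_ids in residue_ids_by_chain.items():
        previous = None
        for residue_id in residue_ids:
            try:
                number = int(residue_id)
            except (TypeError, ValueError):
                problems.append(f"residue id {residue_id!r} in chain {chain_id} is not numeric")
                continue
            if number <= 0:
                problems.append(f"residue id {number} in chain {chain_id} is not positive")
            elif number > 9999:
                problems.append(f"residue id {number} in chain {chain_id} exceeds 4 digits")
            # Strictly increasing ids are both ordered and unique.
            if previous is not None and number <= previous:
                problems.append(f"residue id {number} in chain {chain_id} is out of order or repeated")
            previous = number
    return problems

# A structure missing its initiator methionine (chain A starts at 2) still passes.
print(validate_ids(["A", "B"], {"A": [2, 3, 4], "B": [1, 2]}))
```

A writer could keep the original ids when the returned list is empty and fall back to generating new ones otherwise.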
Right now the way it handles this is for the PDB writer to just ignore existing ids and generate new, guaranteed valid ones while writing the output file.
This comes at the price of a huge amount of information loss and potentially making the output unable to be analyzed in a useful way.
I agree that there are challenges, but it seems like adding a mode that does the following would be at least conceptually straightforward:
SEQRES section, number them sequentially from the previous residue (throwing an Exception if this causes something weird to happen), and use the same chain identifier as the previous residue

Our immediate use case is that we want to generate a bunch of mutations starting from the same structure, but the initial PDB file (from the RCSB) is missing a lot of residues. If we use PDBFixer to model the mutations and build in the missing residues for each mutation, each instance ends up with the loops modeled in a different random way. If we instead first build in the loops, we destroy the information about residue numbering and would have to figure out how things got re-mapped to make the mutations. In either case, the resulting file becomes much harder to analyze because the canonical numbering is lost.
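The numbering rule for residues built in from SEQRES could be sketched roughly as follows. This is a hypothetical illustration, not an existing PDBFixer function: assign_ids and its input layout are assumptions, and "something weird" is interpreted here as an inserted residue having no predecessor or reusing an id that already exists in the input.

```python
def assign_ids(residues):
    """Give built-in residues a chain id and number derived from the previous residue.

    residues: list in output order, where each item is (chain_id, number) for a
    residue read from the input file, or None for a residue that was built in.
    Returns a list of (chain_id, number) tuples with original ids unchanged.
    """
    original = {item for item in residues if item is not None}
    result = []
    previous = None
    for item in residues:
        if item is None:
            if previous is None:
                raise ValueError("inserted residue has no preceding residue to number from")
            # Same chain as the previous residue, next number in sequence.
            chain_id, number = previous[0], previous[1] + 1
            if (chain_id, number) in original:
                raise ValueError(f"inserted residue would reuse id {chain_id}{number}")
            item = (chain_id, number)
        result.append(item)
        previous = item
    return result

# Chain B starts at residue 2 and is missing residues 4 and 5.
print(assign_ids([("B", 2), ("B", 3), None, None, ("B", 6)]))
```

This works only when the gap in the original numbering is at least as large as the number of inserted residues, which is the usual case for missing loops in RCSB files. Otherwise the sketch raises instead of silently renumbering.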