haddocking / pdb-tools

A dependency-free cross-platform swiss army knife for PDB files.
https://haddocking.github.io/pdb-tools/
Apache License 2.0

Add chain IDs on missing chains with TER #70

Closed joaomcteixeira closed 3 years ago

joaomcteixeira commented 3 years ago

Dear developers,

A user has requested that sequential chain IDs be added to PDBs that have TER statements delimiting chains but lack chain IDs. My question is in which tool this should be implemented. Initially, I thought of pdb_tidy, but now I think that would grant too much power to tidy. Should a new one be created, pdb_completechains?

@JoaoRodrigues @brianjimenez @amjjbonvin @mtrellet

brianjimenez commented 3 years ago

I favor the option of a different tool; it seems more in line with the tools' philosophy.

mtrellet commented 3 years ago

I agree with you and @brianjimenez: a distinct tool would fit our philosophy better.

JoaoRodrigues commented 3 years ago

I don't understand the question. The user probably means pdb_chain assigns chain IDs to an entire molecule. If so, that's exactly the point. I understand it's a pain when you have no chains but TER records. I see a few different solutions to this problem:

  1. Create a pdb_splitTER to split a file solely by TER records and then let the user fix the chain IDs as they wish with pdb_chain. Note the capital TER not to have a tool with the ambiguous name pdb_splitter.
  2. Create a pdb_chainbow tool (I think I have one somewhere) that assigns chains whenever it finds a TER record. In addition, we could have a pdb_mkter to create TER records based on atom distances (on every chain break).

What would solve most use cases?
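To make option 2 concrete, here is a minimal sketch of assigning a new chain ID after every TER record. The function name `assign_chains_by_ter` is hypothetical and is not an existing pdb-tools API. It assumes well-formed fixed-width records (chain ID in column 22), lines without trailing newlines, and that every TER really does end a chain.

```python
import string


def assign_chains_by_ter(lines, chain_ids=string.ascii_uppercase):
    """Yield PDB lines, giving each TER-delimited segment the next chain ID.

    Hypothetical helper: assumes fixed-width records without newlines.
    """
    ids = iter(chain_ids)
    current = None
    need_new = True  # pick a fresh ID at the next coordinate record
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            if need_new:
                try:
                    current = next(ids)
                except StopIteration:
                    raise ValueError("more segments than chain IDs") from None
                need_new = False
            # column 22 (index 21) holds the chain ID
            yield line[:21] + current + line[22:]
        elif line.startswith("TER"):
            # full-width TER records carry a chain ID too; bare "TER" does not
            if len(line) > 21 and current is not None:
                line = line[:21] + current + line[22:]
            need_new = True
            yield line
        else:
            yield line
```

A new ID is only consumed when a coordinate record actually follows a TER, so a trailing TER does not use one up.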

PS. Emoji explosion :)

amjjbonvin commented 3 years ago

> I don't understand the question. The user probably means pdb_chain assigns chain IDs to an entire molecule. If so, that's exactly the point. I understand it's a pain when you have no chains but TER records. I see a few different solutions to this problem:

> Create a pdb_splitTER to split a file solely by TER records and then let the user fix the chain IDs as they wish with pdb_chain. Note the capital TER not to have a tool with the ambiguous name pdb_splitter.

Hmmm… not convinced about this. If there are TER statements within the same chain, it is for a good reason.

> Create a pdb_chainbow tool (I think I have one somewhere) that assigns chains whenever it finds a TER record. In addition, we could have a pdb_mkter to create TER records based on atom distances (on every chain break).

That's dangerous for the reason above. Not in favour of it.

joaomcteixeira commented 3 years ago

Following up on @amjjbonvin's and @JoaoRodrigues's comments,

I believe I recall seeing PDBs with TER lines on backbone breaks within the same chain - so TERs would separate segments within chains rather than chains (just my memory, I don't recall a specific example). However, using TER to separate segments goes against the PDB specification for TER records.

So, although an automatic chain adder acting on every TER may be dangerous, a pdb_splitTER could be safe, provided the user knows what s/he is doing. I find it useful. The documentation needs to state in CAPS that pdb_splitTER is not pdb_splitchain, so that nobody gets confused.
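As a rough illustration of what splitting solely by TER could look like, here is a minimal sketch. The name `split_by_ter` is hypothetical; the function only groups coordinate records and ignores everything else (headers, END, etc.), and it leaves the chain IDs untouched so the user can fix them afterwards.

```python
def split_by_ter(lines):
    """Group coordinate records into segments, closing one at each TER.

    Hypothetical helper: non-coordinate records are skipped, and atoms
    after the last TER form a final, unterminated segment.
    """
    segments, current = [], []
    for line in lines:
        if line.startswith(("ATOM", "HETATM", "ANISOU")):
            current.append(line)
        elif line.startswith("TER"):
            if current:  # ignore a TER with no atoms before it
                current.append(line)
                segments.append(current)
                current = []
    if current:
        segments.append(current)
    return segments
```

Each returned segment could then be written to its own file, which is where this differs from splitting by chain ID: two segments may well carry the same (or no) chain ID.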

I don't think a tool that adds TERs based on distance restraints is appropriate. All pdb-tools functionalities deal only with formatting issues; adding one that calculates stuff goes slightly off the core design. Also, there is the burden of doing heavy calculations without leaving the STD LIB. But this is me playing on the conservative side. :wink:

addition: I will wait for a consensus before creating a PR for this.

JoaoRodrigues commented 3 years ago

There's no perfect solution here :) We cannot distinguish between a broken chain and two separate molecules. The 'truth' is that TER records should only be present at the end of protein/nucleic acid chains:

  • Every chain of ATOM/HETATM records presented on SEQRES records is terminated with a TER record.
  • The TER records occur in the coordinate section of the entry, and indicate the last residue presented for each polypeptide and/or nucleic acid chain for which there are determined coordinates. For proteins, the residue defined on the TER record is the carboxy-terminal residue; for nucleic acids it is the 3'-terminal residue.

What about a pdb_addter tool that adds TER statements at chain ID changes and, optionally (with a -break option), at gaps as well? This is not "against" the rules, we already do some simple calculations in pdb_gap :-)
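A minimal sketch of that idea, using only the standard library, might look like the following. Everything here is an assumption for illustration: the name `add_ter`, the `at_breaks` parameter standing in for the proposed -break option, the use of consecutive CA-CA distances as the gap criterion, and the 4.0 Å cutoff. The input is assumed to contain only well-formed ATOM records in residue order, without existing TER lines.

```python
import math
from itertools import groupby


def _ca_xyz(residue):
    """Return the (x, y, z) of the residue's CA atom, or None if absent."""
    for line in residue:
        if line[12:16].strip() == "CA":
            return (float(line[30:38]), float(line[38:46]), float(line[46:54]))
    return None


def add_ter(atom_lines, at_breaks=False, max_ca_dist=4.0):
    """Insert bare TER records at chain ID changes and, optionally, at gaps.

    Hypothetical helper: the gap test and cutoff are illustrative choices.
    """
    out = []
    prev_chain, prev_ca = None, None
    # group by chain ID (col 22) plus residue number and insertion code
    def key(line):
        return (line[21], line[22:27])
    for (chain, _), group in groupby(atom_lines, key=key):
        residue = list(group)
        ca = _ca_xyz(residue)
        if prev_chain is not None:
            chain_changed = chain != prev_chain
            gap = (
                at_breaks
                and ca is not None
                and prev_ca is not None
                and math.dist(ca, prev_ca) > max_ca_dist
            )
            if chain_changed or gap:
                out.append("TER")
        out.extend(residue)
        prev_chain, prev_ca = chain, ca
    if out:
        out.append("TER")  # terminate the final chain
    return out
```

Working residue by residue matters here: the TER has to go before the first atom of the residue after the gap, not in front of its CA atom.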

We could then add the pdb_chainbows tool and basically rely on the user to know what they are doing. To be honest, I only found this scenario once in the real world - no chains but TERs - when converting very complex systems for simulations. I would assume the use cases would be similarly rare.

mtrellet commented 3 years ago

It's been a while since I last played with TER records of a PDB file but I do like the idea of @JoaoRodrigues with the pdb_addter and pdb_chainbows tools.

With proper documentation, this should tackle most of the issues we've raised here. And about the potential philosophy break, as it has been said above, we already do something very similar with pdb_gap.

joaomcteixeira commented 3 years ago

Okay, I will try to address this in the following days. :+1:

JoaoRodrigues commented 3 years ago

Looking at this today, I realized that we might just need to add a flag to pdb_tidy to produce proper PDB files (no TER within chains) and then add the pdb_chainbows to rename chains by TER statements. That's the minimal set of changes to have this functionality.
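The "no TER within chains" clean-up could be sketched roughly as below. This is not the actual pdb_tidy behaviour; `drop_intra_chain_ter` is a hypothetical name, and the rule used (drop a TER when the ATOM records on both sides share a chain ID) is an assumption. It only makes sense when chain IDs are present: with blank chain IDs every TER would look internal.

```python
def drop_intra_chain_ter(lines):
    """Remove TER records that sit between ATOM records of the same chain.

    Hypothetical helper: a TER followed by anything other than an ATOM
    record (HETATM, END, ...) is always kept.
    """
    out = []
    pending_ter = None  # TER held back until the next record is seen
    prev_chain = None   # chain ID of the last ATOM record
    for line in lines:
        if line.startswith("TER"):
            pending_ter = line
            continue
        if line.startswith("ATOM"):
            chain = line[21]
            if pending_ter is not None:
                if chain != prev_chain:
                    out.append(pending_ter)  # real chain end: keep it
                pending_ter = None
            prev_chain = chain
        elif pending_ter is not None:
            out.append(pending_ter)
            pending_ter = None
        out.append(line)
    if pending_ter is not None:
        out.append(pending_ter)
    return out
```

Holding the TER back until the next record is read is what lets the function decide whether it closes a chain or merely marks a break inside one.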