haddocking / pdb-tools

A dependency-free cross-platform swiss army knife for PDB files.
https://haddocking.github.io/pdb-tools/
Apache License 2.0

`pdb_tidy` removes the `TER` record between chains and removes last `ENDMDL` in a multi-model PDB #155

Open rvhonorato opened 1 year ago

rvhonorato commented 1 year ago

Describe the bug

pdb_tidy removes the TER record between chains and removes last ENDMDL in a multi-model PDB.

To Reproduce

  1. test.pdb

    MODEL        1
    ATOM      1    N THR A   1      17.047  14.099   3.625  1.00 13.79       N  
    TER       2      THR A   1
    ATOM      3    N THR B   1      11.047  11.099  11.625  0.00  0.00       N  
    TER       4      THR B   1
    ENDMDL
    MODEL        2
    ATOM      1   CA ARG A  10       8.496   4.609   8.837  1.00  3.38       C  
    TER       2      ARG A  10
    ATOM      3   CA ARG B  10      22.496  22.609  22.837  1.00  3.38       C  
    TER       4      TPO B 197
    HETATM    5    N TPO B 197      21.891   2.133 -14.748  1.00 38.81       N  
    TER       6      TPO B 197
    ENDMDL
  2. pdb_tidy test.pdb > tidy.pdb

  3. $ cat tidy.pdb
    MODEL        1
    ATOM      1    N THR A   1      17.047  14.099   3.625  1.00 13.79       N
    ATOM      3    N THR B   1      11.047  11.099  11.625  0.00  0.00       N
    TER       4      THR B   1
    ENDMDL
    MODEL        2
    ATOM      1   CA ARG A  10       8.496   4.609   8.837  1.00  3.38       C
    TER       2      ARG A  10
    ATOM      4   CA ARG B  10      22.496  22.609  22.837  1.00  3.38       C
    TER       5      ARG B  10
    HETATM    7    N TPO B 197      21.891   2.133 -14.748  1.00 38.81       N
    END
  4. diff test.pdb tidy.pdb
    1,14c1,12
    < MODEL        1
    < ATOM      1    N THR A   1      17.047  14.099   3.625  1.00 13.79       N
    < TER       2      THR A   1
    < ATOM      3    N THR B   1      11.047  11.099  11.625  0.00  0.00       N
    < TER       4      THR B   1
    < ENDMDL
    < MODEL        2
    < ATOM      1   CA ARG A  10       8.496   4.609   8.837  1.00  3.38       C
    < TER       2      ARG A  10
    < ATOM      3   CA ARG B  10      22.496  22.609  22.837  1.00  3.38       C
    < TER       4      TPO B 197
    < HETATM    5    N TPO B 197      21.891   2.133 -14.748  1.00 38.81       N
    < TER       6      TPO B 197
    < ENDMDL
    ---
    > MODEL        1
    > ATOM      1    N THR A   1      17.047  14.099   3.625  1.00 13.79       N
    > ATOM      3    N THR B   1      11.047  11.099  11.625  0.00  0.00       N
    > TER       4      THR B   1
    > ENDMDL
    > MODEL        2
    > ATOM      1   CA ARG A  10       8.496   4.609   8.837  1.00  3.38       C
    > TER       2      ARG A  10
    > ATOM      4   CA ARG B  10      22.496  22.609  22.837  1.00  3.38       C
    > TER       5      ARG B  10
    > HETATM    7    N TPO B 197      21.891   2.133 -14.748  1.00 38.81       N
    > END

Expected behavior

The TER records between chains and the last ENDMDL record should both be kept.
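This expectation can be checked mechanically by comparing record counts before and after tidying. The sketch below is not part of pdb-tools: `count_records` is a hypothetical helper, and the two strings are just the first model from the report, copied by hand rather than produced by running `pdb_tidy`.

```python
from collections import Counter


def count_records(pdb_text):
    """Count PDB records by record name (columns 1-6, stripped)."""
    counts = Counter()
    for line in pdb_text.splitlines():
        if line.strip():
            counts[line[:6].strip()] += 1
    return counts


# First model of test.pdb as given in the report.
original = """\
MODEL        1
ATOM      1    N THR A   1      17.047  14.099   3.625  1.00 13.79       N
TER       2      THR A   1
ATOM      3    N THR B   1      11.047  11.099  11.625  0.00  0.00       N
TER       4      THR B   1
ENDMDL
"""

# First model of tidy.pdb as given in the report.
tidied = """\
MODEL        1
ATOM      1    N THR A   1      17.047  14.099   3.625  1.00 13.79       N
ATOM      3    N THR B   1      11.047  11.099  11.625  0.00  0.00       N
TER       4      THR B   1
ENDMDL
"""

before = count_records(original)
after = count_records(tidied)
for name in ("MODEL", "TER", "ENDMDL"):
    status = "ok" if before[name] == after[name] else "MISMATCH"
    print(f"{name:<6} {before[name]} -> {after[name]}  {status}")
```

Run against the full files, the same comparison would also flag the missing final ENDMDL.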

Desktop (please complete the following information):

Distributor ID: Ubuntu
Description:    Ubuntu 22.04.1 LTS
Release:        22.04
Codename:       jammy
$ python --version
Python 3.11.2
$ pip show pdb-tools
Name: pdb-tools
Version: 2.5.0
Summary: A swiss army knife for PDB files.
Home-page: http://bonvinlab.org/pdb-tools
Author: Joao Rodrigues
Author-email: j.p.g.l.m.rodrigues@gmail.com
License: Apache Software License, version 2
Location: /home/rodrigo/.pyenv/versions/3.11.2/lib/python3.11/site-packages
Requires:
Required-by:

joaomcteixeira commented 1 year ago

note:

If we repeat the first line to simulate having two atoms before the first TER, the TER is not removed.

The same does not happen with the HETATM entry.

JoaoRodrigues commented 1 year ago

Thanks for the report @rvhonorato, we'll have a look.

rvhonorato commented 1 year ago

This is probably an edge case, since the test PDB is not realistic and pdb_tidy works for "real" structures. Still, it could be indicative of some underlying issue.

Let me know if there's any way I can help.

JoaoRodrigues commented 1 year ago

I had a look at the format specification and it seems to hint that TER statements do not apply after HETATM. Only at the terminus of a (linked) chain. Checking a couple of random PDBs does reinforce that:

rvhonorato commented 1 year ago

It's indeed not very clear. Looking at https://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#TER

Every chain of ATOM/HETATM records presented on SEQRES records is terminated with a TER record.

and https://www.cgl.ucsf.edu/chimera/docs/UsersGuide/tutorials/pdbintro.html

indicates the end of a chain of residues. For example, a hemoglobin molecule consists of four subunit chains that are not connected. TER indicates the end of a chain and prevents the display of a connection to the next chain.

And deeper into the SEQRES record: https://www.wwpdb.org/documentation/file-format-content/format33/sect3.html#SEQRES

SEQRES records contain a listing of the consecutive chemical components covalently linked in a linear fashion to form a polymer. The chemical components included in this listing may be standard or modified amino acid and nucleic acid residues. It may also include other residues that are linked to the standard backbone in the polymer. Chemical components or groups covalently linked to side-chains (in peptides) or sugars and/or bases (in nucleic acid polymers) will not be listed here.

So that seems to imply that there is some relation between TER and SEQRES. Since the PDBs might not have SEQRES records to pull the limits from, it's probably OK to follow the convention of always having a TER between chains of ATOM records, and additionally a TER at chain breaks (non-continuous numbering in ATOM) when using the strict option, which I think already exists, right?
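To make the proposed convention concrete, here is a minimal sketch, assuming a TER goes after the last ATOM record of each chain within each model and never after HETATM. It is not how pdb_tidy is implemented: `ter_after_last_atom` is a hypothetical name, serial numbers are not renumbered (the TER serial field is left blank), and records such as ANISOU are not handled.

```python
def ter_after_last_atom(lines):
    """Return lines with one TER after the last ATOM record of each chain.

    Existing TER records are dropped and re-derived per model.
    Assumptions: fixed-column PDB lines; blank TER serial field.
    """
    lines = [l for l in lines if l[:6].strip() != "TER"]

    # Index of the last ATOM line for every (model, chain ID) pair.
    last_atom = {}
    model = 0
    for i, line in enumerate(lines):
        record = line[:6].strip()
        if record == "MODEL":
            model += 1
        elif record == "ATOM":
            last_atom[(model, line[21])] = i

    ter_after = set(last_atom.values())
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        if i in ter_after:
            # Columns 18-27 carry resName, chainID, resSeq and iCode.
            out.append("TER" + " " * 14 + line[17:27].rstrip())
    return out


# The tidied output from the report, with the final ENDMDL restored by hand.
example = [
    "MODEL        1",
    "ATOM      1    N THR A   1      17.047  14.099   3.625  1.00 13.79       N",
    "ATOM      3    N THR B   1      11.047  11.099  11.625  0.00  0.00       N",
    "TER       4      THR B   1",
    "ENDMDL",
    "MODEL        2",
    "ATOM      1   CA ARG A  10       8.496   4.609   8.837  1.00  3.38       C",
    "ATOM      4   CA ARG B  10      22.496  22.609  22.837  1.00  3.38       C",
    "HETATM    7    N TPO B 197      21.891   2.133 -14.748  1.00 38.81       N",
    "ENDMDL",
]

result = ter_after_last_atom(example)
print("\n".join(result))
```

Looking ahead for the last ATOM of each chain, rather than reacting to the next record, keeps a modified residue written as HETATM in the middle of a chain from triggering an early TER.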

amjjbonvin commented 1 year ago

Yes - better too few than too many TER statements.

Adding a TER statement at every chain break (even within a chain) is dangerous, since it implies there is a real end of the chain there, meaning some software will interpret it as a place where there should be charged termini.

"and additionally a TER between chain breaks (non-continuous numbering in ATOM) using the strict options, which I think already exists, right?"
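Breaks in residue numbering can be reported without writing any TER record, which avoids implying a real chain terminus. A minimal sketch, assuming fixed-column ATOM lines and ignoring insertion codes; `find_numbering_gaps` is a hypothetical helper, not a pdb-tools function, and the example coordinates are made up.

```python
def find_numbering_gaps(lines):
    """List (chain, previous_resseq, next_resseq) for each numbering jump.

    Only ATOM records are inspected; tracking restarts at every MODEL.
    """
    gaps = []
    prev = {}  # chain ID -> last residue number seen in the current model
    for line in lines:
        record = line[:6].strip()
        if record == "MODEL":
            prev.clear()
        elif record == "ATOM":
            chain, resseq = line[21], int(line[22:26])
            last = prev.get(chain)
            if last is not None and resseq - last > 1:
                gaps.append((chain, last, resseq))
            prev[chain] = resseq
    return gaps


# Illustrative lines only: chain A skips residues 3 and 4.
example = [
    "ATOM      1  CA  ALA A   1       0.000   0.000   0.000  1.00  0.00           C",
    "ATOM      2  CA  ALA A   2       3.800   0.000   0.000  1.00  0.00           C",
    "ATOM      3  CA  ALA A   5      15.200   0.000   0.000  1.00  0.00           C",
    "ATOM      4  CA  ALA B   1       0.000  10.000   0.000  1.00  0.00           C",
]

gaps = find_numbering_gaps(example)
for chain, start, stop in gaps:
    print(f"chain {chain}: numbering jumps from {start} to {stop}")
```

Whether such a jump is a true break or just a numbering convention still has to be decided by the user or by comparing against SEQRES when it is available.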

rvhonorato commented 1 year ago

Software that interprets the PDB format should cross-reference the TER records with SEQRES to decide whether a break is a true one. It's unlikely that this behaviour covers PDBs obtained from non-experimental methods, though; in that case (older) tools might indeed just assume it's the OXT.

+1 for fewer TER records for the sake of compatibility, but the bug above is still relevant.

amjjbonvin commented 7 months ago

Any news on this? Is it still relevant or implemented already?