Atom names are misaligned in PDB files

jbarnoud commented 3 years ago

The atom names we write are fully left-aligned. They should instead be split into 2 parts: the first 2 characters hold the element, right-aligned, and the last 2 characters hold the id, left-aligned. We write CA.. while we should be writing .CA. (with . standing for a space).

Here is a legit PDB:

ATOM      1  N   MET A   1      19.417  30.346  31.974  1.00 36.22           N
ATOM      2  CA  MET A   1      20.836  30.450  32.135  1.00 15.67           C
ATOM      3  C   MET A   1      21.323  31.688  32.833  1.00 22.07           C
ATOM      4  O   MET A   1      20.720  32.758  32.821  1.00 20.34           O
ATOM      5  CB  MET A   1      21.614  30.236  30.847  1.00 30.02           C
ATOM      6  CG  MET A   1      22.952  29.658  31.203  1.00 18.49           C
ATOM      7  SD  MET A   1      23.469  28.287  30.119  1.00 54.30           S
ATOM      8  CE  MET A   1      21.887  27.616  29.524  1.00 25.98           C
ATOM      9  N   ASN A   2      22.482  31.490  33.412  1.00  6.28           N
ATOM     10  CA  ASN A   2      23.160  32.494  34.122  1.00  7.90           C

Here is what we write:

ATOM      1 N    MET A   1      19.417  30.346  31.974  1.00 36.22          N
ATOM      2 CA   MET A   1      20.836  30.450  32.135  1.00 15.67          C
ATOM      3 C    MET A   1      21.323  31.688  32.833  1.00 22.07          C
ATOM      4 O    MET A   1      20.720  32.758  32.821  1.00 20.34          O
ATOM      5 CB   MET A   1      21.614  30.236  30.847  1.00 30.02          C
ATOM      6 CG   MET A   1      22.952  29.658  31.203  1.00 18.49          C
ATOM      7 SD   MET A   1      23.469  28.287  30.119  1.00 54.30          S
ATOM      8 CE   MET A   1      21.887  27.616  29.524  1.00 25.98          C
ATOM      9 N    ASN A   2      22.482  31.490  33.412  1.00  6.28          N
ATOM     10 CA   ASN A   2      23.160  32.494  34.122  1.00  7.90          C

pckroon commented 3 years ago

That's only half the story, I think. An atom called HB13 (hydrogen number 3 attached to CB1) needs all four characters and, according to this scheme, should be written as 3HB1. This raises 2 questions for me: 1) How sure are we that other tools will parse this correctly (i.e. to atomname = HB13)? 2) How should we parse this ourselves?

jbarnoud commented 3 years ago

The usual solution is to not trim the name when you read it. There are also rules: where possible, what I describe above applies; otherwise, the name just takes all 4 characters. If you know the element (which we do), it is not very difficult. It is trickier when you do not know the element, as in MDAnalysis: https://github.com/MDAnalysis/mdanalysis/blob/f542aa485983f8d3dd250b36a886061f696c3e97/package/MDAnalysis/coordinates/PDB.py#L997.
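To make "not trimming" concrete, here is a minimal sketch of reading the atom name from its fixed columns. It assumes the standard PDB layout (atom name in columns 13-16, element in columns 77-78); the function names are hypothetical and are not taken from vermouth or MDAnalysis.

```python
def read_atom_name_field(line):
    """Return (raw, stripped) atom name from columns 13-16 of an ATOM record.

    The raw field keeps its spaces, so the alignment information survives.
    """
    raw = line[12:16]
    return raw, raw.strip()


def read_element_field(line):
    """Return the element from columns 77-78, or '' if the field is absent."""
    return line[76:78].strip()


# Build a truncated example record piece by piece so the columns are explicit.
# Coordinates are left blank for brevity.
example = (
    "ATOM  "          # columns 1-6: record name
    + "2".rjust(5)    # columns 7-11: serial number
    + " "             # column 12: blank
    + " CA "          # columns 13-16: atom name, element right-aligned in 13-14
    + " MET A"        # columns 17-22: altLoc, residue name, blank, chain
    + "1".rjust(4)    # columns 23-26: residue number
).ljust(76) + " C"    # columns 77-78: element

raw_name, atom_name = read_atom_name_field(example)
element = read_element_field(example)
```

Keeping `raw_name` around lets a reader tell an alpha carbon (" CA ") from a calcium ("CA  ") even when the element column is missing.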

pckroon commented 3 years ago

What do you mean by "trim" in this case? Remove the spaces? Because we do need to do that to identify the atoms.

My main question is about what to do with atoms that have 3 identifiers, rather than 2, on both the reading and writing side.

jbarnoud commented 3 years ago

Atoms that have 3 identifiers come in 2 kinds: those whose whole name fits in 4 columns, which are easy, and those whose name does not fit in 4 columns, which are invalid.

By "not trim" I did indeed mean keeping the spaces. It is the common way to solve the issue, but we do not need to do that: we know the element, so we can just apply the rule. It is a matter of 3 lines of code.

pckroon commented 3 years ago

> ... because we know the element...

This is not quite true. Although the PDB spec says the element field is mandatory, plenty of tools out there omit it.

jbarnoud commented 3 years ago

This is even truer since we write CG structures, where the element is not relevant. My current approach is to do the right thing if I have the element and to just left-align if I don't.
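As a rough illustration of that approach (not the actual vermouth code), the writer side could look like the sketch below. The function name `format_atom_name` is hypothetical, and the check that the name starts with the element symbol is an extra assumption added here to stay on the safe side.

```python
def format_atom_name(name, element=None):
    """Return the 4-character atom name field for columns 13-16.

    Best effort: align on the element when it is known, left-align otherwise.
    """
    name = name.strip()
    if len(name) > 4:
        raise ValueError(f"Atom name {name!r} does not fit in 4 columns.")
    element = (element or "").strip()
    # A one-letter element is right-aligned in the first 2 columns, so the
    # name starts in column 14. This only works if the name leaves room.
    if (
        len(element) == 1
        and len(name) < 4
        and name.upper().startswith(element.upper())
    ):
        return (" " + name).ljust(4)
    # Two-letter elements, 4-character names, and unknown elements
    # (e.g. CG beads) all start in column 13.
    return name.ljust(4)
```

With this, `format_atom_name("CA", "C")` gives `" CA "` for an alpha carbon, `format_atom_name("CA", "CA")` gives `"CA  "` for a calcium, and a bead name without element such as `format_atom_name("BB")` is simply left-aligned.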

pckroon commented 3 years ago

A "best effort" approach sounds like a good idea. There might also be cases where the element is not part of the atom name. We can do something similar on the parser side: if the atomname starts with a number and contains at least one letter, move the numbers to the end.
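A minimal sketch of that parser-side rule, assuming the name has already been cut out of columns 13-16. The function name `normalize_atom_name` is made up for this example.

```python
import re

# Leading digits followed by a remainder that contains at least one letter.
_LEADING_DIGITS = re.compile(r"^(\d+)(.*[A-Za-z].*)$")


def normalize_atom_name(raw):
    """Strip the field and move any leading digits to the end of the name."""
    name = raw.strip()
    match = _LEADING_DIGITS.match(name)
    if match:
        digits, rest = match.groups()
        return rest + digits
    return name
```

Under this rule `normalize_atom_name("3HB1")` returns `"HB13"`, while names without leading digits (or consisting only of digits) are returned unchanged apart from the stripped spaces.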

marrink-lab / vermouth-martinize

Atom names are misaligned in PDB files #345