openmm / pdbfixer

PDBFixer fixes problems in PDB files

Residue number strange behavior #201

Open MauriceKarrenbrock opened 4 years ago

MauriceKarrenbrock commented 4 years ago

Hello,

I found strange behavior when repairing 6w02 from the wwPDB. It happens with both the PDB and the mmCIF file; the example below uses the PDB file.

When I repair the 6w02 protein, the first lines of the repaired PDB look like this:

ATOM      1  N   SER A9998     -22.468 -28.218   2.850  1.00  0.00           N
...

ATOM      7  N   ASN A9999     -20.179 -26.195   2.240  1.00  0.00           N
...

ATOM     15  N   ALA A   0     -17.786 -23.244   1.425  1.00  0.00           N
...

ATOM     20  N   GLY A   1     -14.267 -21.382   1.160  1.00  0.00           N

As you can see, the first residue is numbered 9998, which then leads to a residue 0 (zero); this happens for both chain A and chain B. Since the original PDB starts from residue 4, this doesn't make much sense.

Having these two residues numbered 0 (one in chain A, one in chain B) causes serious problems when processing the structure with tools like Biopython.
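The numbering jump described above can be checked without any structure library. The following is a minimal sketch, assuming the standard fixed-column ATOM layout (chain ID in column 22, residue number in columns 23-26) and ignoring insertion codes. The helper names `residue_ids` and `numbering_drops` are hypothetical and not part of pdbfixer or OpenMM; the sample lines are the four ATOM records quoted in this report.

```python
def residue_ids(pdb_lines):
    """Return (chain ID, residue number) pairs in order of first appearance."""
    ids = []
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        chain = line[21]            # column 22: chain ID
        number = int(line[22:26])   # columns 23-26: residue sequence number
        if not ids or ids[-1] != (chain, number):
            ids.append((chain, number))
    return ids


def numbering_drops(ids):
    """Return consecutive pairs within one chain where the number decreases."""
    return [
        (prev, cur)
        for prev, cur in zip(ids, ids[1:])
        if prev[0] == cur[0] and cur[1] < prev[1]
    ]


# The four ATOM records quoted in this report.
sample = [
    "ATOM      1  N   SER A9998     -22.468 -28.218   2.850  1.00  0.00           N",
    "ATOM      7  N   ASN A9999     -20.179 -26.195   2.240  1.00  0.00           N",
    "ATOM     15  N   ALA A   0     -17.786 -23.244   1.425  1.00  0.00           N",
    "ATOM     20  N   GLY A   1     -14.267 -21.382   1.160  1.00  0.00           N",
]

ids = residue_ids(sample)
print(ids)
print(numbering_drops(ids))
```

Running the same two functions over the lines of the full output file would list every place where the numbering decreases within a chain.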

import pdbfixer
import simtk.openmm.app  # newer OpenMM releases expose this module as openmm.app

input_file_name = 'pdb6w02.pdb'
output_file_name = 'output6w02.pdb'

with open(input_file_name, 'r') as f:
    fixer = pdbfixer.PDBFixer(pdbfile=f)

    # Find and rebuild missing residues, nonstandard residues, and missing atoms
    fixer.findMissingResidues()
    fixer.findNonstandardResidues()
    fixer.replaceNonstandardResidues()
    fixer.findMissingAtoms()
    fixer.addMissingAtoms()

# keepIds=True writes the residue and chain IDs stored in the topology
with open(output_file_name, 'w') as f:
    simtk.openmm.app.PDBFile.writeFile(fixer.topology, fixer.positions, f, keepIds=True)

Both the input and output files are attached as .txt files.

Thank you very much and have a nice day

output6w02.txt pdb6w02.txt

peastman commented 4 years ago

This problem is intrinsic to the PDB format. It only gives four columns for the residue ID, which means a strictly compliant PDB file can never have more than 10,000 residues. It also only gives five columns for the atom ID, so you're limited to 100,000 atoms, and one column for the chain ID, which limits you to 26 chains (since chain IDs are supposed to be upper case letters).

Of course, people frequently try to write larger systems to PDB files, so a variety of non-compliant hacks get used to deal with that. Wrapping the IDs back to 0 is one of the more common ones.
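The field-width limits and the wrapping hack mentioned above can be illustrated with a short sketch. It assumes the standard PDB column widths (4 for the residue number, 5 for the atom serial) and one possible wrapping convention, modulo 10^width. The helper `wrapped_field` is hypothetical and is not claimed to be what OpenMM or any other writer actually does.

```python
RESSEQ_WIDTH = 4   # residue number field, columns 23-26
SERIAL_WIDTH = 5   # atom serial field, columns 7-11


def wrapped_field(value, width):
    """Right-justify value in a fixed-width field, wrapping modulo 10**width."""
    return f"{value % 10**width:>{width}d}"


# Residue numbers around the 4-column limit
for n in (9998, 9999, 10000, 10001):
    print(repr(wrapped_field(n, RESSEQ_WIDTH)))

# Atom serial at the 5-column limit
print(repr(wrapped_field(100000, SERIAL_WIDTH)))
```

Note that Python's `%` returns a non-negative result, so under this convention a residue number of -2 would also be written as 9998. Whether anything like that is related to the numbering reported in this issue is not established in the thread.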

The real solution, though, is to write to a PDBx/mmCIF file instead. It's the successor to the PDB format, and it fixes these problems and many others. Just change PDBFile to PDBxFile.

MauriceKarrenbrock commented 4 years ago

I see, but as I said, even though I didn't include it in the example, exactly the same thing happens when using the mmCIF/PDBx format, so the problem is format independent. In any case it would still make no sense: the 6w02 protein has only a few hundred residues, and the residues labeled 9998 and 9999 are the first and second ones, not the last.

This problem has happened only with this specific protein, so I guess it might be a "pathological" case, but it could reveal a subtle bug. And since 6w02 is a protein of the SARS-CoV-2 virus, many other researchers could benefit from understanding why pdbfixer behaves like this.

Here is the .cif file with the exact same problem: 6w02_test_output.txt

Thank you very much for your time and have a nice day.