kylelutz / chemkit

A C++ library for molecular modelling, cheminformatics and molecular visualization.
http://www.chemkit.org
BSD 3-Clause "New" or "Revised" License

Segmentation fault for PDB parsing of files with inconsistent HELIX and ATOM specifications #38

Closed timostrunk closed 5 years ago

timostrunk commented 7 years ago

This commit fixes a few segmentation faults that occur when reading many PDB files found in the RCSB, and especially files from the Top8000 conformer database.

The bug is in the parsing of HELIX and SHEET lines. These lines reference residue ids in the PDB file; however, the following line:

for(int residue = pdbConformation->firstResidue(); residue < pdbConformation->lastResidue(); residue++){
    int internal_id = pdbChain->getResIdfromPdbResId(residue);

references an internal id, i.e. the order in which residues were found inside the PDB. The residue ids in the SHEET and HELIX definitions can be:

a) discontinuous -> segfault
b) not actually inside the PDB -> segfault
c) starting from negative indices -> segfault

Example: 3of4FH_C in the Top8000 set.

The solution I implemented: when parsing the PDB file, we keep track of the mapping from internal id to PDB residue id and vice versa, and use it whenever a HELIX or SHEET record has to be resolved to residues. This solution cannot always work, especially in case b): we cannot map secondary structure to residues that are not inside the PDB. In cases a) and b) a new warning is therefore printed.
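The bookkeeping described above could look roughly like the following sketch. It is not the code from this commit: `ResidueIdMap`, `ResolvedRange`, `resolveRange` and `warnMissing` are hypothetical names used only for illustration, and the range is treated as inclusive on both ends, which is an assumption (the loop quoted earlier stops before the last residue).

```cpp
#include <cstddef>
#include <iostream>
#include <map>
#include <vector>

// Hypothetical helper (not part of chemkit): maps PDB residue ids to
// internal ids (order of first appearance in the file) and back.
class ResidueIdMap {
public:
    // Register a PDB residue id; returns its internal id.
    // Calling it again with the same id returns the existing internal id.
    int add(int pdbResId) {
        std::map<int, int>::const_iterator it = pdbToInternal_.find(pdbResId);
        if (it != pdbToInternal_.end())
            return it->second;
        int internal = static_cast<int>(internalToPdb_.size());
        pdbToInternal_[pdbResId] = internal;
        internalToPdb_.push_back(pdbResId);
        return internal;
    }

    // PDB residue id -> internal id. Returns false if the residue is unknown.
    bool internalId(int pdbResId, int &out) const {
        std::map<int, int>::const_iterator it = pdbToInternal_.find(pdbResId);
        if (it == pdbToInternal_.end())
            return false;
        out = it->second;
        return true;
    }

    // Internal id -> PDB residue id. Returns false if out of range.
    bool pdbId(int internal, int &out) const {
        if (internal < 0 || static_cast<std::size_t>(internal) >= internalToPdb_.size())
            return false;
        out = internalToPdb_[internal];
        return true;
    }

private:
    std::map<int, int> pdbToInternal_;  // handles negative and discontinuous ids
    std::vector<int> internalToPdb_;
};

// Result of resolving one HELIX/SHEET range.
struct ResolvedRange {
    std::vector<int> internalIds;    // residues that exist in the chain
    std::vector<int> missingPdbIds;  // referenced but not present
};

// Resolve PDB residue ids first..last (inclusive, an assumption) to
// internal ids, collecting ids that cannot be mapped instead of indexing blindly.
ResolvedRange resolveRange(const ResidueIdMap &map, int first, int last) {
    ResolvedRange result;
    for (int id = first; id <= last; ++id) {
        int internal = 0;
        if (map.internalId(id, internal))
            result.internalIds.push_back(internal);
        else
            result.missingPdbIds.push_back(id);
    }
    return result;
}

// Print one warning per residue that a secondary structure record
// referenced but that was not found among the parsed residues.
void warnMissing(const ResolvedRange &range) {
    for (std::size_t i = 0; i < range.missingPdbIds.size(); ++i)
        std::cerr << "warning: secondary structure record references residue "
                  << range.missingPdbIds[i] << ", which is not in the chain\n";
}
```

A lookup keyed by the PDB residue id avoids all three failure modes: gaps and negative ids are ordinary map keys, and ids that are absent are reported through `missingPdbIds` rather than dereferenced.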

With these changes, the PDB reader code is valgrind safe again when parsing the Top8000 set.

If required, I can supply a minimal test case and a few PDB files to test my commit.