steineggerlab / foldcomp

Compressing protein structures effectively with torsion angles
GNU General Public License v3.0

Error compressing `PDB` #35

Open valentynbez opened 1 year ago

valentynbez commented 1 year ago

Hello,

I am trying to compress PDB files and keep getting the same error. I tried changing all extensions from .ent to .pdb and rewriting the PDB files with ProDy, so that everything unnecessary is removed from the files themselves.

Compressing files in correct_pdb using 32 threads
Output directory: pdb_foldcomp
terminate called after throwing an instance of 'std::out_of_range'
  what():  map::at
Aborted (core dumped)

If I try per-file compression, it writes only a single file and quits. It would also be nice to see which file is being processed, in case the error is caused by the PDB contents.

Cheers, V
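
Until the tool reports file names itself, one way to narrow this down is to run the compressor once per file in a separate process and collect the inputs whose run fails. The sketch below assumes the `foldcomp compress <input> <output>` command-line form and that the inputs end in `.pdb`. The function name `find_failing_files` and the `.fcz` output naming are illustrative choices, not part of foldcomp.

```python
import subprocess
from pathlib import Path


def find_failing_files(pdb_dir, out_dir, command=("foldcomp", "compress")):
    """Run the compressor once per file and return the inputs that failed.

    `command` is the program prefix; the input and output paths are appended
    to it (assumption: the tool accepts `<command> <input> <output>`).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failures = []
    for pdb_path in sorted(Path(pdb_dir).glob("*.pdb")):
        fcz_path = out_dir / (pdb_path.stem + ".fcz")
        # One process per file: an abort only ends that single run,
        # so the loop continues with the remaining files.
        result = subprocess.run(
            [*command, str(pdb_path), str(fcz_path)],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            failures.append((pdb_path.name, result.returncode))
    return failures
```

Calling `find_failing_files("correct_pdb", "pdb_foldcomp")` would return a list of `(file name, return code)` pairs. On POSIX systems a process that aborts the way the log above shows typically reports a negative return code, so those files stand out from ordinary failures.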

khb7840 commented 1 year ago

Thanks for the feedback. I'll implement a verbosity option that logs errors together with the name of the file being processed. Since foldcomp was initially designed to handle predicted structures without discontinuities, we haven't checked all the possible error cases in real data. To find the cause of the error, it would help if you could share the preprocessing script you use for the PDB files.
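
Since discontinuity is mentioned as a case that has not been tested, a quick check of whether a given file has gaps in its residue numbering may help when comparing failing and passing inputs. The sketch below reads the fixed-column `ATOM` records directly and looks only at CA atoms. The function name `find_residue_gaps` is hypothetical. A numbering jump does not always mean residues are missing, and this check does not confirm that discontinuity is the cause of the `map::at` error.

```python
def find_residue_gaps(pdb_lines, chain_id=None):
    """Return (chain, previous, next) tuples where residue numbering jumps by more than 1."""
    gaps = []
    previous_chain = None
    previous_number = None
    for line in pdb_lines:
        # Only CA atoms from ATOM records: one per standard residue.
        if not line.startswith("ATOM") or line[12:16].strip() != "CA":
            continue
        chain = line[21]  # chain identifier, column 22 of the PDB format
        if chain_id is not None and chain != chain_id:
            continue
        number = int(line[22:26])  # residue sequence number, columns 23-26
        if chain != previous_chain:
            # A new chain starts its own numbering; do not compare across chains.
            previous_number = None
        if previous_number is not None and number - previous_number > 1:
            gaps.append((chain, previous_number, number))
        previous_chain = chain
        previous_number = number
    return gaps
```

The function accepts any iterable of lines, so an open file object can be passed directly, for example `find_residue_gaps(open("7db5.pdb"), chain_id="A")`, where the chain ID is only a placeholder.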

valentynbez commented 1 year ago

Thanks for the answer. I would be really grateful for help, and I think having a foldcomp db of experimental structures would be awesome! I tried different approaches; here is a snippet for my test data (https://www.rcsb.org/structure/7db5):

from prody import parsePDBStream, writePDB
from pathlib import Path

file = "databases/pdb_structures/7db5.pdb"
outfolder = "."

file = Path(file)
filename = file.name
outfolder = Path(outfolder)
outfile = outfolder / filename

with open(str(file)) as f:
    pdb = parsePDBStream(f)

# get the ID of the first chain in the pdb file
first_chain = next(iter(pdb.iterChains())).getChid()
with open(str(file)) as f:
    pdb = parsePDBStream(f, chain=first_chain)
writePDB(str(outfile), pdb)

# read the written file back so its first line can be overwritten
with open(str(outfile), "r") as f:
    lines = f.readlines()

# replace the leading REMARK line with a TITLE record
lines[0] = "TITLE     " + filename.split(".")[0] + "\n"
with open(str(outfile), "w") as f:
    f.writelines(lines)