openmm / pdbfixer

PDBFixer fixes problems in PDB files

PDBFixer cannot handle nucleic acids with residue name N #281

Open jamesmkrieger opened 1 year ago

jamesmkrieger commented 1 year ago
    fixer.addMissingAtoms()
  File "/home/jkrieger/software/miniconda/envs/prody-github/lib/python3.9/site-packages/pdbfixer/pdbfixer.py", line 902, in addMissingAtoms
    (newTopology, newPositions, newAtoms, existingAtomMap) = self._addAtomsToTopology(True, True)
  File "/home/jkrieger/software/miniconda/envs/prody-github/lib/python3.9/site-packages/pdbfixer/pdbfixer.py", line 400, in _addAtomsToTopology
    self._addMissingResiduesToChain(newChain, insertHere, startPosition, endPosition, loopDirection, residue, newAtoms, newPositions, firstIndex)
  File "/home/jkrieger/software/miniconda/envs/prody-github/lib/python3.9/site-packages/pdbfixer/pdbfixer.py", line 511, in _addMissingResiduesToChain
    template = self.templates[residueName]
KeyError: 'N'

This was triggered by 7s7b.cif, downloaded from the PDB.

peastman commented 1 year ago

That's one I haven't seen before. What is N supposed to mean? Is this file trying to use the nucleotide sequence search codes, where N means "accept any nucleotide at this position"?

jamesmkrieger commented 1 year ago

Perhaps it means they don’t know which nucleotide is there because they don’t have enough resolution.

peastman commented 1 year ago

Maybe, but the sequence in a PDB file is supposed to be a real sequence, not IUPAC codes. Oh well, I guess someone has figured out yet another way to make a messed up PDB file!

What should we do in this situation? You've told it to add missing residues based on the sequence, but the sequence doesn't tell us what to add at that position.

jamesmkrieger commented 1 year ago

Yeah, it is another strange thing to have

Perhaps raise a warning and skip that residue rather than completely stopping?
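
Something like the following is what I have in mind. This is only a rough sketch, not PDBFixer code: `drop_unknown_residues` and `KNOWN_TEMPLATES` are hypothetical names, and I'm assuming the missing residues are held as a dict mapping (chain index, residue index) to a list of residue names.

```python
import warnings

# Hypothetical stand-in for the residue names that have templates.
KNOWN_TEMPLATES = {"A", "C", "G", "U", "DA", "DC", "DG", "DT"}


def drop_unknown_residues(missing_residues, known_templates=KNOWN_TEMPLATES):
    """Return a copy of missing_residues without names that lack a template.

    missing_residues is assumed to map (chain index, residue index) to a
    list of residue names to insert at that position.
    """
    cleaned = {}
    for key, names in missing_residues.items():
        kept = []
        for offset, name in enumerate(names):
            if name in known_templates:
                kept.append(name)
            else:
                # Warn instead of failing with a KeyError later on.
                warnings.warn(
                    f"Skipping residue {name!r} at {key}, offset {offset}: "
                    "no template available"
                )
        if kept:  # drop gaps that end up empty
            cleaned[key] = kept
    return cleaned
```

The caller would then use the cleaned dict in place of the original one before adding atoms.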

sukritsingh commented 5 months ago

Out of curiosity I dug through the source paper, and it's not even clear that it's just a single nucleotide being skipped (it truly seems like a bit of a sloppy entry).

I think skipping on "N" seems a bit dangerous because it's not even clear how many nucleotides are supposed to take its place. Perhaps the safest bet is to just throw a clearer error message indicating that the entry has unrecognized single-letter codes for nucleotides, and indicate the index where that happens?
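
As a rough illustration of that kind of check (a sketch only, not existing PDBFixer code; `check_sequence_codes` is a hypothetical helper, and I'm assuming the sequences are available as a mapping from chain ID to a list of residue names):

```python
def check_sequence_codes(sequences, known_templates):
    """Raise a descriptive error if any residue name has no template.

    sequences is assumed to map chain ID -> list of residue names;
    known_templates is the set of names that can actually be built.
    """
    problems = [
        (chain_id, index, name)
        for chain_id, names in sequences.items()
        for index, name in enumerate(names)
        if name not in known_templates
    ]
    if problems:
        details = ", ".join(
            f"chain {chain_id} index {index}: {name!r}"
            for chain_id, index, name in problems
        )
        raise ValueError(
            "Sequence contains residue names with no template "
            f"(0-based positions): {details}"
        )
```

Running this before any residues are added would replace the bare `KeyError: 'N'` with a message that names the chain and position of each unrecognized code.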