PDBFixer cannot handle nucleic acids with residue name N

openmm / pdbfixer

PDBFixer fixes problems in PDB files

PDBFixer cannot handle nucleic acids with residue name N #281

Open jamesmkrieger opened 8 months ago

jamesmkrieger commented 8 months ago

   fixer.addMissingAtoms()
  File "/home/jkrieger/software/miniconda/envs/prody-github/lib/python3.9/site-packages/pdbfixer/pdbfixer.py", line 902, in addMissingAtoms
    (newTopology, newPositions, newAtoms, existingAtomMap) = self._addAtomsToTopology(True, True)
  File "/home/jkrieger/software/miniconda/envs/prody-github/lib/python3.9/site-packages/pdbfixer/pdbfixer.py", line 400, in _addAtomsToTopology
    self._addMissingResiduesToChain(newChain, insertHere, startPosition, endPosition, loopDirection, residue, newAtoms, newPositions, firstIndex)
  File "/home/jkrieger/software/miniconda/envs/prody-github/lib/python3.9/site-packages/pdbfixer/pdbfixer.py", line 511, in _addMissingResiduesToChain
    template = self.templates[residueName]
KeyError: 'N'

This was triggered by 7s7b.cif, downloaded from the PDB.

peastman commented 8 months ago

That's one I haven't seen before. What is N supposed to mean? Is this file trying to use the nucleotide sequence search codes, where N means "accept any nucleotide at this position"?

jamesmkrieger commented 8 months ago

Perhaps it means they don’t know which nucleotide is there because they don’t have enough resolution.

peastman commented 8 months ago

Maybe, but the sequence in a PDB file is supposed to be a real sequence, not IUPAC codes. Oh well, I guess someone has figured out yet another way to make a messed up PDB file!

What should we do in this situation? You've told it to add missing residues based on the sequence, but the sequence doesn't tell us what to add at that position.

jamesmkrieger commented 8 months ago

Yeah, it is another strange thing to have

Perhaps raise a warning and skip that residue rather than completely stopping?
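
A minimal sketch of the warn-and-skip idea, not PDBFixer's actual code. It assumes the missing residues are held in a dict mapping a (chain index, residue index) key to a list of residue names, loosely modeled on what `findMissingResidues()` produces, and that `templates` is any container of known residue names. The function name `drop_unknown_missing_residues` is hypothetical.

```python
import warnings


def drop_unknown_missing_residues(missing_residues, templates):
    """Return a filtered copy of missing_residues (hypothetical helper).

    Assumption: missing_residues maps (chain_index, residue_index) to a list
    of residue names, and templates contains every name that can be built.
    Names without a template are skipped with a warning instead of raising
    KeyError later on.
    """
    filtered = {}
    for (chain_index, residue_index), names in missing_residues.items():
        kept = []
        for name in names:
            if name in templates:
                kept.append(name)
            else:
                warnings.warn(
                    f"Skipping missing residue {name!r} in chain index "
                    f"{chain_index} (before residue index {residue_index}): "
                    "no template available"
                )
        # Drop the entry entirely if nothing buildable remains.
        if kept:
            filtered[(chain_index, residue_index)] = kept
    return filtered
```

Note that skipping silently shortens the gap being filled, so the rebuilt chain may no longer match the deposited sequence.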

sukritsingh commented 1 month ago

Out of curiosity I dug through the source paper, and it's not even clear that only a single nucleotide is being skipped (it truly seems like a bit of a sloppy entry).

I think skipping on "N" seems a bit dangerous because it's not even clear how many nucleotides are supposed to take its place. Perhaps the safest bet is to throw a clearer error message indicating that the entry has unrecognized single-letter codes for nucleotides, and to report the index where that happens?
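
The clearer-error suggestion above could look roughly like the following sketch. It is not PDBFixer's implementation: `UnknownResidueError` and `check_sequence_residues` are hypothetical names, and the input is assumed to be a list of (chain ID, residue names) pairs plus a container of known template names. Indices in the message are 0-based positions within each chain's sequence.

```python
class UnknownResidueError(ValueError):
    """Hypothetical error for sequence entries that have no template."""


def check_sequence_residues(sequences, templates):
    """Raise a descriptive error if any sequence entry lacks a template.

    Assumption: sequences is an iterable of (chain_id, residue_names) pairs
    and templates contains every residue name that can be built.
    """
    problems = []
    for chain_id, residue_names in sequences:
        for index, name in enumerate(residue_names):
            if name not in templates:
                problems.append((chain_id, index, name))

    if problems:
        details = "; ".join(
            f"chain {chain_id}, index {index}: {name!r}"
            for chain_id, index, name in problems
        )
        raise UnknownResidueError(
            "The sequence contains residue names with no template, so "
            f"missing residues cannot be built ({details})"
        )
```

Running a check like this before adding atoms would turn the bare `KeyError: 'N'` into a message that names the chain and position at fault.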