openmm / pdbfixer

PDBFixer fixes problems in PDB files
Other
443 stars 112 forks source link

Error when having more than 10 chain identifiers #287

Closed FAOlivieri closed 2 weeks ago

FAOlivieri commented 3 months ago

I have a system with 9 chains. After solvating with pdbfixer, I end up with 11 chains (chain ids are 1 to 11). When pdbfixer gets to the following line:

if len(chains) == 1:
    chains[0].id = 'A'
else:
    chains[-1].id = chr(ord(chains[-2].id)+1)

it throws an error, because chains[-2].id is not parseable to 1 character. I fixed it by changing the code to

if len(chains) == 1:
    chains[0].id = 'A'
else:
    try:
        int(chains[-2].id)
    except:
        chains[-1].id = chr(ord(chains[-2].id)+1) #what happens if chain ID is more than one character? They can be two characters
    else:
        chains[-1].id = str(int(chains[-2].id)+1)

and now it works properly but what exactly is this part of the code trying to achieve? As far as I understand it's assigning to the last chain the chain ID that follows the previous one, but why is that necessary? In most systems that will be true by default. Also, as per the PDB specification, chain ID can be 2 characters, so the existing code will fail on 2 letters as well.

Couldn't this whole IF be deleted?

peastman commented 3 months ago

Chain IDs are only allowed to be a single character. Usually they're letters. See the PDB spec.

FAOlivieri commented 2 months ago

Ok. Could you clarify what the line in my post is trying to achieve?

peastman commented 2 months ago

It increments the letter. If the last chain is A, the next one is called B.

murfalo commented 3 weeks ago

Hey Peter,

I'm running into the same issue. This seems to occur because PDBFixer uses OpenMM's modeller.addSolvent, which in turn uses topology.addChain() in a number of locations without specifying an ID:

# ...creating solvent chain...
newChain = newTopology.addChain() # modeller.py line 571
# ...
# ...creating ion chain...
ionChain = modeller.topology.addChain() # modeller.py line 373

(from OpenMM modeller.py line 571 and line 373)

In these cases, the ID defaults to a string representing its index:

    def addChain(self, id=None):
        """Create a new Chain and add it to the Topology.

        Parameters
        ----------
        id : string=None
            An optional identifier for the chain.  If this is omitted, an id is
            generated based on the chain index.

        Returns
        -------
        Chain
             the newly created Chain
        """
        if id is None:
            id = str(len(self._chains)+1)
        chain = Chain(len(self._chains), self, id)
        self._chains.append(chain)
        return chain

(from OpenMM topology.py lines 120-138)

This ID will be as many characters as the index is digits, often more than a single character. In these cases, PDBFixer will throw the error seen above (i.e., ord("10") is undefined, as ord takes only single-character arguments).

Updating OpenMM to use an alphabetic chain descriptor would be straightforward, but would limit addChain to 26 chains. Maybe updating PDBFixer would be the better approach? Let me know either way and I'd be happy to contribute a fix!

Thanks, Murphy

FAOlivieri commented 3 weeks ago

Changing it to work with letters would definitely work, and would be better than my improvised solution to parse numbers and letters. Most software is already limited to 26 chain IDs anyway, so it's doubtful that it will introduce a limitation that wasn't already there

peastman commented 3 weeks ago

The important thing is that it not crash. When you actually get around to saving the structure, writeFile() defaults to keepIds=False, which means it will generate new chain IDs anyway. We just need to keep it from crashing before it gets to that point. Also, the limit of 26 chains only applies to PDB format. PDBx/mmCIF allows multiple characters for the chain ID.

How about this as a behavior?

peastman commented 2 weeks ago

Fixed by #294.