openmm / pdbfixer

PDBFixer fixes problems in PDB files
Other
498 stars 117 forks source link

Should PDBFixer write out SEQRES entries if provided in the input? #293

Open sukritsingh opened 5 months ago

sukritsingh commented 5 months ago

I recently noticed that PDBFixer does not write out SEQRES entries in the output of a PDB file when the input PDB file does have SEQRES entries. Should this be expected behavior, and is there a way we can preserve those SEQRES entries?

Example described below: If you take a fresh structure file directly from the PDB (like 2WGJ - MET kinase), you'll see that the PDB file has SEQRES entries (full lines omitted for clarity):

SEQADV 2WGJ LEU A 1272  UNP  P08581    VAL  1272 CONFLICT                       
SEQRES   1 A  306  MET VAL HIS ILE ASP LEU SER ALA LEU ASN PRO GLU LEU          
SEQRES   2 A  306  VAL GLN ALA VAL GLN HIS VAL VAL ILE GLY PRO SER SER          
.
.
.     
SEQRES  24 A  306  HIS HIS HIS HIS HIS HIS HIS    

If I run it through the following code:

from simtk.openmm import app
# Fix the raw structure
from pdbfixer import PDBFixer
fixer = PDBFixer(filename=f'./2wgj.pdb')
fixer.findMissingResidues()
fixer.findMissingAtoms()
fixer.addMissingAtoms()
app.PDBFile.writeFile(fixer.topology, fixer.positions, open('pdbfixer.pdb', 'w'), keepIds=True)

Then the top of the PDBFile output pdbfixer.pdb no longer contains the SEQRES entries:

(openmm) 10 singhs15@LSKI3394:~/Downloads$ head pdbfixer.pdb 
REMARK   1 CREATED WITH OPENMM 8.0, 2024-06-05
CRYST1   76.926   94.192   45.990  90.00  90.00  90.00 P 1           1 
ATOM      1  N   MET A1050       6.726 110.138   6.298  1.00  0.00           N  
ATOM      2  CA  MET A1050       8.044 109.513   6.101  1.00  0.00           C  
ATOM      3  CB  MET A1050       8.343 110.956   6.690  1.00  0.00           C  
ATOM      4  CG  MET A1050       8.884 111.282   8.122  1.00  0.00           C  
ATOM      5  SD  MET A1050       9.676 112.738   8.895  1.00  0.00           S  
ATOM      6  CE  MET A1050       8.218 113.578   9.573  1.00  0.00           C  
ATOM      7  C   MET A1050       7.381 108.107   6.327  1.00  0.00           C  
ATOM      8  O   MET A1050       6.565 108.569   5.512  1.00  0.00           O  

None of this is mission critical but I was wondering if there's a way to ensure PDBFixer preserves the SEQRES information when fixing residues? Seems like that should be something PDBFixer should be able to do. Tagging @jchodera who was originally discussing this with me and he suggested I open an issue thread

peastman commented 5 months ago

It's an interesting question. It's certainly possible to write it out. The question is what exactly we should write?

The spec is a bit confusing about this. It says, "The residues presented in the ATOM records must agree with those on the SEQRES records." But of course, the SEQRES often contains extra residues not present in the ATOM records. If you chose to mutate any residues, clearly we should write out the mutated sequence. If you added any missing residues, we'll of course write them out. But what if there were missing residues that you chose not to add? Should we list them or not? What if you chose to build a missing loop, but with a different sequence than what was in the original SEQRES?

There's also the issue that matches chains between SEQRES and ATOM records can be difficult. In principle you just look for matching chain IDs. In practice that often doesn't work correctly. And SEQRES doesn't list residue indices, so matching up the sequence of residues in a chain may not be clearly defined. In practice, we resolve both of these by doing a sequence alignment to identify which chains match and how they match.

sukritsingh commented 5 months ago

what if there were missing residues that you chose not to add?

I don't really see a usecase for this except for the termini of structures, to be honest. It would make no sense for a usecase of modeling a biological system to exclude a loop selectively (unless you were mutating the loop later, which would use new SEQRES residues anyways).

That said, my vote would be that SEQRES residues should be preserved if residues aren't added but SEQRES entries were provided. Ultimately this would mean that the SEQRES entries are as complete as possible in a biologically relevant manner.

What if you chose to build a missing loop, but with a different sequence than what was in the original SEQRES?

I think this is a good question and ties well into what I was envisioning:

  1. If a SEQRES entry is provided in a structure, and no mutations are made to the structure, we could just write out the SEQRES that was input
  2. This "preserved" SEQRES entry is what is used to fill in the missing loops/residues. This is already existing behavior; we would just also write out the SEQRES entries.
  3. If a residue is mutated, then the corresponding SEQRES entry would be edited and written out (ie ensuring that the SEQRES matches with the sequence of the structure written out).

SEQRES doesn't list residue indices, so matching up the sequence of residues in a chain may not be clearly defined.

This is a good question, and relevant to point 3 above in what I was envisioning. One potential idea: Assuming the SEQRES for the PDB construct is complete (which is true for PDB entries), wouldn't it be a safe assumption that you know exactly how many atoms you need to traverse in the ATOM records? Each amino acid has a fixed number of atoms, so you can just compute however many atoms you need to traverse to get to the record/amino acid of interest?

Likewise, if you made mutations and rewriting the SEQRES entries, one could simply extract every single unique C $\alpha$ atom and write them to the SEQRES entries in the order they are read?

I think a naive starting point would be, if a sequence is unmutated but simply fixed for missing residues, then SEQRES entries would be preserved and written out, if provided in the input.

peastman commented 5 months ago

wouldn't it be a safe assumption that you know exactly how many atoms you need to traverse in the ATOM records?

Suppose the SEQRES contains two chains: TYR-ALA-GLY and ALA-GLY-GLU. In the ATOM records you find a chain with ALA-GLY. Which one is it?

sukritsingh commented 5 months ago

Touché - never mind on that idea then, lol. I guess you'd have to do a sequence alignment.

Could you expand more on why in practice, matching Chain IDs doesn't work correctly? In chains with heterotrimers like 3AH8, each of the three chains has different chain letters (and would be sequentially consecutive chain IDs). Wouldn't it be reasonable to expect that "correctly provided SEQRES" entries contain all the necessary chain information to match to chain ID?

peastman commented 5 months ago

The one particular file you linked may having correctly matched up chain IDs, but it's not uncommon to find ones that don't.

sukritsingh commented 5 months ago

Ohhh ok so it's an issue where people put up sloppy put together SEQRES entries....that's harder... Ways I see forward off the top of my head (open to other folks contributing their own ways! @jchodera you may have thoughts!):

  1. We have some sort of boolean flag for whether or not SEQRES output should be written out (default is False) - if true then the input should also contain Chain IDs and an error is thrown (This basically assumes some degree of systems knowledge/ file familiarity competence and may be less "accessible" though)
  2. If the sequence alignment approach works for chain ID identification, maybe the simplest solution is to use that to ID the appropriate chain when making modifications? And then all that's left is writing out the SEQRES entries after mutation? It sounds ugly even writing it out, but just trying to spitball ideas
peastman commented 5 months ago

Let's try approaching it from the other direction. What is the goal? What is the problem you are trying to solve? Once we understand that we can consider what's the best solution, which might or might not involve writing SEQRES records.

sukritsingh commented 5 months ago

Sure! My ultimate use case is that I want to be able extract both primary sequence and have structural information in the PDB formatted file. I'd like to have any PDB file act as both the information of record for both an initial topology and the sequence of the construct (which can be parsed and passed to other tools for sequence alignment). This is particularly useful when I'm working with either:

  1. Multiple starting structures across many homologous proteins (kinases are a good example)
  2. Multiple variant constructs (ie clinical variants of a single starting protein) - most often these variants aren't in the PDB and found only in literature/clinics. Having the sequence contained within the relevant PDB file would also make it easy to identify which structures contain which variants.

Right now to extract primary sequence I would either have to traverse the ATOM (doable, but inefficient, and if I use a raw PDB file then ATOM records may be missing), or I have to use the PDB entry and follow links to Uniprot/other links (which is much more manual and I'd rather work online).

I imagine there are some alternative approaches to this but ultimately this makes book keeping across many structures easier, and allows me to select subsets of sequences generating multiple sequence alignments as desired.

To be clear, this is not mission critical! There are alternatives to achieving this information/record keeping - SEQRES preservation across files in and out of PDBFixer just seems like a good long term solution.

TLDR: The reason I think preserving SEQRES records is it becomes makes a PDB file a "one stop shop" for a protein sequence or construct file - useful when wrangling with multiple files/constructs.

jchodera commented 5 months ago

The goal is this: Preserve information in a PDB-compliant manner if available, since downstream processing tools may need it.

Here's my thinking:

The same philosophy could apply to other pieces of header information as well: