Open sukritsingh opened 5 months ago
It's an interesting question. It's certainly possible to write it out. The question is what exactly we should write?
The spec is a bit confusing about this. It says, "The residues presented in the ATOM records must agree with those on the SEQRES records." But of course, the SEQRES often contains extra residues not present in the ATOM records. If you chose to mutate any residues, clearly we should write out the mutated sequence. If you added any missing residues, we'll of course write them out. But what if there were missing residues that you chose not to add? Should we list them or not? What if you chose to build a missing loop, but with a different sequence than what was in the original SEQRES?
There's also the issue that matching chains between SEQRES and ATOM records can be difficult. In principle you just look for matching chain IDs. In practice that often doesn't work correctly. And SEQRES doesn't list residue indices, so matching up the sequence of residues in a chain may not be clearly defined. In practice, we resolve both of these by doing a sequence alignment to identify which chains match and how they match.
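To illustrate the "how they match" part, here is a minimal sketch of mapping the residues seen in the ATOM records back onto SEQRES positions. It uses Python's standard `difflib.SequenceMatcher` as a stand-in for a real sequence aligner; it is not PDBFixer's actual implementation, and the function name and example sequences are made up for illustration.

```python
from difflib import SequenceMatcher

def missing_positions(seqres, observed):
    """Return 0-based SEQRES indices with no counterpart in the observed residues."""
    # autojunk=False keeps difflib from discarding frequent residues in long chains.
    matcher = SequenceMatcher(None, seqres, observed, autojunk=False)
    covered = set()
    for block in matcher.get_matching_blocks():
        covered.update(range(block.a, block.a + block.size))
    return [i for i in range(len(seqres)) if i not in covered]

# Hypothetical chain: the N-terminal MET and an ALA-LYS loop are unresolved.
seqres = ["MET", "GLY", "SER", "ALA", "LYS", "THR", "GLU"]
observed = ["GLY", "SER", "THR", "GLU"]
print(missing_positions(seqres, observed))  # [0, 3, 4]
```

A real aligner with gap penalties and a substitution matrix would handle mutated or renamed residues better than this exact-match approach.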
what if there were missing residues that you chose not to add?
I don't really see a use case for this except at the termini of structures, to be honest. When modeling a biological system, it would make no sense to exclude a loop selectively (unless you were mutating the loop later, which would use new SEQRES residues anyway).
That said, my vote would be that SEQRES residues should be preserved if residues aren't added but SEQRES entries were provided. Ultimately this would mean that the SEQRES entries are as complete as possible in a biologically relevant manner.
What if you chose to build a missing loop, but with a different sequence than what was in the original SEQRES?
I think this is a good question and ties well into what I was envisioning:
SEQRES doesn't list residue indices, so matching up the sequence of residues in a chain may not be clearly defined.
This is a good question, and relevant to point 3 above in what I was envisioning. One potential idea: Assuming the SEQRES for the PDB construct is complete (which is true for PDB entries), wouldn't it be a safe assumption that you know exactly how many atoms you need to traverse in the ATOM records? Each amino acid has a fixed number of atoms, so you can just compute however many atoms you need to traverse to get to the record/amino acid of interest?
Likewise, if you made mutations and are rewriting the SEQRES entries, one could simply extract every unique C$\alpha$ atom and write the corresponding residues to the SEQRES entries in the order they are read?
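As a rough sketch of that idea, the snippet below walks fixed-column ATOM/HETATM lines and records one residue name per unique alpha carbon, in file order. The column positions follow the PDB format; the function names and demo lines are hypothetical. Note that this only recovers residues that are actually resolved, so it cannot reproduce SEQRES entries for missing residues.

```python
def ca_sequence(pdb_lines):
    """Collect residue names per chain from alpha-carbon atoms, in file order."""
    chains = {}
    seen = set()
    for line in pdb_lines:
        if not line.startswith(("ATOM  ", "HETATM")):
            continue
        # " CA " is an alpha carbon; a calcium ion would be written as "CA  ".
        if line[12:16] != " CA ":
            continue
        key = (line[21], line[22:26], line[26])  # chain ID, resSeq, insertion code
        if key in seen:  # skip alternate locations of the same residue
            continue
        seen.add(key)
        chains.setdefault(line[21], []).append(line[17:20].strip())
    return chains

def atom_line(serial, name, res_name, chain, res_seq, alt_loc=" "):
    """Build a fixed-column ATOM line (hypothetical helper for this demo)."""
    return (f"ATOM  {serial:5d} {name:<4}{alt_loc}{res_name:>3} {chain}{res_seq:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}")

demo = [
    atom_line(1, " N  ", "MET", "A", 1),
    atom_line(2, " CA ", "MET", "A", 1),
    atom_line(3, " CA ", "GLY", "A", 2, "A"),
    atom_line(4, " CA ", "GLY", "A", 2, "B"),
    atom_line(5, " CA ", "SER", "B", 1),
]
print(ca_sequence(demo))  # {'A': ['MET', 'GLY'], 'B': ['SER']}
```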
I think a naive starting point would be, if a sequence is unmutated but simply fixed for missing residues, then SEQRES entries would be preserved and written out, if provided in the input.
wouldn't it be a safe assumption that you know exactly how many atoms you need to traverse in the ATOM records?
Suppose the SEQRES contains two chains: TYR-ALA-GLY and ALA-GLY-GLU. In the ATOM records you find a chain with ALA-GLY. Which one is it?
Touché - never mind on that idea then, lol. I guess you'd have to do a sequence alignment.
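The ambiguity in that example can be shown in a few lines. This is only a sketch: the chain IDs `A` and `B` and the function names are assumptions added for illustration.

```python
def contains_run(sequence, fragment):
    """True if fragment occurs as a contiguous run inside sequence."""
    n = len(fragment)
    return any(sequence[i:i + n] == fragment for i in range(len(sequence) - n + 1))

def candidate_chains(seqres_chains, fragment):
    """Return the SEQRES chain IDs that could have produced the observed fragment."""
    return [cid for cid, seq in seqres_chains.items() if contains_run(seq, fragment)]

# The two SEQRES chains from the example, with hypothetical chain IDs.
seqres_chains = {"A": ["TYR", "ALA", "GLY"], "B": ["ALA", "GLY", "GLU"]}
print(candidate_chains(seqres_chains, ["ALA", "GLY"]))  # ['A', 'B']
```

Both chains are valid candidates, so the sequence alone cannot decide which SEQRES chain the ATOM records belong to.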
Could you expand more on why in practice, matching Chain IDs doesn't work correctly? In chains with heterotrimers like 3AH8, each of the three chains has different chain letters (and would be sequentially consecutive chain IDs). Wouldn't it be reasonable to expect that "correctly provided SEQRES" entries contain all the necessary chain information to match to chain ID?
The one particular file you linked may have correctly matched up chain IDs, but it's not uncommon to find ones that don't.
Ohhh ok so it's an issue where people put up sloppily put-together SEQRES entries....that's harder... Ways I see forward off the top of my head (open to other folks contributing their own ways! @jchodera you may have thoughts!):
False) - if true then the input should also contain Chain IDs and an error is thrown (This basically assumes some degree of systems knowledge/file familiarity competence and may be less "accessible" though)
Let's try approaching it from the other direction. What is the goal? What is the problem you are trying to solve? Once we understand that we can consider what's the best solution, which might or might not involve writing SEQRES records.
Sure! My ultimate use case is that I want to be able to extract the primary sequence and have structural information in the same PDB-formatted file. I'd like any PDB file to act as the information of record for both an initial topology and the sequence of the construct (which can be parsed and passed to other tools for sequence alignment). This is particularly useful when I'm working with either:
Right now, to extract the primary sequence I would either have to traverse the ATOM records (doable, but inefficient, and if I use a raw PDB file then ATOM records may be missing), or I have to use the PDB entry and follow links to Uniprot/other links (which is much more manual and I'd rather work online).
I imagine there are some alternative approaches to this, but ultimately this makes bookkeeping across many structures easier, and allows me to select subsets of sequences for generating multiple sequence alignments as desired.
To be clear, this is not mission critical! There are alternatives to achieving this information/record keeping - SEQRES preservation across files in and out of PDBFixer just seems like a good long term solution.
TLDR: The reason I think preserving SEQRES records is worthwhile is that it makes a PDB file a "one stop shop" for a protein sequence or construct file - useful when wrangling multiple files/constructs.
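For context, this is roughly what the "one stop shop" use looks like when SEQRES records are present: a short sketch that reads SEQRES lines and returns a one-letter sequence per chain. The function name and demo lines are hypothetical, and mapping every non-standard residue to `X` is a simplifying assumption.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

def read_seqres(pdb_lines):
    """Return {chain ID: one-letter sequence} from the SEQRES records."""
    chains = {}
    for line in pdb_lines:
        if line.startswith("SEQRES"):
            # Column 12 holds the chain ID; residue names occupy columns 20-70.
            chains.setdefault(line[11], []).extend(line[19:70].split())
    # Assumption: anything outside the 20 standard residues becomes "X".
    return {cid: "".join(THREE_TO_ONE.get(name, "X") for name in names)
            for cid, names in chains.items()}

demo = [
    "SEQRES   1 A    3  MET GLY SER",
    "SEQRES   1 B    2  ALA MSE",
]
print(read_seqres(demo))  # {'A': 'MGS', 'B': 'AX'}
```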
The goal is this: Preserve information in a PDB-compliant manner if available, since downstream processing tools may need it.
Here's my thinking:
SEQRES header, it should store and refer to that block. This information is essential if we need to model missing residues of any kind, since this is where that information needs to come from.
SEQRES and ATOM/HETATM records, there should be an Exception. This is a non-compliant PDB file we cannot process.
SEQRES info should be updated if it exists
SEQRES info exists when writing the file, we write the SEQRES header.
The same philosophy could apply to other pieces of header information as well:
Exception
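For the write-out step proposed above, a minimal sketch of formatting a chain's residue names as fixed-column SEQRES records (13 residues per line, as in the PDB format) could look like the following. The function name is hypothetical and this is not an existing PDBFixer or OpenMM API; trailing padding to 80 columns is omitted.

```python
def format_seqres(chain_id, residues, per_line=13):
    """Format one chain's residue names as fixed-column SEQRES records."""
    lines = []
    for serial, start in enumerate(range(0, len(residues), per_line), start=1):
        names = " ".join(f"{name:>3}" for name in residues[start:start + per_line])
        # Columns: serial number 8-10, chain ID 12, residue count 14-17, names from 20.
        lines.append(f"SEQRES {serial:3d} {chain_id} {len(residues):4d}  {names}")
    return lines

for record in format_seqres("A", ["MET", "GLY", "SER"]):
    print(record)  # SEQRES   1 A    3  MET GLY SER
```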
I recently noticed that PDBFixer does not write out SEQRES entries in the output of a PDB file when the input PDB file does have SEQRES entries. Should this be expected behavior, and is there a way we can preserve those SEQRES entries?
Example described below: If you take a fresh structure file directly from the PDB (like 2WGJ - MET kinase), you'll see that the PDB file has SEQRES entries (full lines omitted for clarity):
If I run it through the following code:
Then the top of the PDBFile output pdbfixer.pdb no longer contains the SEQRES entries:
None of this is mission critical but I was wondering if there's a way to ensure PDBFixer preserves the SEQRES information when fixing residues? Seems like that should be something PDBFixer should be able to do. Tagging @jchodera who was originally discussing this with me and he suggested I open an issue thread.