Closed MauriceKarrenbrock closed 4 years ago
The fix for the HETATM issue is in https://github.com/openmm/openmm/pull/2572. It doesn't try to remember what was in the input file. Instead it selects the label based on the definition of heterogens.
The chain ID part is in https://github.com/openmm/openmm/pull/2575.
Hello, thank you very much for your availability ! The HETATM part is perfect!
But I am not sure about the chain part. In fact now the _atom_site.auth_asym_id
has been overwritten by the _atom_site.label_asym_id
when actually both information need to be kept because they are both important.
In a protein we might have more chains and every chain might have a polypeptide chain, some metallic ions, some organic ligands and some solvent molecules. And it is usually necessary to be able to both distinguish between different chains and between different molecules inside chains.
Before https://github.com/openmm/openmm/pull/2575 you could distinguish between chains but not between different molecules inside a chain, and after it you can distinguish between molecules but without knowing from which chain they came from.
For example in the 5aol mmCIF file you find 2 chains A and B. In chain A there is
_atom_site.label_asym_id
= A _atom_site.auth_asym_id
= A)_atom_site.label_asym_id
= C _atom_site.auth_asym_id
= A)_atom_site.label_asym_id
= D _atom_site.auth_asym_id
= A)_atom_site.label_asym_id
= F _atom_site.auth_asym_id
= A).In chain B there is
_atom_site.label_asym_id
= B _atom_site.auth_asym_id
= B)_atom_site.label_asym_id
= E _atom_site.auth_asym_id
= B)_atom_site.label_asym_id
= G _atom_site.auth_asym_id
= B)But after pdbfixer all the _atom_site.auth_asym_id
are equal to _atom_site.label_asym_id
making it impossible to know from which chain this molecule comes from.
So if I want to make a molecular dynamics on chain A to calculate the binding affinity of UFV with the polypeptide chain I need to be able to select all the things with _atom_site.auth_asym_id
= A and to create a correct topology file I need to know where a molecule ends and a new one begins (by checking _atom_site.label_asym_id
), this is very important for example in gromacs that relies heavily on what is written in the structure file.
But with https://github.com/openmm/openmm/pull/2575 I get an output file that has a complete information on where a molecule ends, but I don't know anymore things like which ZN ion is binding with which polypeptide chain, with which chain UFV is binding ecc.
So both information should be kept and written in the output file because they are both vital for the rest of the process to get a MD simulation done.
That isn't actually what the fields mean. Here's how the spec defines _atom_site.auth_asym_id
:
An alternative identifier for _atom_site.label_asym_id that may be provided by an author in order to match the identification used in the publication that describes the structure.
You may have found a particular file that happens to use the fields in the way you described, but it's totally non-standardized. Authors are free to use them in any way they want.
I see what you mean.
And yes the _atom_site.label_asym_id
definition that can be found on the spec is exactly what is happening now.
It is a pointer to _struct_asym.id
The value of _struct_asym.id must uniquely identify a record in the STRUCTASYM list. Note that this item need not be a number; it can be any unique identifier. loop _struct_asym.id _struct_asym.entity_id _struct_asym.details A 1 'one monomer of the dimeric enzyme' B 1 'one monomer of the dimeric enzyme' C 2 'one partially occupied position for the inhibitor' D 2 'one partially occupied position for the inhibitor'
But the problem is that in any case we still need something that gives us an information about the "pdb chain" value, telling us which ligand (metallic ion, solvent etc) interacts with which polypeptide chain. And something that takes the place of the "pdb TER line" whose definition is:
The TER records occur in the coordinate section of the entry, and indicate the last residue presented for each polypeptide and/or nucleic acid chain for which there are determined coordinates. For proteins, the residue defined on the TER record is the carboxy-terminal residue; for nucleic acids it is the 3'-terminal residue.
So I guess that, probably in an unofficial way, the _atom_site.auth_asym_id
has become the way authors do tell the "reader" about the "pdb chain" value. This is of course 100% empirical but I have been using tons of mmCIF files from the wwpdb in the last months and I never found one that didn't follow this pattern. (The only strange thing is that sometimes the _atom_site.auth_asym_id
doesn't start from A but from some other random letter).
And actually if you check the pdb to mmcif table you will see that _atom_site.auth_asym_id
is considered as the Strand_ID in pdb files, and the Strand_ID should be the "pdb chain" value (if I'm not wrong). While there is no correspondence for _atom_site.label_asym_id
.
So I guess that using _atom_site.auth_asym_id
as "pdb chain" value could be considered as an "almost standard" way of solving the problem.
In any case I am afraid that there will never be a clean solution to this thing as the standard is never completely followed and isn't always easy to interpret but to do MD simulations it is necessary to have both "pdb chain" like and "pdb TER line" like information.
Hello,
pdbfixer version : latest version on github openmm version : omnia-devel git_revision
36b6caa67e5f485023ec12d40e12f1700e19f618
If a mmCIF with both ATOMs and HETATMs is parsed and fixed all the HETATMs become ATOMs and the _atom_site.label_asym_id is overwritten by _atom_site.auth_asym_id loosing any information about when a molecule ends and another one begins in the same chain (like if TER lines were missing in a PDB file).
both the mmCIF files are atached as TXT files 2gz7_repaired.txt 2gz7.txt
Thank you very much in advance and have a nice day!