Closed danpf closed 5 years ago
I know we don't follow a defined style guide, other than trying to be consistent.
To be consistent, please use two spaces instead of tab for indentation.
Edit: Actually I see 5 files with 2-space indentation and 5 with 4-space indentation. Tie :confused:
I at least made it consistent with the rest of the file.
I also added optional parameter to use chemCompType
in the check. and warning at declaration comments of old is_hetatm function.
I did a bit of cleanup. Mainly because the documentation was non-existing. Please check if what I wrote makes sense.
While trying to make sense of the code, I also saw that the tests didn't really check for any hetatms and I extended the tests accordingly.
Edit: if anyone knows of an mmCIF file which mixes HETATM within polymer chains, please let me know as it might be an interesting test case (not sure though if this can happen at all). The test case added here has 2 polymer chains full of non-hetatms and plenty of other chains with all hetatms (the PDB file has them mixed together but that's irrelevant for the MMTF file).
Finally, I updated the code in the repo to use the new recommended is_hetatm function. There, I also realized that we have plenty of different code that does PDB-like output, but this is out of scope for this PR I would say...
Anyway, if you are ok with my changes, I will merge this tomorrow.
One question: Wouldn't a is_polymer()
function be more useful? Why overload is_hetatm()
? The is_polymer()
function would have one argument and one responsibility less. User code would look like this:
if (is_hetatm(group.chemCompType) || is_polymer(chainIndex, entityList)) {
...
}
@speleo3 Yeah I thought about that too. Couldn't really see a major advantage in having the is_polymer
function though and given the way HETATM is used in mmCIF files, one should always use the "hetatm or !polymer" logic. As a result it seemed good to have an is_hetatm
function which does the "hetatm or !polymer" logic directly.
Or as an alternative suggestion: how about having
bool is_hetatm(chainIndex, entityList, chemCompType) {
return !is_peptide_linking(chemCompType) || !is_polymer(chainIndex, entityList)
}
Here I would rename the old is_hetatm
function and add a new is_polymer
function using the new code. This would remove any confusion of having multiple is_hetatm
functions with conflicting meanings. It would break backward-compatibility though...
I like your alternative suggestion, the responsibility of the individual functions is much clearer.
I'm a strong advocate for backward-compatibility, but in this case I think we're talking about a function which is mostly (only?) used for debugging and demoing? So I wouldn't mind not being backwards compatible. Also, keeping backward-compatibility would be easy, just keep the old overload, print a deprecation warning, and call !is_peptide_linking()
.
additional note regarding debugging/demoing role of this function: In efficient production code, I would pre-index the polymer check (map chainindex -> is-polymer), not iterate over entities and chains for every atom.
Actually looking at the old is_hetatm
function, it doesn't seem to fit my proposed new name (e.g. there are 'D-PEPTIDE LINKING' groups too). At the end of the day, the function does define if a group type is made up of HETATMs. So maybe we should keep the old signature and just clarify the doc. Either way, I will give it a shot and add the is_polymer
function.
For the efficiency: I doubt that there is production usecase where one needs to know HETATM (e.g. see the mmCIF description here: http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx.dic/Items/_atom_site.group_PDB.html )
And since we are being thorough: We do not really map HETATM correctly for modified amino acids. E.g. https://files.rcsb.org/view/5I1P.cif vs https://mmtf.rcsb.org/v1.0/full/5I1P (B3K is HETATM in cif and our logic would not see it as HETATM, the D-peptides in the file are ok)
So in short: I will add the is_polymer
and stop the exercise there unless there is a real usecase for this.
you mean is_polymer
, not is_protein
, right?
Oops yes. I edited my comments accordingly.
thanks for the changes! lgtm
pretty simple stuff here. Makes us more in line with https://github.com/rcsb/mmtf/issues/28
I sadly noticed the previous implementation would fail with '3zqs' because it has a HETATM proline at the end.