rcsb / mmtf-cpp

The pure C++ implementation of the MMTF API, decoder and encoder.
MIT License
21 stars 24 forks source link

Add alternative is_hetatm method #26

Closed danpf closed 5 years ago

danpf commented 5 years ago

pretty simple stuff here. Makes us more in line with https://github.com/rcsb/mmtf/issues/28

I sadly noticed the previous implementation would fail with '3zqs' because it has a HETATM proline at the end.

speleo3 commented 5 years ago

I know we don't follow a defined style guide, other than trying to be consistent.

To be consistent, please use two spaces instead of tab for indentation.

Edit: Actually I see 5 files with 2-space indentation and 5 with 4-space indentation. Tie :confused:

danpf commented 5 years ago

I at least made it consistent with the rest of the file.

I also added optional parameter to use chemCompType in the check. and warning at declaration comments of old is_hetatm function.

gtauriello commented 5 years ago

I did a bit of cleanup. Mainly because the documentation was non-existing. Please check if what I wrote makes sense.

While trying to make sense of the code, I also saw that the tests didn't really check for any hetatms and I extended the tests accordingly.

Edit: if anyone knows of an mmCIF file which mixes HETATM within polymer chains, please let me know as it might be an interesting test case (not sure though if this can happen at all). The test case added here has 2 polymer chains full of non-hetatms and plenty of other chains with all hetatms (the PDB file has them mixed together but that's irrelevant for the MMTF file).

Finally, I updated the code in the repo to use the new recommended is_hetatm function. There, I also realized that we have plenty of different code that does PDB-like output, but this is out of scope for this PR I would say...

Anyway, if you are ok with my changes, I will merge this tomorrow.

speleo3 commented 5 years ago

One question: Wouldn't a is_polymer() function be more useful? Why overload is_hetatm()? The is_polymer() function would have one argument and one responsibility less. User code would look like this:

    if (is_hetatm(group.chemCompType) || is_polymer(chainIndex, entityList)) {
        ...
    }
gtauriello commented 5 years ago

@speleo3 Yeah I thought about that too. Couldn't really see a major advantage in having the is_polymer function though and given the way HETATM is used in mmCIF files, one should always use the "hetatm or !polymer" logic. As a result it seemed good to have an is_hetatm function which does the "hetatm or !polymer" logic directly.

Or as an alternative suggestion: how about having

bool is_hetatm(chainIndex, entityList, chemCompType) {
  return !is_peptide_linking(chemCompType) || !is_polymer(chainIndex, entityList)
}

Here I would rename the old is_hetatm function and add a new is_polymer function using the new code. This would remove any confusion of having multiple is_hetatm functions with conflicting meanings. It would break backward-compatibility though...

speleo3 commented 5 years ago

I like your alternative suggestion, the responsibility of the individual functions is much clearer.

I'm a strong advocate for backward-compatibility, but in this case I think we're talking about a function which is mostly (only?) used for debugging and demoing? So I wouldn't mind not being backwards compatible. Also, keeping backward-compatibility would be easy, just keep the old overload, print a deprecation warning, and call !is_peptide_linking().

speleo3 commented 5 years ago

additional note regarding debugging/demoing role of this function: In efficient production code, I would pre-index the polymer check (map chainindex -> is-polymer), not iterate over entities and chains for every atom.

gtauriello commented 5 years ago

Actually looking at the old is_hetatm function, it doesn't seem to fit my proposed new name (e.g. there are 'D-PEPTIDE LINKING' groups too). At the end of the day, the function does define if a group type is made up of HETATMs. So maybe we should keep the old signature and just clarify the doc. Either way, I will give it a shot and add the is_polymer function.

For the efficiency: I doubt that there is production usecase where one needs to know HETATM (e.g. see the mmCIF description here: http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx.dic/Items/_atom_site.group_PDB.html )

gtauriello commented 5 years ago

And since we are being thorough: We do not really map HETATM correctly for modified amino acids. E.g. https://files.rcsb.org/view/5I1P.cif vs https://mmtf.rcsb.org/v1.0/full/5I1P (B3K is HETATM in cif and our logic would not see it as HETATM, the D-peptides in the file are ok)

So in short: I will add the is_polymer and stop the exercise there unless there is a real usecase for this.

speleo3 commented 5 years ago

you mean is_polymer, not is_protein, right?

gtauriello commented 5 years ago

Oops yes. I edited my comments accordingly.

danpf commented 5 years ago

thanks for the changes! lgtm