uw-ipd / tmol

TMol
Apache License 2.0

Aleaverfay/canonical ordering from chemdb #281

Closed aleaverfay closed 8 months ago

aleaverfay commented 11 months ago

This PR introduces an interface layer between tmol and other molecular modeling packages so that coordinates generated by one can be translated into a meaningful representation for the other. The principal mediator of this interface is the CanonicalOrdering class. This class is constructed with a set of allowable RefinedResidueType (RRT) objects and groups them by their "name3." The CanonicalOrdering object then collects the names of all atoms across all RRTs that share a name3 and assigns an order to those atoms (e.g., for "ALA", "N" might be atom 0 and "CA" might be atom 1). This allows the user to create a coords tensor of shape [n-poses x max-n-residues x max-n-canonical-atoms x 3] and to populate that tensor in a way that tmol will be able to interpret. If the user wants to provide an "HG" for SER (recommended!), then the CanonicalOrdering will tell the user where to put the HG's coordinate in that coords tensor.
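The grouping idea can be sketched in plain Python. This is only an illustration of the concept, not tmol's implementation: the tuples below are hypothetical stand-ins for RefinedResidueType objects, the atom lists are abbreviated, and `build_canonical_ordering` is an invented helper name.

```python
from collections import defaultdict

# Hypothetical stand-ins for RefinedResidueType objects:
# (variant name, name3, atom names). Atom lists are abbreviated.
residue_types = [
    ("ALA", "ALA", ["N", "CA", "C", "O", "CB", "H", "HA"]),
    ("ALA:nterm", "ALA", ["N", "CA", "C", "O", "CB", "1H", "2H", "3H", "HA"]),
    ("SER", "SER", ["N", "CA", "C", "O", "CB", "OG", "H", "HA", "HG"]),
]


def build_canonical_ordering(residue_types):
    """Group residue types by name3 and give every atom name a fixed slot."""
    atoms_by_name3 = defaultdict(list)
    for _variant, name3, atom_names in residue_types:
        seen = atoms_by_name3[name3]
        for atom in atom_names:
            if atom not in seen:  # keep first-seen order, skip duplicates
                seen.append(atom)
    # Integer index for each name3, in first-seen order.
    name3_index = {name3: i for i, name3 in enumerate(atoms_by_name3)}
    # Per-name3 map from atom name to its slot in the coords tensor.
    atom_index = {
        name3: {atom: i for i, atom in enumerate(atoms)}
        for name3, atoms in atoms_by_name3.items()
    }
    max_n_canonical_atoms = max(len(atoms) for atoms in atoms_by_name3.values())
    return name3_index, atom_index, max_n_canonical_atoms


name3_index, atom_index, max_n_canonical_atoms = build_canonical_ordering(residue_types)
# Where would the HG coordinate of a SER go in the coords tensor?
hg_slot = atom_index["SER"]["HG"]
```

In this sketch the union of atoms over all variants of a name3 determines how many slots that name3 needs, and the largest such union sets the size of the max-n-canonical-atoms dimension.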

With the CanonicalOrdering in hand, the user is able to construct an intermediate representation that tmol will use to construct its PoseStack object; this intermediate representation is called the "canonical form," a dictionary that will contain at least three things:

  1. "res_types": an [n-poses x max-n-residues] tensor of int32s describing the name3 class for each residue in each Pose (with sentinel values of -1 designating place-holder residues); the integer value here refers to the index given by the CanonicalOrdering for the desired name3.
  2. "chain_id": an [n-poses x max-n-residues] tensor describing which chain each residue belongs to; if residue i and residue i+1 are labeled as part of the same chain, and they are both polymeric residue types, then a chemical bond between their "down" and "up" connection points will be included.
  3. "coords": an [n-poses x max-n-residues x max-n-canonical-atoms x 3] tensor describing the coordinate of each atom; any atom's coordinate that is not being provided should be given as NaN, and tmol will build the coordinates it can and complain loudly if a coordinate it requires has not been provided.
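As a rough picture of the three entries above, the sketch below builds a tiny canonical form for one pose with two real residues and one placeholder. It is an assumption-laden illustration: nested Python lists stand in for torch tensors, the `name3_index` and `atom_index` values are made up rather than taken from a real CanonicalOrdering, and the chain value used for the placeholder slot is arbitrary.

```python
import math

n_poses, max_n_residues, max_n_canonical_atoms = 1, 3, 4

# Hypothetical indices; a real CanonicalOrdering would supply these.
name3_index = {"ALA": 0, "SER": 1}
atom_index = {
    "ALA": {"N": 0, "CA": 1, "C": 2, "O": 3},
    "SER": {"N": 0, "CA": 1, "C": 2, "O": 3},
}

# One pose: ALA, SER, then a placeholder residue marked with -1.
res_types = [[name3_index["ALA"], name3_index["SER"], -1]]
# Both real residues are on chain 0; the placeholder's value is arbitrary here.
chain_id = [[0, 0, 0]]

# Start with every coordinate "not provided" (NaN), then fill in what is known.
coords = [
    [
        [[math.nan] * 3 for _ in range(max_n_canonical_atoms)]
        for _ in range(max_n_residues)
    ]
    for _ in range(n_poses)
]


def set_atom(pose, res, name3, atom, xyz):
    """Write one atom's coordinate into its canonical slot."""
    coords[pose][res][atom_index[name3][atom]] = list(xyz)


set_atom(0, 0, "ALA", "N", (0.0, 0.0, 0.0))
set_atom(0, 0, "ALA", "CA", (1.458, 0.0, 0.0))
set_atom(0, 1, "SER", "N", (3.3, 1.5, 0.0))

canonical_form = {"res_types": res_types, "chain_id": chain_id, "coords": coords}
```

Atoms that were never set, including every slot of the placeholder residue, remain NaN.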

This "canonical form" representation is intended to be a useful, stable intermediate representation for structures so that they may be serialized to disk (e.g., using torch.save). Stability here requires that a CanonicalOrdering constructed today, which gives meaning to the indices in the "res_types" tensor and to the positions within the "coords" tensor, be guaranteed identical to the CanonicalOrdering constructed six months from now, when tmol supports new exotic chemical types. The CanonicalOrdering therefore lets the user control exactly which residue types are in its purview.
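To make the stability requirement concrete, here is a hypothetical safeguard a user could keep alongside a serialized canonical form: a fingerprint of the ordering that was used to write it. Nothing here is part of tmol; `ordering_fingerprint` and `check_compatible` are invented names, and the index dictionaries are toy data.

```python
import hashlib
import json


def ordering_fingerprint(name3_index, atom_index):
    """Hash the index assignments so two orderings can be compared cheaply."""
    payload = json.dumps(
        {"name3_index": name3_index, "atom_index": atom_index}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def check_compatible(saved, name3_index, atom_index):
    """True if the ordering at load time matches the one used at save time."""
    return saved["fingerprint"] == ordering_fingerprint(name3_index, atom_index)


# Ordering used when the canonical form was written to disk.
today = (
    {"ALA": 0, "SER": 1},
    {"ALA": {"N": 0, "CA": 1}, "SER": {"N": 0, "CA": 1, "HG": 2}},
)
saved = {"fingerprint": ordering_fingerprint(*today)}

# Later: the same residue-type set gives the same indices ...
later_same = (
    {"ALA": 0, "SER": 1},
    {"ALA": {"N": 0, "CA": 1}, "SER": {"N": 0, "CA": 1, "HG": 2}},
)
# ... but an ordering built from a larger set may shift the SER index.
later_shifted = (
    {"ALA": 0, "NEW": 1, "SER": 2},
    {"ALA": {"N": 0, "CA": 1}, "NEW": {"X": 0}, "SER": {"N": 0, "CA": 1, "HG": 2}},
)

ok = check_compatible(saved, *later_same)
stale = check_compatible(saved, *later_shifted)
```

The second comparison fails because "SER" no longer maps to index 1, which is exactly the situation in which a previously saved "res_types" tensor would be misread.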

From a "canonical form," the API function pose_stack_from_canonical_form can be invoked. This function's arguments change significantly in this PR. In particular, the argument "atom_is_present" is no longer accepted: if an atom is present, its coordinate will be non-NaN, and if an atom is absent, its coordinate will be NaN. The function now takes two additional required arguments: a CanonicalOrdering object and a PackedBlockTypes object. This PR also makes the other arguments to pose_stack_from_canonical_form keyword-only. The function that "deconstructs" a PoseStack back into its "canonical form" returns a dictionary describing the rest of the arguments to pose_stack_from_canonical_form, including the "don't add termini variants to certain residues / don't declare chemical bonds between certain residue pairs" argument "res_not_connected" and the "here's a list of the cysteine residues that are disulfide-bonded" argument "disulfides." That way a PoseStack can be deconstructed to a canonical form object and then restored to exactly that same PoseStack without having to provide extra arguments.
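The NaN convention that replaces "atom_is_present" can be illustrated with a few lines of plain Python. This is a sketch of the rule as described above, not the code tmol uses; the helper name `atom_is_present` is reused here purely for illustration, and nested lists stand in for a tensor.

```python
import math


def atom_is_present(residue_coords):
    """Return True for each atom slot whose x, y, and z are all non-NaN."""
    return [not any(math.isnan(c) for c in xyz) for xyz in residue_coords]


# Three atom slots for one residue: provided, absent, provided.
residue_coords = [
    [1.0, 2.0, 3.0],
    [math.nan, math.nan, math.nan],
    [0.0, 0.0, 0.0],
]
mask = atom_is_present(residue_coords)
```

Because presence is recoverable from the coordinates themselves, a separate mask argument carries no extra information, which is the rationale given for dropping it.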

This PR introduces several API-level functions for converting between popular NN atom ordering conventions, in particular, OpenFold and RosettaFold2. These API functions allow direct creation of PoseStack objects from the outputs generated by these NNs, and also allow creation of "canonical form" dictionaries and the stable CanonicalOrderings that make these dictionaries interpretable.
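Conceptually, converting from another package's atom ordering amounts to scattering each source atom into its canonical slot. The sketch below shows that remapping for a single residue under stated assumptions: `remap_residue` is an invented helper, the source atom order is illustrative and not taken from OpenFold or RosettaFold2, and nested lists stand in for tensors.

```python
import math


def remap_residue(src_atom_names, src_coords, canonical_atom_index, n_slots):
    """Copy coordinates from a source ordering into canonical slots.

    Slots with no matching source atom stay NaN; source atoms that the
    canonical ordering does not know about are skipped.
    """
    out = [[math.nan] * 3 for _ in range(n_slots)]
    for name, xyz in zip(src_atom_names, src_coords):
        slot = canonical_atom_index.get(name)
        if slot is not None:
            out[slot] = list(xyz)
    return out


# Hypothetical canonical slots for ALA (abbreviated).
canonical_ala = {"N": 0, "CA": 1, "C": 2, "O": 3, "CB": 4, "H": 5}

# Illustrative source ordering that differs from the canonical one.
src_names = ["N", "CA", "C", "CB", "O"]
src_xyz = [
    (0.0, 0.0, 0.0),
    (1.5, 0.0, 0.0),
    (2.0, 1.4, 0.0),
    (2.0, -0.8, 1.2),
    (3.2, 1.6, 0.0),
]

ala_coords = remap_residue(src_names, src_xyz, canonical_ala, n_slots=6)
```

Here "CB" and "O" swap positions relative to the source, and the hydrogen slot stays NaN because the source did not provide it.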

codecov[bot] commented 9 months ago

Codecov Report

Attention: 17 lines in your changes are missing coverage. Please review.

Comparison is base (ace29a8) 94.93% compared to head (5e6c2e5) 95.26%. Report is 7 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| tmol/io/details/build_missing_leaf_atoms.py | 96.26% | 5 Missing :warning: |
| tmol/io/canonical_ordering.py | 98.25% | 3 Missing :warning: |
| tmol/chemical/patched_chemdb.py | 98.19% | 2 Missing :warning: |
| tmol/score/elec/params.py | 81.81% | 2 Missing :warning: |
| tmol/io/details/select_from_canonical.py | 99.47% | 1 Missing :warning: |
| tmol/io/pose_stack_construction.py | 94.44% | 1 Missing :warning: |
| .../tests/io/details/test_build_missing_leaf_atoms.py | 98.93% | 1 Missing :warning: |
| tmol/tests/io/test_canonical_form.py | 97.77% | 1 Missing :warning: |
| tmol/tests/pack/sim_anneal/test_sim_anneal.py | 0.00% | 1 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #281      +/-   ##
==========================================
+ Coverage   94.93%   95.26%   +0.32%
==========================================
  Files         374      382       +8
  Lines       23883    25534    +1651
==========================================
+ Hits        22673    24324    +1651
  Misses       1210     1210
```

| [Flag](https://app.codecov.io/gh/uw-ipd/tmol/pull/281/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uw-ipd) | Coverage Δ | |
|---|---|---|
| [_shrug_Testing_CPU](https://app.codecov.io/gh/uw-ipd/tmol/pull/281/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uw-ipd) | `90.79% <98.31%> (+0.76%)` | :arrow_up: |
| [_shrug_Testing_CPU_debug_w_o_jit](https://app.codecov.io/gh/uw-ipd/tmol/pull/281/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uw-ipd) | `92.48% <98.31%> (?)` | |
| [_shrug_Testing_CPU_w_o_jit](https://app.codecov.io/gh/uw-ipd/tmol/pull/281/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uw-ipd) | `?` | |
| [_shrug_Testing_CUDA](https://app.codecov.io/gh/uw-ipd/tmol/pull/281/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uw-ipd) | `92.86% <99.19%> (+0.48%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uw-ipd#carryforward-flags-in-the-pull-request-comment) to find out more.
