jonathanking / sidechainnet

An all-atom protein structure dataset for machine learning.
BSD 3-Clause "New" or "Revised" License

Merge Dataset improvements into Create Custom #27

Closed jonathanking closed 3 years ago

jonathanking commented 3 years ago

Description

This PR makes several improvements to SidechainNet: it adds two new entries to each dataset and improves both the handling of modified residues and PDB file generation.

Summary of changes

  1. Add the following entries to the SidechainNet datasets:
    • mod
      • Contains a 1 or 0 for every residue in a protein, with 1 marking residues that were slightly modified during SidechainNet's construction. For example, selenomethionine is a modified residue. Rather than excluding it from SidechainNet, we "standardize" it by regenerating its coordinates from its angles as if the residue were a methionine. This ensures that bond lengths and angles are consistent, even though the real residue may not exactly match the residue we are replacing it with. This procedure is currently implemented using the amino acid reassignments specified in ALLOWED_NONSTD_RESIDUES (see below).
      • This data feature is also accessible when using SidechainNet's custom PyTorch dataloaders and the Batch namedtuple objects that they yield during training. For clarity, the data is accessible via the is_modified attribute (i.e. batch.is_modified, which returns a batch-padded tensor with each entry being the corresponding mod vector from the dataset).
    • ums (stands for UnModified Sequence)
      • A string representing each protein's unmodified sequence. This is similar to the seq entry, but ums may contain amino acids that are not allowed in SidechainNet. Because there are potentially many different amino acids in this field, we must represent the sequence as a series of 3-letter amino acid codes separated by spaces, instead of the 1-letter codes used in seq.
  2. StructureBuilder.to_pdb() (and its underlying data structure, PdbBuilder) will now also generate SEQRES records for each constructed PDB file. This allows PDB file visualization programs like PyMOL to be aware of where missing residues are located. Note that this still only supports the length-20 standard amino acid vocabulary used by SidechainNet.
  3. Add StructureBuilder.to_pdbstr().
  4. Improve imports, minor textual edits.
  5. Rename batch.ress to batch.resolutions.
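To illustrate item 2, here is a minimal sketch of how SEQRES records can be written from a 1-letter sequence. It follows the column layout from the PDB file format specification (13 residues per record) and is not taken from PdbBuilder; the helper name seqres_records and the ONE_TO_THREE table are assumptions made for this example, and real PDB files pad each record to 80 columns, which is omitted here.

```python
# Hypothetical helper, not part of SidechainNet's API.
ONE_TO_THREE = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}


def seqres_records(seq, chain_id="A", per_line=13):
    """Return SEQRES lines for a 1-letter sequence of standard residues."""
    names = [ONE_TO_THREE[aa] for aa in seq]  # KeyError on non-standard codes
    lines = []
    for start in range(0, len(names), per_line):
        chunk = names[start:start + per_line]
        serial = start // per_line + 1
        # Columns: 1-6 record name, 8-10 serial number, 12 chain ID,
        # 14-17 total residue count, residue names from column 20 onward.
        lines.append(
            f"SEQRES {serial:>3} {chain_id} {len(names):>4}  {' '.join(chunk)}"
        )
    return lines


if __name__ == "__main__":
    print("\n".join(seqres_records("GIVEQCCTSICSLYQLENYCN")))
```

Because the SEQRES block lists the full sequence, a viewer can compare it with the residues that actually have ATOM records and infer where coordinates are missing.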
ALLOWED_NONSTD_RESIDUES = {
    "ASX": "ASP",
    "GLX": "GLU",
    "CSO": "CYS",
    "HIP": "HIS",
    "HSD": "HIS",
    "HSE": "HIS",
    "HSP": "HIS",
    "MSE": "MET",
    "SEC": "CYS",
    "SEP": "SER",
    "TPO": "THR",
    "PTR": "TYR",
    "XLE": "LEU",
    "4FB": "PRO",
    "MLY": "LYS",  # N-dimethyl-lysine
    "AIB": "ALA",  # alpha-methyl-alanine, not included during generation on 1/23/22
    "MK8": "MET"   # 2-methyl-L-norleucine, added 3/16/21
}
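
The relationship between ums, seq, and mod can be sketched as follows. This is an illustrative reconstruction under the assumptions stated in the description, not SidechainNet's actual implementation: the function name standardize and the THREE_TO_ONE table are hypothetical, and the sketch covers only the sequence-level bookkeeping, not the regeneration of coordinates from angles.

```python
# Hypothetical sketch: derive a 1-letter `seq` and a 0/1 `mod` vector
# from a space-separated, 3-letter `ums` string.
ALLOWED_NONSTD_RESIDUES = {
    "ASX": "ASP", "GLX": "GLU", "CSO": "CYS", "HIP": "HIS", "HSD": "HIS",
    "HSE": "HIS", "HSP": "HIS", "MSE": "MET", "SEC": "CYS", "SEP": "SER",
    "TPO": "THR", "PTR": "TYR", "XLE": "LEU", "4FB": "PRO", "MLY": "LYS",
    "AIB": "ALA", "MK8": "MET",
}

THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def standardize(ums):
    """Map an unmodified 3-letter sequence to (seq, mod)."""
    seq, mod = [], []
    for code in ums.split():
        if code in THREE_TO_ONE:
            # Standard residue: keep it and mark it as unmodified.
            seq.append(THREE_TO_ONE[code])
            mod.append(0)
        elif code in ALLOWED_NONSTD_RESIDUES:
            # Allowed non-standard residue: substitute its standard
            # counterpart and mark the position as modified.
            seq.append(THREE_TO_ONE[ALLOWED_NONSTD_RESIDUES[code]])
            mod.append(1)
        else:
            raise ValueError(f"Unsupported residue code: {code}")
    return "".join(seq), mod


if __name__ == "__main__":
    print(standardize("ALA MSE GLY SEP"))  # ('AMGS', [0, 1, 0, 1])
```

Raising an error for codes outside both tables is a design choice made for this sketch; the description does not say how such residues are handled.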

Status