(dev) Support custom datasets. Add + update dataset content. Improve functionality.

Description

This is a massive PR that improves both the dataset and related functionality of SidechainNet.

Summary of changes

Support custom, user-specified datasets.
Add original sequence information (3-letter AA codes before AAs are "standardized").
Make it much easier to reproduce SidechainNet (see scn.create, scn.generate_all) by preprocessing and storing ProteinNet online for user access.
Smaller additions and improvements.

Part 1 - Supporting User-specified Datasets

The following functions have been added to support users in developing datasets. Users may specify which proteins to include and which dataset splits to assign them to. See Section 5 in the Colab Walkthrough for a detailed example.

scn.get_proteinnet_ids

def get_proteinnet_ids(casp_version, split, thinning=None):
    """Return a list of ProteinNet IDs for a given CASP version, split, and thinning.

    Args:
        casp_version (int): CASP version (7, 8, 9, 10, 11, 12).
        split (string): Dataset split ('train', 'valid', 'test'). Validation sets may
            also be specified, ('valid-10', 'valid-20', 'valid-30', 'valid-40',
            'valid-50', 'valid-70', 'valid-90'). If no valid split is specified, all
            validation set splits will be returned. If split == 'all', the training,
            validation, and testing set splits for the specified CASP and training set
            thinning are all returned.
        thinning (int): Training dataset split thinning (30, 50, 70, 90, 95, 100). Default
            None.

    Returns:
        List: Python list of strings representing the ProteinNet IDs in the requested
            split.
    """

scn.create_custom

def create_custom(pnids,
                  output_filename,
                  proteinnet_out="data/proteinnet/",
                  sidechainnet_out="data/sidechainnet/",
                  short_description="Custom SidechainNet dataset.",
                  regenerate_scdata=False):
    """Generate a custom SidechainNet dataset from user-specified ProteinNet IDs.

    This function utilizes a concatenated version of ProteinNet generated by the author.
    This dataset contains the 100% training set thinning from CASP 12, as well as the
    concatenation of all testing and validation sets from CASPs 7-12. By collecting
    this information into one directory (which this function downloads), the user can
    specify any set of ProteinNet IDs that they would like to include, and this
    function will be able to access such data if it is available.

    Args:
        pnids (List): List of ProteinNet-formatted protein identifiers (i.e., formatted
            according to <class>#<pdb_id>_<chain_number>_<chain_id>). ASTRAL identifiers
            are also supported (<class>#<pdb_id>_<ASTRAL_id>).
        output_filename (str): Path to save custom dataset (a pickled Python
            dictionary). ".pkl" extension is recommended.
        proteinnet_out (str, optional): Path to save processed ProteinNet data.
            Defaults to "data/proteinnet/".
        sidechainnet_out (str, optional): Path to save processed SidechainNet data.
            Defaults to "data/sidechainnet/".
        short_description (str, optional): A short description provided by the user to
            describe the dataset. Defaults to "Custom SidechainNet dataset.".
        regenerate_scdata (bool, optional): If true, regenerate raw sidechain-applicable
            data instead of searching for data that has already been preprocessed.
            Defaults to False.

    Returns:
        dict: Saves and returns the requested custom SidechainNet dictionary.
    """
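
As a rough illustration of the identifier formats named in the docstring above, the sketch below splits an ID into its parts. The regular expression is an assumption inferred from those two patterns (with the `<class>#` prefix treated as optional), and `parse_pnid` is a hypothetical helper, not part of SidechainNet.

```python
import re

# Assumed pattern, inferred from the formats named in the docstring:
#   <class>#<pdb_id>_<chain_number>_<chain_id>   (PDB entry)
#   <class>#<pdb_id>_<ASTRAL_id>                 (ASTRAL entry)
# The "<class>#" prefix is treated as optional in this sketch.
_PNID_RE = re.compile(
    r"^(?:(?P<prefix>[^#_]+)#)?"
    r"(?P<pdb_id>[0-9A-Za-z]{4})_"
    r"(?:(?P<chain_number>\d+)_(?P<chain_id>[0-9A-Za-z]+)"
    r"|(?P<astral_id>[0-9A-Za-z_.]+))$"
)


def parse_pnid(pnid):
    """Return the named parts of a ProteinNet-style ID, or None if it does not match."""
    match = _PNID_RE.match(pnid)
    if match is None:
        return None
    # Drop the groups that did not take part in the match.
    return {key: value for key, value in match.groupdict().items() if value is not None}
```

A check like this could be used to catch malformed entries before a list is passed as `pnids` to `create_custom`; IDs that do not follow either pattern simply return `None` here.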

scn.utils.download.download_complete_proteinnet

def download_complete_proteinnet(user_dir=None):
    """Download and return path to complete ProteinNet (all CASPs).

    Args:
        user_dir (str, optional): If provided, download the ProteinNet data here.
            Otherwise, download it to sidechainnet/resources/custom.

    Returns:
        dir_path (str): Path to directory where custom ProteinNet data was downloaded.
    """
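
The functions in Part 1 could be combined roughly as follows. This is a sketch under stated assumptions: the module is passed in as `scn` (in practice, `import sidechainnet as scn`) so that only the signatures documented above are relied on, and `make_small_custom_dataset`, its defaults, and the output filename are hypothetical choices for illustration.

```python
import random


def make_small_custom_dataset(scn, n_train=100, seed=0,
                              output_filename="small_custom.pkl"):
    """Build a small custom dataset from a random subset of CASP12 training IDs.

    `scn` is expected to expose get_proteinnet_ids and create_custom with the
    signatures documented above (e.g. the imported `sidechainnet` module).
    """
    train_ids = scn.get_proteinnet_ids(casp_version=12, split="train", thinning=30)
    valid_ids = scn.get_proteinnet_ids(casp_version=12, split="valid-10")

    # Sample reproducibly so the same subset is chosen on every run.
    rng = random.Random(seed)
    subset = rng.sample(train_ids, min(n_train, len(train_ids)))

    return scn.create_custom(
        pnids=subset + valid_ids,
        output_filename=output_filename,
        short_description=f"{len(subset)} CASP12 training proteins plus valid-10.",
    )
```

According to its docstring, `create_custom` downloads the concatenated ProteinNet data itself, so this sketch does not call `download_complete_proteinnet` explicitly.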

Part 2 - Changes and Additions to Data

Add the following entries to the SidechainNet datasets:
- mod
  - Contains a 1 or 0 for every residue in a protein, with 1 marking residues that have been slightly modified during SidechainNet's construction. For example, selenomethionine is a modified residue. Rather than excluding it from SidechainNet, we "standardize" it by regenerating its coordinates from its angles as if the residue were a methionine. This keeps bond lengths and angles consistent, even though the real residue may not exactly match the residue we are replacing it with. This procedure is currently implemented using the amino acid reassignments specified in ALLOWED_NONSTD_RESIDUES (see below).
  - This data feature is also accessible when using SidechainNet's custom PyTorch dataloaders and the Batch namedtuple objects that they yield during training. For clarity, the data is accessible via the is_modified attribute (i.e. batch.is_modified, which returns a batch-padded tensor with each entry being the corresponding mod vector from the dataset).
- ums (stands for UnModified Sequence)
  - A string representing each protein's unmodified sequence. This is similar to the seq entry, but ums may contain amino acids that are not allowed in SidechainNet. Because there are potentially many different amino acids in this field, we must represent the sequence as a series of 3-letter amino acid codes separated by spaces, instead of the 1-letter codes used in seq.
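
To make the "batch-padded" description of batch.is_modified concrete, here is a minimal pure-Python sketch of padding per-protein mod vectors to a common length. It uses plain lists rather than tensors, the pad value of 0 is an assumption, and `pad_mod_vectors` is a hypothetical helper rather than SidechainNet's collate code.

```python
def pad_mod_vectors(mod_vectors, pad_value=0):
    """Right-pad per-protein `mod` vectors to a common length.

    Returns the padded rows together with the original lengths, loosely
    mirroring what batch.is_modified and batch.lengths are described as holding.
    """
    lengths = [len(vector) for vector in mod_vectors]
    max_len = max(lengths, default=0)
    # Append pad_value to each row until it reaches the longest length.
    padded = [list(vector) + [pad_value] * (max_len - len(vector))
              for vector in mod_vectors]
    return padded, lengths
```

Keeping the original lengths next to the padded rows lets downstream code tell a real 0 (an unmodified residue) from padding.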

ALLOWED_NONSTD_RESIDUES = {
    "ASX": "ASP",
    "GLX": "GLU",
    "CSO": "CYS",
    "HIP": "HIS",
    "HSD": "HIS",
    "HSE": "HIS",
    "HSP": "HIS",
    "MSE": "MET",
    "SEC": "CYS",
    "SEP": "SER",
    "TPO": "THR",
    "PTR": "TYR",
    "XLE": "LEU",
    "4FB": "PRO",
    "MLY": "LYS",  # N-dimethyl-lysine
    "AIB": "ALA",  # alpha-methyl-alanine, not included during generation on 1/23/22
    "MK8": "MET",  # 2-methyl-L-norleucine, added 3/16/21
}
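
The relationship between ums, seq, and mod described above can be sketched as follows. This is an illustration only, not SidechainNet's implementation: the reassignment table is abbreviated, `standardize_ums` is a hypothetical helper, and raising an error for residues outside the table is an assumption about how unsupported residues might be handled.

```python
# A few reassignments copied from ALLOWED_NONSTD_RESIDUES (abbreviated here).
NONSTD_SUBSET = {"MSE": "MET", "SEP": "SER", "TPO": "THR", "PTR": "TYR"}

# Standard 3-letter to 1-letter amino acid codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def standardize_ums(ums, reassignments=NONSTD_SUBSET):
    """Derive a 1-letter sequence and a 0/1 modification mask from a ums string."""
    seq_chars, mod = [], []
    for code in ums.split():
        standard = reassignments.get(code, code)
        if standard not in THREE_TO_ONE:
            # Assumption: residues with no reassignment are rejected in this sketch.
            raise ValueError(f"Unsupported residue: {code}")
        seq_chars.append(THREE_TO_ONE[standard])
        # Mark the position with 1 only if a reassignment was applied.
        mod.append(1 if code in reassignments else 0)
    return "".join(seq_chars), mod
```

For instance, `standardize_ums("ALA MSE GLY")` returns `("AMG", [0, 1, 0])`: the selenomethionine is read as methionine and flagged in the mask.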

Part 3 - Smaller Changes

StructureBuilder.to_pdb() (and its underlying data structure, PdbBuilder) will now also generate SEQRES records for each constructed PDB file. This allows PDB file visualization programs like PyMOL to be aware of where missing residues are located. Note that this still only supports the vocabulary of 20 standard amino acids used by SidechainNet.
Add StructureBuilder.to_pdbstr().
Improve imports, minor textual edits.
Rename batch.ress to batch.resolutions.
Add batch.lengths information containing sequence lengths when batching.
Implement scn.generate_all() which can be used to generate all the datasets in SidechainNet (reproduction).
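
For readers unfamiliar with SEQRES records, the sketch below formats a list of 3-letter residue codes using the fixed-column layout of the PDB file format (13 residues per record). It is not PdbBuilder's code; `seqres_records` and its parameters are hypothetical names chosen for illustration.

```python
def seqres_records(three_letter_codes, chain_id="A", per_line=13):
    """Format residue names as PDB SEQRES records, 13 residues per record."""
    num_residues = len(three_letter_codes)
    records = []
    for start in range(0, num_residues, per_line):
        chunk = three_letter_codes[start:start + per_line]
        serial = start // per_line + 1
        # Columns: record name (1-6), serial (8-10), chain (12),
        # residue count (14-17), residue names from column 20 onward.
        prefix = f"SEQRES {serial:>3} {chain_id} {num_residues:>4}  "
        records.append(prefix + " ".join(chunk))
    return records
```

Because SEQRES lists the full sequence while ATOM records cover only residues with coordinates, a viewer can compare the two to locate missing residues.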

Todos

[x] Allow users to load any previous set of ProteinNet IDs with scn.get_proteinnet_ids (raw data in sidechainnet/resources/all_proteinnet_ids.csv)
[x] Allow users to specify any list of proteins from which to create a SidechainNet dataset (IDs must be in ProteinNet format)
[x] Allow users to specify custom and arbitrary validation set splits (<split_num>#<pdb_id>_<chain_id>_<model_num>)
[x] Provide complete support for ASTRAL entries that are not previously included in ProteinNet (need to successfully determine their sequence or exclude these entries)
[x] Make it easier for users to generate new SidechainNet datasets by compiling all ProteinNet datasets into a single resource that can be accessed via scn.utils.download.download_complete_proteinnet. To do this, I have taken the training set from CASP12 and concatenated the validation and testing sets from all previous CASPs. This means the user will not have to download ProteinNet data on their own!

Status

[x] Ready to go

jonathanking / sidechainnet

(dev) Support custom datasets. Add + update dataset content. Improve functionality. #30