Users who do not want SidechainNet's original composition and organization can now provide a custom list of ProteinNet identifiers.
Todos
[x] Allow users to load any previous set of ProteinNet IDs with scn.get_proteinnet_ids (raw data in sidechainnet/resources/all_proteinnet_ids.csv)
[x] Allow users to specify any list of proteins from which to create a SidechainNet dataset (IDs must be in ProteinNet format)
[x] Allow users to specify custom and arbitrary validation set splits (<split_num>#<pdb_id>_<chain_id>_<model_num>)
[x] Provide complete support for ASTRAL entries that were not previously included in ProteinNet (need to successfully determine their sequence or exclude these entries)
[x] Make it easier for users to generate new SidechainNet datasets by compiling all ProteinNet datasets into a single resource that can be accessed via scn.utils.download.download_complete_proteinnet. To do this, I took the training set from CASP12 and concatenated the validation and testing sets from all previous CASPs. This means users do not have to download ProteinNet data on their own!
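As a rough illustration of the custom validation split item above, the sketch below tags IDs with a split prefix using the `<split_num>#...` convention. The helper names `split_of` and `tag_with_split` and the example ID `1ABC_A_1` are hypothetical and are not part of SidechainNet.

```python
def split_of(pnid):
    """Return the split prefix of an ID as a string, or None if it has none."""
    prefix, sep, _ = pnid.partition("#")
    return prefix if sep else None


def tag_with_split(pnid, split_num):
    """Prefix an ID with a validation split number (hypothetical helper).

    Any existing '<split_num>#' prefix is replaced rather than stacked.
    """
    _, sep, rest = pnid.partition("#")
    bare = rest if sep else pnid
    return f"{split_num}#{bare}"


# Hypothetical IDs, assigned to two custom validation splits.
custom_valid = [tag_with_split("1ABC_A_1", 10), tag_with_split("30#2XYZ_B_1", 50)]
```

IDs without a `#` prefix are left untagged by `split_of`, so the same list can mix training and validation entries.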
Added functionality
scn.get_proteinnet_ids
def get_proteinnet_ids(casp_version, split, thinning=None):
    """Return a list of ProteinNet IDs for a given CASP version, split, and thinning.

    Args:
        casp_version (int): CASP version (7, 8, 9, 10, 11, 12).
        split (string): Dataset split ('train', 'valid', 'test'). Validation sets may
            also be specified, ('valid-10', 'valid-20', 'valid-30', 'valid-40',
            'valid-50', 'valid-70', 'valid-90'). If no valid split is specified, all
            validation set splits will be returned. If split == 'all', the training,
            validation, and testing set splits for the specified CASP and training set
            thinning are all returned.
        thinning (int): Training dataset split thinning (30, 50, 70, 90, 95, 100). Default
            None.

    Returns:
        List: Python list of strings representing the ProteinNet IDs in the requested
            split.
    """
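A minimal sketch of how a caller might check arguments against the values listed in the docstring before requesting IDs. The names `check_request` and `load_ids` are hypothetical wrappers; only the call to `scn.get_proteinnet_ids` comes from this PR, and it is imported lazily so the checks can run without the package installed.

```python
# Allowed values, copied from the docstring above.
VALID_CASPS = {7, 8, 9, 10, 11, 12}
VALID_THINNINGS = {30, 50, 70, 90, 95, 100}
VALID_SPLITS = {"train", "valid", "test", "all"} | {
    f"valid-{n}" for n in (10, 20, 30, 40, 50, 70, 90)
}


def check_request(casp_version, split, thinning=None):
    """Raise ValueError if an argument falls outside the documented values."""
    if casp_version not in VALID_CASPS:
        raise ValueError(f"Unknown CASP version: {casp_version!r}")
    if split not in VALID_SPLITS:
        raise ValueError(f"Unknown split: {split!r}")
    if thinning is not None and thinning not in VALID_THINNINGS:
        raise ValueError(f"Unknown thinning: {thinning!r}")
    return casp_version, split, thinning


def load_ids(casp_version, split, thinning=None):
    """Validate the request, then delegate to scn.get_proteinnet_ids."""
    check_request(casp_version, split, thinning)
    import sidechainnet as scn  # lazy import: requires sidechainnet to be installed
    return scn.get_proteinnet_ids(casp_version, split, thinning)
```

For example, `load_ids(12, "valid-10")` would request only the 10% validation split for CASP12, assuming the package and its ID resource are available.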
scn.create_custom
def create_custom(pnids,
                  output_filename,
                  proteinnet_out="data/proteinnet/",
                  sidechainnet_out="data/sidechainnet/",
                  short_description="Custom SidechainNet dataset.",
                  regenerate_scdata=False):
    """Generate a custom SidechainNet dataset from user-specified ProteinNet IDs.

    This function utilizes a concatenated version of ProteinNet generated by the author.
    This dataset contains the 100% training set thinning from CASP 12, as well as the
    concatenation of every testing and validation set from CASPs 7-12. By collecting
    this information into one directory (which this function downloads), the user can
    specify any set of ProteinNet IDs that they would like to include, and this
    function will be able to access such data if it is available.

    Args:
        pnids (List): List of ProteinNet-formatted protein identifiers (i.e., formatted
            according to <class>#<pdb_id>_<chain_number>_<chain_id>. ASTRAL identifiers
            are also supported, <class>#<pdb_id>_<ASTRAL_id>.)
        output_filename (str): Path to save custom dataset (a pickled Python
            dictionary). ".pkl" extension is recommended.
        proteinnet_out (str, optional): Path to save processed ProteinNet data.
            Defaults to "data/proteinnet/".
        sidechainnet_out (str, optional): Path to save processed SidechainNet data.
            Defaults to "data/sidechainnet/".
        short_description (str, optional): A short description provided by the user to
            describe the dataset. Defaults to "Custom SidechainNet dataset.".
        regenerate_scdata (bool, optional): If true, regenerate raw sidechain-applicable
            data instead of searching for data that has already been preprocessed.
            Defaults to False.

    Returns:
        dict: Saves and returns the requested custom SidechainNet dictionary.
    """
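One way the pieces might fit together is sketched below: ID lists are merged without duplicates and then handed to `scn.create_custom`. The helpers `merge_ids` and `build_dataset`, the output filename, and the description string are all hypothetical; only the `create_custom` signature is taken from the docstring above.

```python
def merge_ids(*id_lists):
    """Concatenate ID lists, dropping duplicates while keeping first-seen order."""
    return list(dict.fromkeys(pnid for ids in id_lists for pnid in ids))


def build_dataset(pnids, output_filename="custom_scn.pkl"):
    """Create a custom dataset from the given IDs (downloads data when called)."""
    import sidechainnet as scn  # lazy import: requires sidechainnet to be installed
    return scn.create_custom(
        pnids,
        output_filename,
        short_description="Merged custom ID list.",  # illustrative description
    )
```

Deduplicating first is a convenience assumed here so that overlapping ID lists do not request the same protein twice; whether `create_custom` tolerates repeated IDs is not stated in this PR.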
scn.utils.download.download_complete_proteinnet
def download_complete_proteinnet(user_dir=None):
    """Download and return path to complete ProteinNet (all CASPs).

    Args:
        user_dir (str, optional): If provided, download the ProteinNet data here.
            Otherwise, download it to sidechainnet/resources/custom.

    Returns:
        dir_path (str): Path to directory where custom ProteinNet data was downloaded.
    """