Open Kerro-junior opened 10 months ago
The method is implemented in https://github.com/gnina/scripts/blob/master/clustering.py, which is described in the README of the scripts repository.
We use the CrossDocked set now, though.
I have a similar question -- would it be possible for you to upload a text file mapping receptor_ids to cluster ids for the CrossDocked data set? That would let us match your splits without rerunning the computation. Thanks
The full CrossDocked set is available at https://bits.csb.pitt.edu/files/crossdock2020/ and you can get the receptor ids from the types files.
Fantastic. Thank you for the swift reply. I see "receptor_ids" in the format of "XXX/pdb_id_Y". Am I understanding correctly that the "XXX" prefix is the id of the "structural homology cluster", i.e. any pocket with an id prefix "XXX" is similar according to the Pocketome? That being said -- is there any information on how structural similarity between pockets was determined? I tried to look up the website of pocketome.org for a definition but unfortunately it appears to be down. The reason why I am asking: I am trying to assess how to set up a fair train/test split to evaluate generalizability. Regards
Yes, XXX (e.g. 1433B_HUMAN_1_240_pep_0) is the Pocketome cluster. The original Pocketome paper is here. We clustered the Pocketome pockets by sequence similarity using the clustering.py script.
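For anyone who wants to pull the "XXX" prefix out of the types files, here is a minimal sketch. It assumes that the receptor is the first whitespace-separated field containing a "/" and that everything before that "/" is the Pocketome prefix; the column layout and the file names in `example_lines` are made up for illustration and are not taken from the actual CrossDocked files. The function names are hypothetical and are not part of the gnina scripts.

```python
from collections import defaultdict


def pocket_prefix(receptor_id):
    """Return the text before the first '/' (the 'XXX' Pocketome prefix)."""
    prefix, sep, _ = receptor_id.partition("/")
    if not sep:
        raise ValueError(f"no '/' in receptor id: {receptor_id!r}")
    return prefix


def receptors_by_pocket(lines):
    """Map each Pocketome prefix to the set of receptor ids seen under it.

    Assumption: the receptor is the first field on the line containing '/'.
    Lines without such a field (blank lines, comments) are skipped.
    """
    groups = defaultdict(set)
    for line in lines:
        receptor = next((f for f in line.split() if "/" in f), None)
        if receptor is None:
            continue
        groups[pocket_prefix(receptor)].add(receptor)
    return dict(groups)


# Made-up lines in an assumed "labels... receptor ligand" layout.
example_lines = [
    "1 0.0 1433B_HUMAN_1_240_pep_0/aaaa_A_rec 1433B_HUMAN_1_240_pep_0/aaaa_A_lig",
    "0 0.0 1433B_HUMAN_1_240_pep_0/bbbb_A_rec 1433B_HUMAN_1_240_pep_0/aaaa_A_lig",
    "1 0.0 OTHER_POCKET_0/cccc_B_rec OTHER_POCKET_0/cccc_B_lig",
]

pockets = receptors_by_pocket(example_lines)
for name in sorted(pockets):
    print(name, sorted(pockets[name]))
```

Note that this only recovers the Pocketome prefix. The sequence-similarity clusters used for the splits come from clustering.py and are a separate grouping on top of these prefixes.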
As the paper "Protein-Ligand Scoring with Convolutional Neural Networks" says: "The performance of trained CNN models were evaluated by 3-fold cross-validation for both the pose prediction and virtual screening tasks. To avoid evaluating models on targets similar to those in the training set, training and test folds were constructed by clustering data based on target families rather than individual targets."
But I couldn't find the dataset here, and I don't understand how you construct the test folds by target family. I also couldn't find the test folds for pose prediction here.
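To make the idea of folds built from clusters rather than individual targets concrete, here is a minimal sketch of group-wise fold assignment. It is not the clustering.py implementation: it assumes the cluster labels and their sizes are already known, and it uses a simple greedy balancing rule (largest cluster first, into the currently smallest fold) chosen only for illustration. `assign_folds` and the cluster names are hypothetical.

```python
def assign_folds(cluster_sizes, n_folds=3):
    """Assign whole clusters to folds so that no cluster spans two folds.

    cluster_sizes: dict mapping cluster label -> number of examples.
    Returns a list of n_folds lists of cluster labels.
    Greedy rule (an illustrative choice): take clusters from largest to
    smallest and put each one into the fold with the fewest examples so far.
    """
    folds = [[] for _ in range(n_folds)]
    totals = [0] * n_folds
    # Sort by size descending, then by label so the result is deterministic.
    ordered = sorted(cluster_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for label, size in ordered:
        smallest = totals.index(min(totals))
        folds[smallest].append(label)
        totals[smallest] += size
    return folds


# Made-up cluster sizes for illustration.
sizes = {"A": 5, "B": 3, "C": 3, "D": 2, "E": 1}
folds = assign_folds(sizes, n_folds=3)
print(folds)  # [['A'], ['B', 'D'], ['C', 'E']]
```

Each fold then serves once as the test set while the other folds are used for training, so every test target belongs to a cluster the model never saw during training.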