Looking for CSAR dataset and test folds for pose prediction task

gnina / models

Trained caffe models


Looking for CSAR dataset and test folds for pose prediction task #35

Open Kerro-junior opened 10 months ago

Kerro-junior commented 10 months ago

As the paper "Protein−Ligand Scoring with Convolutional Neural Networks" says: "The performance of trained CNN models were evaluated by 3-fold cross-validation for both the pose prediction and virtual screening tasks. To avoid evaluating models on targets similar to those in the training set, training and test folds were constructed by clustering data based on target families rather than individual targets."

However, I couldn't find the dataset here, and I don't know how you constructed the test folds by target family. I also couldn't find the test folds for pose prediction here.

dkoes commented 10 months ago

The method is implemented in https://github.com/gnina/scripts/blob/master/clustering.py, which is described in the README of the scripts repository.

We use the CrossDocked set now, though.

OliviaViessmann commented 7 months ago

I have a similar question -- would it be possible for you to upload a txt file that maps receptor_ids to cluster ids for the CrossDocked data set? That would let me avoid rerunning the computation and make sure I match your splits. Thanks

dkoes commented 7 months ago

The full CrossDocked set is available here: https://bits.csb.pitt.edu/files/crossdock2020/. You can get the receptor ids from the types files.
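For readers who want to pull the receptor ids out of a types file, here is a minimal sketch. It assumes each line is whitespace-separated and that the first token containing a "/" is the receptor path in the "XXX/pdb_id_Y" form mentioned in this thread. The exact column layout is not confirmed here, and the example lines and the function name `iter_receptor_ids` are made up for illustration.

```python
def iter_receptor_ids(lines):
    """Yield the first '/'-containing token of each line.

    Assumption: that token is the receptor path; check this against
    the actual types files before relying on it.
    """
    for line in lines:
        for token in line.split():
            if "/" in token:
                yield token
                break


# Hypothetical lines for illustration only; not copied from the real files.
EXAMPLE_LINES = [
    "1 5.2 0.8 POCKET_A/1abc_A_rec.gninatypes POCKET_A/1abc_A_rec_1abc_lig_0.gninatypes",
    "0 5.2 4.1 POCKET_A/1abc_A_rec.gninatypes POCKET_A/1abc_A_rec_1abc_lig_1.gninatypes",
    "1 6.0 1.1 POCKET_B/2xyz_B_rec.gninatypes POCKET_B/2xyz_B_rec_2xyz_lig_0.gninatypes",
    "",  # blank lines are skipped
]

unique_ids = sorted(set(iter_receptor_ids(EXAMPLE_LINES)))
print(unique_ids)

# With a real file (the filename is a placeholder):
# with open("some_file.types") as fh:
#     unique_ids = sorted(set(iter_receptor_ids(fh)))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, so even large types files are read one line at a time.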

OliviaViessmann commented 7 months ago

Fantastic. Thank you for the swift reply. I see "receptor_ids" in the format of "XXX/pdb_id_Y". Am I understanding correctly that the "XXX" prefix is the id of the "structural homology cluster", i.e. any pocket with an id prefix "XXX" is similar according to the Pocketome? That being said -- is there any information on how structural similarity between pockets was determined? I tried to look up the website of pocketome.org for a definition but unfortunately it appears to be down. The reason why I am asking: I am trying to assess how to set up a fair train/test split to evaluate generalizability. Regards

dkoes commented 7 months ago

Yes, XXX (e.g. 1433B_HUMAN_1_240_pep_0) is the Pocketome cluster. The original Pocketome paper is here. We clustered the Pocketome pockets by sequence similarity using the clustering.py script.
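As a rough starting point for a leakage-aware split, the sketch below groups receptor ids by their "XXX" prefix and assigns whole groups to folds, so no prefix is shared between folds. This is not the clustering.py procedure: it does not merge pockets that are similar by sequence, so pockets with different prefixes can still be related. The function name `assign_folds`, the greedy balancing, and the example ids are assumptions made for illustration.

```python
from collections import defaultdict


def assign_folds(receptor_ids, n_folds=3):
    """Assign receptor ids to folds, keeping each prefix group together."""
    groups = defaultdict(list)
    for rid in receptor_ids:
        prefix = rid.split("/", 1)[0]  # the "XXX" part of "XXX/pdb_id_Y"
        groups[prefix].append(rid)

    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: largest groups first, each into the smallest fold.
    # Ties are broken by prefix name so the result is deterministic.
    for prefix in sorted(groups, key=lambda p: (-len(groups[p]), p)):
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(groups[prefix])
    return folds


# Hypothetical ids for illustration only.
example_ids = [
    "A/1_rec", "A/2_rec", "A/3_rec",
    "B/4_rec", "B/5_rec",
    "C/6_rec",
    "D/7_rec",
]
folds = assign_folds(example_ids, n_folds=3)
for i, fold in enumerate(folds):
    print(i, fold)
```

To get closer to the splits discussed in this thread, the prefix groups would first need to be merged into sequence-similarity clusters, and those clusters, rather than the raw prefixes, would be assigned to folds.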