Clarification on Pocket2Mol model training with CrossDocked2020 dataset

mahmoud-ekhani commented 1 year ago

Hello, I appreciate your excellent work on the Pocket2Mol model. I am attempting to train this model on a larger dataset and have started with the CrossDocked2020-v1.3 dataset, restricted to poses with RMSD < 2 Å. This dataset already comes with clustered training and test splits, so I'm unsure why the mmseqs2 clustering step is needed. Could you please explain the reason for this step?

Additionally, the CrossDocked2020 dataset provides several kinds of docked poses, such as AutoDock Vina docked poses of ligands in the receptor and the first and second iterations of CNN-optimized poses. Which of these did you use for training?

Lastly, I noticed that the .pdb files in your training dataset are smaller than those in the CrossDocked2020 dataset. Did you perform any extra processing steps to obtain these smaller receptor files?

Thank you in advance for your time and assistance!

pengxingang commented 1 year ago

Hi, sorry for the late reply. The processing of the CrossDocked dataset is described in detail in our newer work, TargetDiff. Please see that repository for the full processing pipeline from scratch.
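
On the question of smaller receptor files: one common way to get them is to keep only the residues that lie near the ligand (a "pocket" file) instead of the full receptor. The snippet below is a minimal sketch of that idea, not the processing code from this repository or from TargetDiff. The function names `parse_atoms` and `extract_pocket` are hypothetical, and the 10 Å cutoff is an assumption chosen for illustration.

```python
import math


def parse_atoms(pdb_text):
    """Return (residue_key, (x, y, z), line) for every ATOM record.

    Uses the fixed-column PDB layout: chain ID in column 22,
    residue number plus insertion code in columns 23-27,
    and x/y/z in columns 31-54.
    """
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            key = (line[21], line[22:27].strip())
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            atoms.append((key, xyz, line))
    return atoms


def extract_pocket(pdb_text, ligand_coords, cutoff=10.0):
    """Keep whole residues that have at least one atom within `cutoff`
    angstroms of any ligand atom. The cutoff value is an assumption."""
    atoms = parse_atoms(pdb_text)
    near = {
        key
        for key, xyz, _ in atoms
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_coords)
    }
    # Emit every atom of each selected residue, in the original order.
    return "\n".join(line for key, _, line in atoms if key in near)
```

Here `ligand_coords` is a list of `(x, y, z)` tuples for the ligand atoms, which would need to be read from the ligand file separately. Selecting whole residues rather than individual atoms keeps each residue chemically complete in the output.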

mahmoud-ekhani commented 1 year ago

Awesome! Thank you for your response!

pengxingang / Pocket2Mol, issue #21