ketatam / DiffDock-PP

Implementation of DiffDock-PP: Rigid Protein-Protein Docking with Diffusion Models in PyTorch (ICLR 2023 - MLDD Workshop)
https://arxiv.org/abs/2304.03889

Clarification on NUM_SAMPLES, NUM_FOLDS, and visualize_n_val_graphs #21

Closed: kaymccoy closed this issue 6 months ago

kaymccoy commented 10 months ago

I'm setting up some custom runs, but I'm not certain about what the variables NUM_SAMPLES, NUM_FOLDS, and visualize_n_val_graphs do. Any help would be appreciated!

As far as I can tell from skimming the code, NUM_FOLDS starts the diffusion process from different seeds (i.e. with NUM_FOLDS=1, every prediction would begin from the same centered and randomly rotated binding partners), whereas NUM_SAMPLES is the number of poses sampled within the same fold. The minimum of the number of samples (which is itself probably NUM_SAMPLES, or perhaps NUM_FOLDS * NUM_SAMPLES?) and the value of visualize_first_n_samples is then used to decide how many pdb files to actually save, showing the protein complex structure at each time step of the reverse diffusion process.

However, when I actually run my custom test set through, varying all three of these values does not change what's saved in the visualization directory. Instead, I consistently get 41 ligand files numbered 0-40, a ligand-gt file, and a receptor file. I assume the ligand-gt file is the randomized, centered starting position of the ligand, and that each ligand file numbered 0-40 is a different time step of a single diffusion process.

Could I get some clarification on what those three flags (NUM_FOLDS, NUM_SAMPLES, and visualize_first_n_samples) mean, and how I can save all the ranked predicted final structures as pdb files? Thanks so much for your time.

onlyonewater commented 10 months ago

hi, @kaymccoy, in my understanding, NUM_FOLDS means the number of code runs: if NUM_FOLDS=5, the code runs five times. NUM_SAMPLES refers to the number of poses sampled from the same fold; I think you are right about that. And visualize_first_n_samples selects which samples in the training set or validation set get visualized: if visualize_first_n_samples is 10, the first 10 samples in the training or validation set are visualized.
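If that reading is right, the total number of generated poses is just the product of the two flags. A toy sketch of the bookkeeping (the flag names come from the thread; the loop structure is an illustration, not the actual DiffDock-PP code):

```python
# Hypothetical sketch: each fold is an independent run (its own seed),
# and each run draws NUM_SAMPLES poses for the same binding pair.
NUM_FOLDS = 2
NUM_SAMPLES = 5

poses = []
for fold in range(NUM_FOLDS):          # independent runs / seeds
    for sample in range(NUM_SAMPLES):  # poses sampled within one run
        poses.append((fold, sample))

print(len(poses))  # NUM_FOLDS * NUM_SAMPLES = 10 poses in total
```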

kaymccoy commented 10 months ago

@onlyonewater thank you so much for the swift response! In that case, my only issue is that changing visualize_first_n_samples doesn't change the number of pdb files saved in my visualization folder; it's always 40 per structure. Do you know what those 40 represent, and what I need to fix to actually save more structures? I would like to save all the final predicted structures along with their rankings according to the confidence model. (Also, love the Conan icon! <3)

onlyonewater commented 10 months ago

@kaymccoy well, in my understanding, you now want to get more structures (> 40) for each protein complex. That depends on the diffusion process: this model is built on diffusion models, and in the training stage the code sets NUM_STEPS to 40, so in the inference stage the model can generate at most 40 candidate structures. If you want more candidate structures, you need to train a new model from scratch (set NUM_STEPS > 40 in the config file); you will then get more than 40 structures, but never more than the NUM_STEPS you set. Note: each protein complex gets NUM_STEPS candidate structures.
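This also matches the 41 ligand files observed above: reverse diffusion with NUM_STEPS = 40 writes one structure per step, steps 0 through 40 inclusive. A minimal sketch of that saving loop (the writer and the file-naming scheme are inferred from the thread, not taken from the real code):

```python
import os
import re
import tempfile

NUM_STEPS = 40  # fixed at training time; inference reuses it

outdir = tempfile.mkdtemp()

def write_pdb(path):
    # Stand-in for the real PDB writer; real code would dump coordinates.
    with open(path, "w") as f:
        f.write("REMARK placeholder structure\n")

# One file per reverse-diffusion step, 0..NUM_STEPS inclusive -> 41 files,
# which is why the visualization directory always holds ligand-0 .. ligand-40.
for step in range(NUM_STEPS + 1):
    # ...the denoising update for this step would run here...
    write_pdb(os.path.join(outdir, f"ligand-{step}.pdb"))

numbered = [f for f in os.listdir(outdir) if re.fullmatch(r"ligand-\d+\.pdb", f)]
print(len(numbered))  # 41
```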

onlyonewater commented 10 months ago

@kaymccoy and if you want to train the model from scratch, I have some advice:

  1. you should have at least 4 GPUs, each with at least 48GB; with less than 48GB per GPU you will get a CUDA out of memory error. Note: with 24GB per GPU, I think you still run out of memory even if you set the batch size to 1. So 48GB of GPU memory is all you need.
  2. training takes quite a while, maybe more than four days, I'd guess. And if you want to reach the performance reported in the paper, you need to change the ns and nv parameters in dips_esm.yaml; the detailed parameter values are in ./checkpoints/large_model_dips/args.yaml
  3. set no_graph_cache to False in the config file, which saves time in data processing, and you can also set sample_train to False in the config file to save training time.

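Collecting the knobs from the points above, a training config might contain entries roughly like the following (key names are the ones mentioned in this thread; the layout is an assumption, and the ns/nv placeholders should be replaced with the values from ./checkpoints/large_model_dips/args.yaml):

```yaml
# Hypothetical excerpt of a training config (e.g. dips_esm.yaml);
# only the keys discussed in this thread are shown.
num_steps: 40          # diffusion steps; raise above 40 to save more structures
no_graph_cache: False  # reuse cached graphs to speed up data processing
sample_train: False    # saves training time
ns: 0                  # placeholder; copy from ./checkpoints/large_model_dips/args.yaml
nv: 0                  # placeholder; likewise
```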
kaymccoy commented 10 months ago

@onlyonewater thank you so much!! Sorry I wasn't clear; I'm not interested in saving more time steps, but rather saving the time steps for all models generated (e.g. if I enter 2 folds and 5 samples, I would want 2x5x41 = 410 structures.) I would also be fine with just saving the final predicted structure for each of the samples / folds. But maybe that's just not possible right now?

However I would actually be interested in retraining on a different dataset! Do you happen to have an idea of how to provide the true structures for comparison to the models, so that the net can learn based on comparing the accuracy of the models to the true structures?

onlyonewater commented 10 months ago

@kaymccoy Well, in my understanding, you can only save up to 40 structures per sample, i.e., if you enter 2 folds and 5 samples, you get at most 2x5x40 = 400 structures. If you want the final predicted structure, take the structure at the 40th (last) NUM_STEPS step, because in diffusion models the final step is the final prediction result.
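Following that logic, pulling out only the final pose for a sample means grabbing the highest-numbered ligand file in its visualization directory. A hedged sketch (the ligand-N/ligand-gt/receptor naming is inferred from the files reported earlier in this thread, not from the code):

```python
import os
import re
import tempfile

def final_pose(vis_dir):
    """Return the path of the highest-step ligand PDB, i.e. the final prediction."""
    steps = {}
    for name in os.listdir(vis_dir):
        m = re.fullmatch(r"ligand-(\d+)\.pdb", name)
        if m:
            steps[int(m.group(1))] = name
    # The highest reverse-diffusion step is the final denoised structure.
    return os.path.join(vis_dir, steps[max(steps)])

# Demo with dummy files mimicking the directory described in the thread.
d = tempfile.mkdtemp()
for i in range(41):
    open(os.path.join(d, f"ligand-{i}.pdb"), "w").close()
open(os.path.join(d, "ligand-gt.pdb"), "w").close()
open(os.path.join(d, "receptor.pdb"), "w").close()

print(os.path.basename(final_pose(d)))  # ligand-40.pdb
```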

onlyonewater commented 10 months ago

> However I would actually be interested in retraining on a different dataset! Do you happen to have an idea of how to provide the true structures for comparison to the models, so that the net can learn based on comparing the accuracy of the models to the true structures?

well, I'm not sure I understand your meaning. In this paper the authors use diffusion models, which predict the noise at each timestep; if you want to predict the structure directly and compare it with the ground-truth structure, you can refer to EQUIDOCK.

Reference: Ganea, O.-E., Huang, X., Bunne, C., et al. Independent SE(3)-equivariant models for end-to-end rigid protein docking. arXiv preprint arXiv:2111.07786, 2021.

kaymccoy commented 6 months ago

I had to work on another project, but I've gotten back to this and reviewed the code in more detail to figure out the answers to my questions! I believe these are the answers, in case anyone has similar questions; please correct me if I'm wrong.

  1. NUM_FOLDS allows for training with different splits of data, then evaluating which is best at the end.
  2. Then NUM_SAMPLES refers to the number of poses sampled from the same binding pair to add to your BindingDataset.
  3. visualize_first_n_samples gets you the first n samples of the diffusion process, but not for the top-ranked structure; it's just one of the diffusion processes, before the results are sorted. I have since made my own edit that saves all the final structures named by their confidence ranking; that's not something this code offers as of 4/2/24, but it's a pretty simple edit to make!
  4. The loss function compares the generated model to the original structures you input, so the protein pairs you provide for docking must be either the true structures with coordinates matching their original / true positions, or models / unbound structures that have been pre-aligned to the true structures.
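For anyone wanting to replicate the edit mentioned in point 3, the gist is to sort the sampled final poses by confidence score and name the output files by rank. A rough sketch under assumptions (the confidence values, the pose contents, and the assumption that higher score means a better pose are all stand-ins, not the actual DiffDock-PP code):

```python
import os
import tempfile

def save_ranked(poses, outdir):
    """Write each final pose to a PDB file named by its confidence rank.

    `poses` is a list of (confidence_score, pdb_text) pairs; higher score
    is assumed to mean a better pose. Returns the written paths, rank 1 first.
    """
    ranked = sorted(poses, key=lambda p: p[0], reverse=True)
    paths = []
    for rank, (score, pdb_text) in enumerate(ranked, start=1):
        path = os.path.join(outdir, f"rank{rank}_conf{score:.3f}.pdb")
        with open(path, "w") as f:
            f.write(pdb_text)
        paths.append(path)
    return paths

# Demo: three dummy poses with made-up confidence scores.
d = tempfile.mkdtemp()
paths = save_ranked([(0.12, "POSE A"), (0.87, "POSE B"), (0.45, "POSE C")], d)
print(os.path.basename(paths[0]))  # rank1_conf0.870.pdb
```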

I'll close this issue now!