Closed kaymccoy closed 8 months ago
Hi @kaymccoy, in my understanding, NUM_FOLDS is the number of times the code runs: if NUM_FOLDS=5, the code runs five times. NUM_SAMPLES is the number of poses sampled from the same fold, so I think you are right. visualize_first_n_samples controls which samples in the training or validation set are selected for visualization: if visualize_first_n_samples is 10, the first 10 samples in the training or validation set can be visualized.
@onlyonewater thank you so much for the swift response! I guess in that case, my only issue is that when I change visualize_first_n_samples, the number of pdb files saved in my visualization folder doesn't change: it's always 40 for each structure. Do you know what those 40 represent / what I need to fix to actually save more structures? I would like to save all the final predicted structures as well as their rankings according to the confidence model. (Also, love the Conan icon! <3)
@kaymccoy Well, in my understanding, you now want to get more than 40 structures for each protein complex. That depends on the diffusion process. This model is built on diffusion models, and the training code sets NUM_STEPS to 40, so at inference the model can generate at most 40 candidate structures. If you want more candidate structures, you need to train a new model from scratch with NUM_STEPS > 40 in the config file. You can then get more than 40 structures, but never more than the NUM_STEPS you set. Note: each protein complex has NUM_STEPS candidate structures.
@kaymccoy and if you want to train the model from scratch, I have some advice:
CUDA out of memory. Note: if each GPU has 24GB, I think you will still get a CUDA out of memory error even with the batch size set to 1, so I think 48GB of GPU memory is what you need.
dips_esm.yaml. For the detailed parameters, see ./checkpoints/large_model_dips/args.yaml
@onlyonewater thank you so much!! Sorry I wasn't clear; I'm not interested in saving more time steps, but rather saving the time steps for all models generated (e.g. if I enter 2 folds and 5 samples, I would want 2x5x41 structures = 410 structures.) I would also be fine with just saving the final predicted structure for each of the samples / folds. But maybe that's just not possible right now?
However I would actually be interested in retraining on a different dataset! Do you happen to have an idea of how to provide the true structures for comparison to the models, so that the net can learn based on comparing the accuracy of the models to the true structures?
@kaymccoy Well, in my understanding, you can only save up to 40 structures per sample, i.e., if you enter 2 folds and 5 samples, you can get at most 2x5x40=400 structures. If you want the final predicted structure, take the structure from the 40th step (the last of the NUM_STEPS), because in diffusion models the final step gives the final prediction.
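For anyone who only wants that last-step structure out of a visualization folder, here is a minimal sketch of picking it from a list of file names. The naming pattern `ligand-<step>.pdb` and the helper name `final_step_file` are assumptions made for illustration; the actual file names written by the code may differ, so adjust the pattern accordingly.

```python
import re

# Assumption: per-step ligand files are named like "ligand-<step>.pdb".
# The real naming scheme in the visualization folder may differ.
STEP_PATTERN = re.compile(r"ligand-(\d+)\.pdb$")


def final_step_file(filenames):
    """Return the file name with the highest step number, or None if none match."""
    best_name, best_step = None, -1
    for name in filenames:
        match = STEP_PATTERN.search(name)
        if match is None:
            continue  # not a numbered step file (e.g. ground-truth or receptor)
        step = int(match.group(1))
        if step > best_step:
            best_name, best_step = name, step
    return best_name
```

In practice the list could come from `os.listdir` on the visualization folder of a single complex.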
However I would actually be interested in retraining on a different dataset! Do you happen to have an idea of how to provide the true structures for comparison to the models, so that the net can learn based on comparing the accuracy of the models to the true structures?
Well, I am not sure I understand what you mean. In this paper the authors use diffusion models, which predict the noise at each timestep. If you want to predict the structure and compare it with the ground-truth structure, you can refer to EQUIDOCK.
Reference: Ganea O E, Huang X, Bunne C, et al. Independent SE(3)-equivariant models for end-to-end rigid protein docking[J]. arXiv preprint arXiv:2111.07786, 2021.
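To make "compare with the ground-truth structure" concrete, here is a minimal sketch of a plain RMSD between predicted and true coordinates. It is not taken from either codebase, it is not the noise-prediction objective described above, and it assumes both structures list the same atoms in the same order with no alignment step. The function name `rmsd` is illustrative.

```python
import math


def rmsd(pred, true):
    """Root-mean-square deviation between two equal-length lists of (x, y, z)."""
    if len(pred) != len(true) or not pred:
        raise ValueError("coordinate lists must be non-empty and the same length")
    # Sum squared differences over every atom and every coordinate axis.
    squared = sum(
        (p - t) ** 2
        for atom_pred, atom_true in zip(pred, true)
        for p, t in zip(atom_pred, atom_true)
    )
    # Average over atoms (not over individual coordinates), then take the root.
    return math.sqrt(squared / len(pred))
```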
I had to work on another project, but I've gotten back to this and reviewed the code in more detail, to figure out the answers to my questions! I believe these are the answers, in case anyone had similar questions; please correct me if I'm wrong.
I'll close this issue now!
I'm setting up some custom runs, but I'm not certain about what the variables NUM_SAMPLES, NUM_FOLDS, and visualize_n_val_graphs do. Any help would be appreciated!
As far as I can tell from skimming the code, NUM_FOLDS allows for starting the diffusion process with different seeds (i.e. with NUM_FOLDS=1, the diffusion process would begin from the same centered and randomly rotated binding partners for each prediction), whereas NUM_SAMPLES refers to the number of poses sampled from the same fold. Then the minimum of the number of samples (which itself is probably NUM_SAMPLES or perhaps NUM_FOLDS * NUM_SAMPLES?) and the value of visualize_first_n_samples determines how many samples actually get pdb files saved, showing the protein complex structure at each time step of the reverse diffusion process.
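The reading above can be written down as a small counting sketch. This is only my interpretation, not confirmed behaviour of the code: the function name `expected_pdb_count` is hypothetical, it assumes the sample pool is NUM_FOLDS * NUM_SAMPLES, and it assumes each visualized sample writes one ligand file per step plus the starting pose (NUM_STEPS + 1 files).

```python
def expected_pdb_count(num_folds, num_samples, visualize_first_n_samples, num_steps=40):
    """Hypothetical number of per-step ligand pdb files under the reading above."""
    # Assumption: one file per reverse-diffusion step plus the starting pose.
    files_per_sample = num_steps + 1
    # Assumption: samples are drawn per fold, so the pool is folds * samples.
    total_samples = num_folds * num_samples
    # Only the first n samples are visualized, capped by how many exist.
    visualized = min(total_samples, visualize_first_n_samples)
    return visualized * files_per_sample
```

For example, 2 folds and 5 samples with visualize_first_n_samples=10 would give 10 x 41 = 410 files under these assumptions, while visualize_first_n_samples=1 would give 41.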
However, when I actually run my custom test set through, varying all three of these values does not change what's saved in the visualization directory. Instead, I consistently get 41 ligand files numbered from 0-40, a ligand-gt file, and a receptor file. I assume that the ligand-gt file is the randomized and centered starting position of the ligand, and that the ligand structures numbered 0-40 are different time steps of a single diffusion process.
Could I get some clarification on what those three flags (NUM_FOLDS, NUM_SAMPLES, and visualize_first_n_samples) mean, and how I can save all the ranked predicted final structures as pdb files? Thanks so much for your time.