mattragoza / LiGAN

Deep generative models of 3D grids for structure-based drug discovery
GNU General Public License v2.0

Training data format #31

Closed mengliu1998 closed 2 years ago

mengliu1998 commented 2 years ago

Hi,

Thank you for this interesting and insightful work. I would like to follow your experimental setting. I have the following questions about the data format.

(1) How many training protein-ligand pairs do you have after filtering out poses with RMSD greater than 2A? Is it 486740, the number of lines in it2_tt_0_lowrmsd_mols_train0_fixed.types?

(2) Could you explain the meaning of each line in it2_tt_0_lowrmsd_mols_train0_fixed.types? For example, what do the first three numbers mean in the following line?

1 5.119186 1.97462 1433B_HUMAN_1_240_pep_0/4gnt_A_rec.pdb 1433B_HUMAN_1_240_pep_0/4gnt_A_rec_5f74_amp_lig_tt_min_0.sdf.gz #-6.28497

Thank you in advance.

mattragoza commented 2 years ago

(1) The provided types files contain only filtered poses, all with RMSD < 2A, so the number of lines in the train0 file is the number of protein-ligand pairs in the training set.

(2) Sure. The columns are:

  1. binary label indicating if the pose is less than 2A RMSD from crystal pose
  2. binding affinity (0.0 if not available)
  3. RMSD from crystal pose
  4. path to receptor file, relative to data root
  5. path to ligand file, relative to data root
  6. vina energy (after the hash symbol)
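
For anyone reading these files programmatically, here is a minimal Python sketch of parsing one line in the column layout above. The names `TypesRecord` and `parse_types_line` are hypothetical and not part of LiGAN. It assumes the five fields are whitespace-separated, that paths contain no spaces or `#`, and that whatever follows `#` is a single number.

```python
from typing import NamedTuple, Optional


class TypesRecord(NamedTuple):
    """One line of a .types file (hypothetical container, not a LiGAN class)."""
    label: int                    # 1 if the pose is < 2A RMSD from the crystal pose
    affinity: float               # binding affinity, 0.0 if not available
    rmsd: float                   # RMSD from the crystal pose
    receptor: str                 # receptor path, relative to the data root
    ligand: str                   # ligand path, relative to the data root
    vina_energy: Optional[float]  # value after '#', None if absent


def parse_types_line(line: str) -> TypesRecord:
    # Split off the trailing comment that holds the vina energy.
    body, _, comment = line.partition("#")
    fields = body.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields before '#', got {len(fields)}")
    label, affinity, rmsd, receptor, ligand = fields
    comment = comment.strip()
    return TypesRecord(
        label=int(label),
        affinity=float(affinity),
        rmsd=float(rmsd),
        receptor=receptor,
        ligand=ligand,
        vina_energy=float(comment) if comment else None,
    )


# The example line quoted in the question.
example = (
    "1 5.119186 1.97462 "
    "1433B_HUMAN_1_240_pep_0/4gnt_A_rec.pdb "
    "1433B_HUMAN_1_240_pep_0/4gnt_A_rec_5f74_amp_lig_tt_min_0.sdf.gz "
    "#-6.28497"
)
record = parse_types_line(example)
print(record)
```

Counting the training pairs then amounts to counting the non-empty lines of the train0 file, since each line is one protein-ligand pose.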

mengliu1998 commented 2 years ago

Hi @mattragoza, thank you for the response, and sorry for my late reply. This answers my question.