mattragoza / LiGAN

Deep generative models of 3D grids for structure-based drug discovery
GNU General Public License v2.0

model cannot be trained #28

Closed · yuanqidu closed 2 years ago

yuanqidu commented 2 years ago

When I attempted to train the model with train.py, using the provided simple example it2_tt_0_lowrmsd_valid_mols_test0_1000.types and the crossdock folder, training froze in the data loader at the next_batch step.
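For reference, the run was launched roughly as follows (the command and config path are illustrative; the actual data settings, such as the .types file and data_root, come from the config file):

python3 train.py config/train.config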

Looking forward to your help!

mattragoza commented 2 years ago

The data file that references the structures in the data/crossdock2020/ directory is data/crossdock2020/selected_test_targets.types. This only includes the 10 targets that we selected for test evaluations, so I would not advise training a model on it.

Instructions for downloading the full crossdocked dataset can be found here: https://github.com/gnina/models/tree/master/data/CrossDocked2020

If you would like to train a model from scratch using the full CrossDocked2020 data set, I can make the required data files (.types files) available.

yuanqidu commented 2 years ago

Thanks for your help! Yes, I did try to use crossdock for training, but the code just froze at next_batch inside the libmolgrid library.

mattragoza commented 2 years ago

Can you provide your conda environment?

yuanqidu commented 2 years ago

My conda environment uses torch 1.10.0+cu102.

mattragoza commented 2 years ago

How did you install openbabel and libmolgrid?

RMeli commented 2 years ago

@yuanqidu posted this error before editing the message above:

examples = self.ex_provider.next_batch(self.batch_size)
ValueError: Vector typer used with gninatypes files

Isn't this related to a mismatch between the types contained in the .gninatypes files for the CrossDocked2020 dataset (original typing scheme used in GNINA) and the new typing scheme that the refactored version of liGAN uses?

mattragoza commented 2 years ago

@RMeli Yes, that error message indicates that you are trying to train using gninatypes files or a molcache2 file, which are not compatible with the vectorized atom typing used by this project. If that's the case, you have to download the full crossdocked2020 dataset (the structure files, not the molcaches) and use that instead. You can reference the config/train.config file for how the data paths should be set up.

yuanqidu commented 2 years ago

Thanks for the catch... I loaded the wrong files from the crossdock dataset for that run, but that was not the main issue; I deleted the message because I corrected it soon after. So that was not the error. I am still stuck at the data loading step...

I installed openbabel via conda install openbabel -c conda-forge and molgrid via pip install molgrid.

Wait a second, I think you are right: I am still getting this "Vector typer used with gninatypes files" error even when I use the full crossdock dataset.

I basically downloaded the full CrossDocked2020 set and the types. I used the train and test file types/it2_tt_v1.1_0_test0.types and set data_root to the CrossDocked2020 dataset following the example in the train.config file, but I am still getting this error. By the way, I didn't see any types files under the full CrossDocked2020 set, only a separate types folder.

One observation: the above error can be circumvented with the very small test types file provided in THIS repo, but training still could not proceed and froze at next_batch.

mattragoza commented 2 years ago

Those train and test files reference the custom gninatypes format, which is not compatible with liGAN. You should use the following train and test files:

http://bits.csb.pitt.edu/files/it2_tt_0_lowrmsd_mols_train0.types
http://bits.csb.pitt.edu/files/it2_tt_0_lowrmsd_mols_test0.types

yuanqidu commented 2 years ago

Thanks so much. But it seems the problem is still there: when calling next_batch, it freezes indefinitely and eventually the process is killed. (screenshot attached)

It looks like this error is inside the libmolgrid library.

mattragoza commented 2 years ago

In that case, this is probably due to a mismatch between the conda-installed openbabel and the version used by the pip-installed molgrid. This is a known issue that can be resolved by building libmolgrid from source against the version of openbabel that you have installed on your system.
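For anyone hitting this, a rough sketch of such a source build follows (it assumes a standard CMake workflow and that the conda-installed openbabel is picked up via CMAKE_PREFIX_PATH; see the libmolgrid README for the authoritative steps):

# build libmolgrid from source against the openbabel already in the active conda env
git clone https://github.com/gnina/libmolgrid.git
cd libmolgrid
mkdir build && cd build
cmake .. -DCMAKE_PREFIX_PATH=$CONDA_PREFIX
make -j4
make install    # installs the library and, in typical setups, the python bindings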

yuanqidu commented 2 years ago

When I tried to manually build libmolgrid, I could not use the command apt install libeigen3-dev libboost-all-dev since I am on CentOS 7. Which openbabel version is compatible with the package? May I reinstall openbabel instead of the molgrid package?
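(For reference, the rough CentOS 7 equivalents of those apt packages would be something like the following; the package names are a guess, and eigen3-devel typically requires EPEL:)

sudo yum install -y epel-release
sudo yum install -y eigen3-devel boost-devel cmake3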

yuanqidu commented 2 years ago

Finally, after manually installing libmolgrid, I solved the problem.

But I have one more question: the files referenced in the given types for training are .sdf.gz, while the downloaded dataset contains gninatypes files, so there is a mismatch.

mattragoza commented 2 years ago

Great, I'm glad you were able to manually build libmolgrid.

Please follow the steps in the download_data.sh script to download the full crossdocked dataset, which should include the .sdf.gz files:

#!/bin/bash
# download and extract the CrossDocked2020 structure files
wget https://bits.csb.pitt.edu/files/crossdock2020/CrossDocked2020_v1.1.tgz -P data/crossdock2020/
tar -C data/crossdock2020/ -xzf data/crossdock2020/CrossDocked2020_v1.1.tgz
# download the train/test .types files that reference the structures
wget https://bits.csb.pitt.edu/files/it2_tt_0_lowrmsd_mols_train0.types -P data/crossdock2020/
wget https://bits.csb.pitt.edu/files/it2_tt_0_lowrmsd_mols_test0.types -P data/crossdock2020/

mattragoza commented 2 years ago

Also note that the types2xyz.py script that David mentioned in the libmolgrid issue you opened will not work for this project, since we use an alternative typing scheme. The gninatypes files contain atoms to which a different typing scheme has already been applied, so there is no way to use them here. You will need to download the original molecule files to train a generative model.

yuanqidu commented 2 years ago

Thanks, I did download the dataset following the bash script. But the folder contains many gninatypes files, while the liGAN code asks for sdf. I think there is a mismatch, and when I run the code it reports that the file cannot be opened/found.

(screenshots attached)

mattragoza commented 2 years ago

Ah, the problem is that the molecules in the downloaded crossdocked2020 dataset contain multiple poses, but we need them to be split out so that individual poses are in separate files. There is a script to do this, but the step is currently missing from the download_data.sh script; my apologies. I will update the script and let you know how to run it ASAP.

yuanqidu commented 2 years ago

Yes, thanks, please :)

mattragoza commented 2 years ago

The following commands will split out the poses in the sdf.gz files that you need for training:

# split multi-pose files into single-pose files
python scripts/split_sdf.py data/crossdock2020/it2_tt_0_lowrmsd_mols_train0.types data/crossdock2020
python scripts/split_sdf.py data/crossdock2020/it2_tt_0_lowrmsd_mols_test0.types data/crossdock2020

mattragoza commented 2 years ago

Just FYI, there are some issues with the data set that I am working to resolve. The train and test files will have to be updated to remove some bad/missing molecules.

yuanqidu commented 2 years ago

Thanks very much for your help!

yuanqidu commented 2 years ago

May I just remove the missing files from the types file?

yuanqidu commented 2 years ago

I have one more question about the method. Did you first extract the pocket from the protein, or did you use the full protein and ligand for conditional generation? If you did extract the pocket, how did you do so?

mattragoza commented 2 years ago

Yes, you can simply remove the missing data rows from the .types files. We provide the full protein as input to conditional generation, but only the binding pocket will fit in the grid bounds. The grid will be centered on the reference ligand, so it is assumed that the reference ligand is in the binding pocket.
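If it helps, a minimal shell sketch for dropping rows whose referenced files are missing follows (this is not part of liGAN; it assumes the path-like columns contain a '/' and are relative to data_root, and it should be run after the pose-splitting step so the per-pose files exist):

# drop .types rows that reference structure files missing under data_root
data_root=data/crossdock2020
in_types=$data_root/it2_tt_0_lowrmsd_mols_train0.types
out_types=$data_root/it2_tt_0_lowrmsd_mols_train0_filtered.types
> "$out_types"
while read -r line; do
    keep=1
    for tok in $line; do
        case "$tok" in
            */*) [ -f "$data_root/$tok" ] || keep=0 ;;
        esac
    done
    [ "$keep" -eq 1 ] && printf '%s\n' "$line" >> "$out_types"
done < "$in_types"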

mattragoza commented 2 years ago

There are functions to extract pockets for UFF minimization in the binding pocket context (see liGAN.molecules.get_rd_mol_pocket), but I would not recommend doing this for proteins that are input to the generative model.

wxfsd commented 2 years ago

Hello, I ran into the same problem. What is your openbabel version? How do I install libmolgrid manually? @yuanqidu Looking forward to your help!

mattragoza commented 2 years ago

Hello @yuanqidu @wxfsd, I wanted to update you on this issue as I am actively working to resolve it. The problem is that there is an incompatibility between the conda-installed openbabel and conda/pip-installed molgrid (they are the same binary). I am working on a conda build recipe that will hopefully resolve this issue (https://github.com/mattragoza/conda-molgrid), but it is still under construction.

I have provided an environment.yaml file in the conda-molgrid repo that you should be able to use to create a conda environment in which you can successfully build molgrid from source. Please let me know if you run into issues using this conda environment (if you do, please open an issue in the conda-molgrid repo).
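A hedged sketch of that workflow (file and environment names may differ; see the conda-molgrid repo for the actual instructions):

# create the build environment from the conda-molgrid repo, then build libmolgrid inside it
git clone https://github.com/mattragoza/conda-molgrid.git
cd conda-molgrid
conda env create -f environment.yaml
conda activate conda-molgrid    # the environment name is a guess; check environment.yaml
# then build libmolgrid from source inside this environment, as above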

wxfsd commented 2 years ago

Okay, I'm trying it; I will ask at that link (https://github.com/mattragoza/conda-molgrid) if I have any issues. Thank you very much. @mattragoza

mattragoza commented 2 years ago

@yuanqidu Also, I have uploaded new types files that have the problematic poses removed; you can find the links in the download_data.sh script.

yuanqidu commented 2 years ago

Thanks! May I ask how many problematic poses (what percentage) were removed?

Happy new year!

mattragoza commented 2 years ago

9 total poses were removed from the data files.