Closed yuanqidu closed 2 years ago
The data file that references the structures in the data/crossdock2020/ directory is data/crossdock2020/selected_test_targets.types. This only includes the 10 targets that we selected for test evaluations, so I would not advise training a model on it.
Instructions for downloading the full crossdocked dataset can be found here: https://github.com/gnina/models/tree/master/data/CrossDocked2020
If you would like to train a model from scratch using the full CrossDocked2020 data set, I can make the required data files (.types files) available.
Thanks for your help! Yes, I did try to use CrossDocked for training, but the code just froze at next_batch in the libmolgrid library.
Can you provide your conda environment?
My conda environment uses torch 1.10.0+cu102.
How did you install openbabel and libmolgrid?
@yuanqidu posted this error, before editing the message above:
examples = self.ex_provider.next_batch(self.batch_size)
ValueError: Vector typer used with gninatypes files
Isn't this related to a mismatch between the types contained in the .gninatypes files for the CrossDocked2020 dataset (original typing scheme used in GNINA) and the new typing scheme that the refactored version of liGAN uses?
@RMeli Yes, that error message indicates that you are trying to train using gninatypes files or a molcache2 file, which are not compatible with the vectorized atom typing used by this project. If that's the case, you have to download the full CrossDocked2020 dataset (the structure files, not the molcaches) and use that instead. You can reference the config/train.config file for how the data paths should be set up.
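As a rough illustration only, the data-related settings might look something like the fragment below. The key names here are guesses for illustration; config/train.config in the repo is the authoritative reference.

```
# hypothetical sketch -- check config/train.config for the real keys
data:
  data_root: data/crossdock2020
  train_file: data/crossdock2020/it2_tt_0_lowrmsd_mols_train0.types
  test_file: data/crossdock2020/it2_tt_0_lowrmsd_mols_test0.types
```

The important point is that the .types files must reference the original structure files (relative to data_root), not gninatypes or molcache2 files.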
Thanks for the catch... I did load the wrong files from the CrossDocked dataset for that run, BUT that was not the main issue; I corrected it soon after and deleted that message. So that was not the error. I am still stuck at the data loading step...
I installed openbabel via conda install openbabel -c conda-forge and molgrid via pip install molgrid
Wait a second, I think you are right, I am still getting this "Vector typer used with gninatypes files" error even when I use the full crossdock dataset.
I basically downloaded the full CrossDocked set and the types. I used the train and test file types/it2_tt_v1.1_0_test0.types and set data_root to the CrossDocked2020 dataset following the example in the train.config file, but I am still getting this error. BTW, I didn't see any types files inside the full CrossDocked2020 set, only a separate types folder.
Some observations: the above error can be circumvented with the very small test types file provided in THIS repo, BUT training still cannot proceed and freezes at next_batch.
Those train and test files reference the custom gninatypes format that is not compatible with liGAN. You should use the following train and test files:
http://bits.csb.pitt.edu/files/it2_tt_0_lowrmsd_mols_train0.types
http://bits.csb.pitt.edu/files/it2_tt_0_lowrmsd_mols_test0.types
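For reference, each row of these .types files pairs a receptor structure with a ligand pose. Below is a minimal parsing sketch, assuming the usual gnina/molgrid column order (label, affinity, RMSD, receptor path, ligand path); the column layout and the example paths are assumptions, so check a few real lines first.

```python
def parse_types_line(line):
    # Assumed column order: label, affinity, RMSD, receptor, ligand.
    # Real files may carry extra columns; inspect the file to confirm.
    cols = line.split()
    return {
        "label": int(cols[0]),
        "affinity": float(cols[1]),
        "rmsd": float(cols[2]),
        "rec_file": cols[3],
        "lig_file": cols[4],
    }

# hypothetical example row (paths are placeholders)
row = parse_types_line("1 6.2 0.85 POC/rec.pdb POC/lig_tt_docked_0.sdf.gz")
```

Note that the ligand paths end in .sdf.gz, i.e. they reference the original structure files rather than gninatypes files.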
Thanks so much. But it seems the error is still there: when calling next_batch, it freezes forever and the process is later killed.
It looks like this error is inside the libmolgrid library.
In that case, this is probably due to a mismatch between the conda-installed openbabel and the version used by pip-installed molgrid. This is a known issue that is resolvable by building libmolgrid from source using the version of openbabel that you have installed on your system.
When I tried to build libmolgrid manually, I could not use the command apt install libeigen3-dev libboost-all-dev since I am on CentOS 7. What openbabel version is compatible with the package? Can I reinstall openbabel instead of the molgrid package?
Finally, after manually installing libmolgrid, I solved the problem.
But I have one more question: the files referenced in the given types for training are .sdf.gz, while the downloaded dataset contains gninatypes files, so there is a mismatch.
Great, I'm glad you were able to manually build libmolgrid.
Please follow the steps in the download_data.sh script to download the full crossdocked dataset, which should include .sdf.gz:
#!/bin/bash
wget https://bits.csb.pitt.edu/files/crossdock2020/CrossDocked2020_v1.1.tgz -P data/crossdock2020/
tar -C data/crossdock2020/ -xzf data/crossdock2020/CrossDocked2020_v1.1.tgz
wget https://bits.csb.pitt.edu/files/it2_tt_0_lowrmsd_mols_train0.types -P data/crossdock2020/
wget https://bits.csb.pitt.edu/files/it2_tt_0_lowrmsd_mols_test0.types -P data/crossdock2020/
Also note that the types2xyz.py script that David mentioned in the libmolgrid issue you opened will not work for this project, since we use an alternative typing scheme. The gninatypes files contain atoms that already have a different typing scheme applied, so there is no way to use them for this project. You will need to download the original molecule files to train a generative model.
Thanks, I did download the dataset following the bash script. But the folder contains many gninatypes files, while the liGAN code asks for sdf files. I think there is a mismatch: when I run the code, it reports that the file cannot be opened/found.
Ah, the problem is that the molecule files in the downloaded CrossDocked2020 dataset contain multiple poses, but we need them split out so that individual poses are in separate files. There is a script to do this, but the step is currently missing from the download_data.sh script, my apologies. I will update the script and let you know how to run it ASAP.
Yes, thanks, please :)
The following commands will split out the poses in the sdf.gz files that you need for training:
# split multi-pose files into single-pose files
python scripts/split_sdf.py data/crossdock2020/it2_tt_0_lowrmsd_mols_train0.types data/crossdock2020
python scripts/split_sdf.py data/crossdock2020/it2_tt_0_lowrmsd_mols_test0.types data/crossdock2020
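In case it helps to see what the split step does conceptually, here is a simplified single-file sketch. The real scripts/split_sdf.py takes a .types file and processes every molecule it references; the function below only handles one file, and the output naming is an assumption, not the script's actual convention.

```python
import gzip
import os

def split_sdf(sdf_path, out_dir):
    # Simplified, hypothetical version of the pose-splitting step:
    # SDF records are terminated by a line containing "$$$$", so we
    # split on that delimiter and write each pose to its own file.
    opener = gzip.open if sdf_path.endswith(".gz") else open
    with opener(sdf_path, "rt") as f:
        text = f.read()
    records = [r for r in text.split("$$$$\n") if r.strip()]
    base = os.path.basename(sdf_path)
    base = base.replace(".sdf.gz", "").replace(".sdf", "")
    os.makedirs(out_dir, exist_ok=True)
    out_paths = []
    for i, rec in enumerate(records):
        out = os.path.join(out_dir, "%s_%d.sdf" % (base, i))
        with open(out, "w") as f:
            f.write(rec + "$$$$\n")
        out_paths.append(out)
    return out_paths
```

After this step, each pose index referenced in the .types files resolves to its own single-pose .sdf file on disk.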
Just FYI, there are some issues with the data set that I am working to resolve. The train and test files will have to be updated to remove some bad/missing molecules.
Thanks very much for your help!
May I just remove the missing files from the types file?
I have one more question about the method. Did you first extract the pocket from the protein or did you use the full protein and ligand for conditional generation? If you did extract pocket, how did you do so?
Yes, you can simply remove the missing data rows from the .types files. We provide the full protein as input to conditional generation, but only the binding pocket will fit in the grid bounds. The grid will be centered on the reference ligand, so it is assumed that the reference ligand is in the binding pocket.
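Removing those rows can be scripted. Here is a rough sketch, assuming the last two whitespace-separated columns of each row are the receptor and ligand paths relative to data_root; this is a hypothetical helper, not part of liGAN.

```python
import os

def filter_types_file(types_path, data_root, out_path):
    # Keep only rows whose receptor and ligand files exist on disk.
    # Assumes the last two columns of each row are file paths
    # relative to data_root (an assumption about the format).
    kept = dropped = 0
    with open(types_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            cols = line.split()
            if len(cols) < 2:
                continue
            rec = os.path.join(data_root, cols[-2])
            lig = os.path.join(data_root, cols[-1])
            if os.path.exists(rec) and os.path.exists(lig):
                fout.write(line)
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

Running this once over the train and test .types files produces cleaned copies with only the rows whose structures are actually present.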
There are functions to extract pockets for UFF minimization in the binding pocket context (see liGAN.molecules.get_rd_mol_pocket), but I would not recommend doing this for proteins that are input to the generative model.
Hello, I ran into the same problem. What is your openbabel version? How did you install libmolgrid manually? @yuanqidu Looking forward to your help!
Hello @yuanqidu @wxfsd, I wanted to update you on this issue as I am actively working to resolve it. The problem is that there is an incompatibility between the conda-installed openbabel and conda/pip-installed molgrid (they are the same binary). I am working on a conda build recipe that will hopefully resolve this issue (https://github.com/mattragoza/conda-molgrid), but it is still under construction. I have provided an environment.yaml file in the conda-molgrid repo that you should be able to use to create a conda environment in which you can successfully build molgrid from source. Please let me know if you run into issues using this conda environment (if you do, please open an issue in the conda-molgrid repo).
Okay, I'm trying, I will ask on that link (https://github.com/mattragoza/conda-molgrid) if I have any issues. Thank you very much. @mattragoza
@yuanqidu also, I have uploaded new types files that have the problematic poses removed; you can find the links in the download_data.sh script.
Thanks! May I ask how many (percentage) problematic files were removed?
Happy new year!
9 total poses were removed from the data files.
When I attempted to train with train.py using the provided simple example it2_tt_0_lowrmsd_valid_mols_test0_1000.types and the crossdock folder, the model froze in the dataloader's next_batch step.
Looking forward to your help!