gnina / scripts


Dependencies: Caffe #42

Closed CHANG-Shaole closed 1 year ago

CHANG-Shaole commented 3 years ago

I installed all the dependencies following the README.md, except Caffe. When I tried to run train.py, I found it also depends on Caffe. So I tried to install Caffe in the environment, but it conflicts with other packages. Could anyone tell me a solution? Thanks a lot! :)

dkoes commented 3 years ago

gnina comes with its own version of caffe. If you install vanilla caffe it will not work. Make sure your PYTHONPATH is pointing to the version that is installed with gnina.
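A quick way to check which caffe Python is picking up (the expected location is just illustrative of a gnina build tree):

import caffe
print(caffe.__file__)  # should resolve inside your gnina build, not a system-wide caffe

If it prints a system-wide installation, adjust PYTHONPATH to the caffe python directory built with gnina before running train.py.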

CHANG-Shaole commented 3 years ago

Thanks a lot for your kind advice. I have done the configuration for gnina/scripts and it seems to work well. Now I want to try predict.py to see the prediction ability of gnina; the paper I'm referring to is https://doi.org/10.1021/acs.jcim.6b00740. I found some caffe models and weights in the gnina/models folder; could I directly use those models, weights, and the data inside for my first prediction in gnina? If yes, could you recommend a model, weights, and some data for my first prediction?

The work is awesome! Thank you very much for your contribution to gnina!!!

dkoes commented 3 years ago

Yes, you can, but there are many models built into gnina itself, which is simpler to use. Why not use gnina with the --score_only option?
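For example (file names are placeholders):

gnina -r receptor.pdb -l docked_poses.sdf --score_only

This rescores the given ligand poses against the receptor with the built-in CNN, without doing any docking.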

CHANG-Shaole commented 3 years ago

Yes, indeed it's simpler, and I think I have already done my first try with gnina itself. That's why I want to explore the application more (try to use gnina/scripts), because I find this project quite interesting. If you have time, could you recommend some models, weights, or data? If not, I can also explore it myself! Thanks a lot !!! :)

CHANG-Shaole commented 3 years ago

Sorry to bother you again! I am using gnina/scripts/train.py to get a .caffemodel weights file. The model and data I used are gnina/models/acs2018/default2017.model and gnina/models/data/csar, respectively. I don't really understand how to set -p (PREFIX) and -d (DATA_ROOT) when configuring train.py, so I don't know how to set the train/test files. I'm not very good with script commands; could you give me some advice on how to configure train.py? Really, really THANK YOU !!! :)

dkoes commented 3 years ago

The csar data is really old and is too small of a training set. I would avoid it in most cases. Grab the crossdocked2020 set (https://bits.csb.pitt.edu/files/crossdock2020/). There are models here: https://github.com/gnina/models/tree/master/crossdocked_paper

Descriptions of prefix and data_root are in the readme: https://github.com/gnina/scripts
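As an illustration (paths are hypothetical): if your alltrain0.types, alltest0.types, etc. sit in the current directory and the relative paths inside them resolve under /data/csar, the prefix is the part of the types file names before train<fold>/test<fold>:

python3 train.py -m models/default2017.model -p all -d /data/csar/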

CHANG-Shaole commented 3 years ago

Thank you very much for your reply! I'm now downloading the "crossdock2020" dataset for the next training run. In the meantime, I'm trying to train a model using the "csar" dataset (even though it's small, just for a first training), and the model is "default2017.model", as described before.

I encountered a runtime error. (The run command is: python train.py -m models/default2017.model -p 'all' -g 1)

Traceback (most recent call last):
  File "train.py", line 932, in <module>
    results = train_and_test_model(args, train_test_files[i], outname, cont)
  File "train.py", line 441, in train_and_test_model
    solver = caffe.get_solver(solverf)
ValueError: Could not open completerec

I can't find any useful information about this particular error online. So I want to ask if you have ever encountered it, or could you give me some advice on solving it?

Sincerely!

dkoes commented 3 years ago

You have specified a model with an explicit type map (called completerec). completerec and completelig can be found here: https://github.com/gnina/models/tree/master/crossdocked_paper

CHANG-Shaole commented 3 years ago

I hope you still have patience for my questions. Thanks to you, after I moved "completelig" and "completerec" to my "scripts" folder, the previous error was solved. But there is a new error when I try to run train.py:

Traceback (most recent call last):
  File "train.py", line 940, in <module>
    results = train_and_test_model(args, train_test_files[i], outname, cont)
  File "train.py", line 446, in train_and_test_model
    solver = caffe.get_solver(solverf)
ValueError: Missing molecular data in line: 1 data/project/scripts/set2/297/rec.gninatypes data/project/scripts/set2/297/docked_0.gninatypes # 0.757109 -9.987190

I believe this error is related to the data path, so I tried both relative and absolute paths for the ".gninatypes" files. Unfortunately, the error is still there. Maybe it is simple, but it has really blocked me, so could you give me a solution?

Sincerely!

dkoes commented 3 years ago

data_root needs to point to where your data directory is and the data has to exist
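In other words, each relative path in the .types file is joined onto data_root. A rough sketch of the check you can do yourself (paths are illustrative):

import os
data_root = "/data/project/scripts"    # what you pass with -d
line_path = "set2/297/rec.gninatypes"  # what appears in the .types file
print(os.path.exists(os.path.join(data_root, line_path)))  # must be True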

CHANG-Shaole commented 3 years ago

I think I've made data_root point to the data directory, and the data indeed exists. (If my understanding is correct: the .types file contains a list of paths to the .gninatypes files, while the .gninatypes files are the data.)

In my scripts directory, I have the .types files and also the .gninatypes files. So it feels strange that the error is still there if the path is right and the data exists...

In addition, I also encountered the same error when running predict.py... It's evening in my country now; otherwise I could show you the error more clearly. Pretty strange...

CHANG-Shaole commented 3 years ago

Command:

python3 train.py -m 'models/default2017_change.model' -p 'all' -d '/data/project/2017/'

data location: (csar)

Under the folder with the scripts, /data/project/2017/ includes "alltrain0.types, alltrain1.types, alltrain3.types, alltest0.types, alltest1.types, alltest2.types" and the "set1, set2, set3" folders (.gninatypes files).

Error description:

Traceback (most recent call last):
  File "train.py", line 940, in <module>
    results = train_and_test_model(args, train_test_files[i], outname, cont)
  File "train.py", line 446, in train_and_test_model
    solver = caffe.get_solver(solverf)
ValueError: Missing molecular data in line: 1 set2/297/rec.gninatypes set2/297/docked_0.gninatypes # 0.757109 -9.987190

Could you help me find some clues from the command or the data locations? Thanks a lot!!!

CHANG-Shaole commented 3 years ago

Thank you very much for the help these days!

I thought about this "path" error almost all day. For now, I think it is related to the "MolGridData" layer. I think this layer does the data pre-processing and also loads the data. Maybe that's a clue.

However, I'm not sure what this layer does internally, and as for the deep learning framework, I'm not familiar with caffe either. So I can't find a solution myself. I would really appreciate it if you could provide a solution or some clues about this error.

dkoes commented 3 years ago

In order for train.py to set the data root in the model file, it must be specified with the placeholder value DATA_ROOT (so the script can do a find/replace using the commandline argument). Is that the case in your model file or has the root_folder property been set to something else?
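For reference, the molgrid_data_param in the shipped model files contains placeholder values that train.py substitutes; abridged, it looks roughly like this:

molgrid_data_param {
  source: "TRAINFILE"
  root_folder: "DATA_ROOT"
  ...
}

If root_folder has been hard-coded to something else, the -d argument is never applied.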

CHANG-Shaole commented 3 years ago

Thank you very much for your answer about the "path" error!

Today I used a different model and it runs normally. Even though it's quite strange, at least it's finally running.

Another point I'm confused about is the ".gninatypes" file. I saw in your paper that you use "grid" images for the CNN to process. I think there should be two types of files for a molecule: one is the traditional molecular file, like a .pdb file, and the other is the "grid" molecular file, which is the CNN input.

So I want to ask what type the ".gninatypes" file is (like a ".pdb" file or a "grid" file). In your paper, I noticed the "MolGridData" layer is used to do the gridding, so I believe the data becomes the "grid" type after this layer.

My field is not biochemistry but image processing, so maybe my understanding is not correct. Could you explain the ".gninatypes" file and the data pre-processing (just the pipeline) to me?

dkoes commented 3 years ago

gninatypes contains x,y,z and type information of a molecule. It is the minimal amount of information needed from a molecular file to create a grid. Grids are generated on the fly as needed.
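Based on that description, a minimal sketch of reading one in Python (this assumes the commonly described binary layout of three 4-byte floats plus one 4-byte int per atom; verify against your own files):

import struct

with open("rec.gninatypes", "rb") as f:  # hypothetical file name
    raw = f.read()

# each 16-byte record: x, y, z coordinates and an integer atom type
atoms = [struct.unpack("<fffi", raw[i:i + 16]) for i in range(0, len(raw), 16)]
print(len(atoms), atoms[0])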

CHANG-Shaole commented 3 years ago

If I have a molecular file (like a .pdb file), how can I generate a .gninatypes file from it?

RMeli commented 3 years ago

To convert molecular files into .gninatypes files you can use the gninatyper utility program that is compiled with GNINA.

In case it is useful, you can find an example pipeline I used in gsoc19/mltraining, where ligand and receptor files are typed (and subsequently collected together into a .molcache2 file). [Please ignore comments about Singularity; those apply to my own setup.]
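The invocation is along these lines (file names are placeholders; check gninatyper's help output for the exact arguments):

gninatyper rec.pdb rec.gninatypes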

RMeli commented 3 years ago

For the error

ValueError: Missing molecular data in line: 1 set2/297/rec.gninatypes set2/297/docked_0.gninatypes # 0.757109 -9.987190

I think the problem might be that you are using a model that also requires affinity data (has_affinity: true defined in molgrid_data_param for the MolGridData layer), such as default2017, while your .types files only define the pose label.

Today I used a different model, it's can be run normally. Even though it's quite strange, finally it's at least running.

Does the model that runs normally use has_affinity: false? That would explain the previous error and why the other model works fine.

CHANG-Shaole commented 3 years ago

Thank you for joining this issue! I think your explanation of the error should be correct. The models that have the error are models/acs2018/default2017.model and models/crossdocked_paper/dense.model, while models/refmodel3/refmodel3.model runs fine.

For now, I am doing the visualization of a certain model (currently models/refmodel3/refmodel3.model) to see the feature maps of some layers, so I need to execute the model layer by layer. I know the .gninatypes data becomes a "grid" format after the MolGridData layer, but I don't know how that really works. (My field is not computational chemistry.)

I'm trying to use the molgrid package to convert the .gninatypes to a "grid" type, but I am not familiar with this package yet. Could you give a brief example of how to do the conversion with this package, or do you have any other way to do this task?

dkoes commented 3 years ago

You can look through the libmolgrid tutorials: https://gnina.github.io/libmolgrid/tutorials.html

CHANG-Shaole commented 3 years ago

Thanks for the link. Today I used the tutorials to do the gridding and then the visualization. It works well, but I still have a little uncertainty about my "grid" output.

The code I used is:

import molgrid
import torch

# provide shuffled, balanced batches from the .types file
e = molgrid.ExampleProvider(data_root=datadir, balanced=True, shuffle=True)
e.populate(fname)

gmaker = molgrid.GridMaker()
dims = gmaker.grid_dimensions(e.num_types())
tensor_shape = (batch_size,) + dims

input_tensor = torch.zeros(tensor_shape, dtype=torch.float32)
float_labels = torch.zeros(batch_size, dtype=torch.float32)

batch = e.next_batch(batch_size)
gmaker.forward(batch, input_tensor, 0, random_rotation=False)
input_numpy = input_tensor.numpy()

The data_root points to the "csar" folder and fname to the "alltest0.types" file. After I ran the above code, I found input_numpy.shape = (50, 28, 48, 48, 48), while after the MolGridData layer the shape should be (50, 32, 48, 48, 48), so I just appended the extra channels by hand.

But I really don't know where the 28 (e.num_types() = 28) comes from; could you explain this to me?

dkoes commented 3 years ago

That's the number of atom types. The default with molgrid is 14 receptor and 14 ligand types. You can inspect what they are by calling get_type_names.
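Following that suggestion, something like this should list them (datadir and fname as in the snippet above; this assumes get_type_names is exposed on the provider):

import molgrid

e = molgrid.ExampleProvider(data_root=datadir)
e.populate(fname)
print(e.num_types())       # 28 = 14 receptor + 14 ligand types by default
print(e.get_type_names())  # e.g. AliphaticCarbonXSHydrophobe, ...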

CHANG-Shaole commented 3 years ago

Thanks for the explanation!

As I described yesterday, in order to fit the model input, I copied some channel values to take the tensor from (50, 28, 48, 48, 48) to (50, 32, 48, 48, 48). But you said the channels are different atom types, so I believe just copying some channels is not appropriate (it may even be a mistake).

So could you tell me a suitable way to change the original tensor shape (50, 28, 48, 48, 48) to the input shape (50, 32, 48, 48, 48)? :)

dkoes commented 3 years ago

If you want a different array of atom types, configure the molgridder to provide them.

CHANG-Shaole commented 3 years ago

Thanks a lot for your help. Actually, I have another question about the dataset. Maybe in the field of computational chemistry it's quite simple.

For a line of data in the .types files, such as: "1 set2/297/rec.gninatypes set2/297/docked_0.gninatypes # 0.757109 -9.987190"

From the README.md, I know the second and third values set2/297/rec.gninatypes, set2/297/docked_0.gninatypes are the paths of the receptor and ligand, and the first value is the "label". (My understanding of the label: 1 means the receptor and ligand are binders; 0 means they are non-binders.) However, for the fourth value 0.757109 and the fifth value -9.987190, I think they are about the binding pose, but I don't know what they really mean.

So could you please tell me something about the .types file? (especially the fourth and fifth values) :)

CHANG-Shaole commented 3 years ago

To convert molecular files into .gninatypes files you can use the gninatyper utility program that is compiled with GNINA.

In case it is useful, you can find an example pipeline I used in gsoc19/mltraining, where ligand and receptor files are typed (and subsequently collected together into a .molcache2 file). [Please ignore comments about Singularity; those apply to my own setup.]

The aim I want to generate a .gninatypes file is to use the molgrid package to generate the corresponding gridding file, but this gridding file should have a whole field view of the protein, not just a part of it. (I think you set the standard size (or default size) to (24Å, 24Å, 24Å) with the resolution 0.5Å, so the grid size is (48, 48, 48))

Such as, if a protein length is about 60Å, what I want to do is to generate a grid size (120, 120, 120) with the default resolution of 0.5Å, to completely present the protein.

I wanna know if you have a suitable way to do this task? :)

Besides, I find the gnina is really powerful! Thanks a lot!!!

RMeli commented 3 years ago

So could you please tell me something about the .types file? (especially the fourth and fifth values)

The .types file contains information for training and the exact structure depends on the model definition (your problem above was caused by the model requiring affinity information, which was not present in the .types file).

For the particular files you are using, the label distinguishes low and high RMSD poses (compared to the known crystal structure). A pose with RMSD less than 2Å is labelled as positive (correct pose), while a pose with RMSD higher than 4Å is labelled as negative. You can find the details here or here. The values after the receptor and ligand files are commented out and therefore not used for training/inference. However, the first value represents the RMSD of the pose (you can see that for poses labelled 1 this value is always lower than 2Å; this value was used for labelling), while the other value is the Vina score (docking score) assigned to the pose by the docking software used to generate it.
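As a small sketch, such a line splits at the comment marker (variable names here are just descriptive):

line = "1 set2/297/rec.gninatypes set2/297/docked_0.gninatypes # 0.757109 -9.987190"
fields, comment = line.split("#")
label, rec, lig = fields.split()          # pose label, receptor path, ligand path
rmsd, vina = map(float, comment.split())  # pose RMSD (Å) and Vina docking score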

The aim I want to generate a .gninatypes file is to use the molgrid package to generate the corresponding gridding file, but this gridding file should have a whole field view of the protein, not just a part of it.

.gninatypes files still contain the whole protein.

GNINA/libmolgrid only requires the atom types and their positions in order to compute the atomic densities. Therefore, most of the information contained in PDB files (such as residue number, residue name, ...) is not used and only slows down I/O operations. .gninatypes files are pre-processed files that contain only the information GNINA/libmolgrid needs, in binary format.

If you want to change the size and the resolution of the grid, you either have to change the model definition (change the molgrid_data_param of the MolGridData layer) if you are using GNINA or play around with the parameters of GridMaker if you are using libmolgrid.

dkoes commented 3 years ago

I'll emphasize that you don't have to use gninatypes files; you can provide a regular molecule file. This will be slower, but it takes fewer steps to get data into your model.

CHANG-Shaole commented 3 years ago

Thanks a lot for such a detailed explanation. Actually, I want to use the molgrid package to generate a grid-type molecular file (or say a voxel type; in that case it will be easier to process the protein further).

Thanks to you, I have generated the grid type of a protein, with 28 atom types/channels. With the help of visualization in pymol, I found that the gridded protein (channel 1, which should be AliphaticCarbonXSHydrophobe) looks just like the protein surface in pymol. That's wonderful.

I believe if I merge all the atom types into just one channel, the gridded protein will look even more like the protein surface in pymol. Moreover, I found that a grid dimension that shows the whole protein is also hard to find.

So there are another two general questions,

  1. How to make the default 28 channels (14 for receptor, 14 for ligand) into only two;
  2. How to easily find a dimension for a certain protein (the whole view).

The code I used for generating the 28 default channels:

import molgrid

# use the libmolgrid ExampleProvider to obtain shuffled, balanced, and stratified batches from a file
e = molgrid.ExampleProvider(data_root=datadir)
e.populate(fname)
# initialize libmolgrid GridMaker
gmaker = molgrid.GridMaker(resolution=7, dimension=70)  # , radius_type_indexed=True)
dims = gmaker.grid_dimensions(e.num_types())
# dims = gmaker.grid_dimensions(1)
tensor_shape = (batch_size,) + dims
tensor_shape

I believe it should be done by modifying molgrid.GridMaker and molgrid.ExampleProvider, but I can't find more useful information on the internet...

CHANG-Shaole commented 3 years ago

Besides, I also use a resolution of 0.5Å and dimension size 59.5Å to generate a relatively clear structure of the protein.

gmaker = molgrid.GridMaker(resolution=0.5, dimension=59.5)

So the generated grid size is (120, 120, 120).

What I want to do is create an "equipotential surface" from these grid values, to show the protein pocket information more clearly. But I found that there are 278612 unique values in the grid (too many), ranging from 0.0 to 1.6504064.

Does it make sense if I create the "equipotential surface" by using a certain grid value or an interval of values (such as from 1.2 to 1.5)?

dkoes commented 3 years ago

You can use a FileMappedGninaTyper and provide it a file where all the gnina types are on one line so there is only a single atom type, although that obviously will throw out a lot of information. If you don't want Gaussian densities, you can set binary=True in the GridMaker class.
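A sketch of that approach (the map file name is hypothetical; it would contain a single line listing all the gnina atom type names, so each molecule collapses into one channel):

import molgrid

rec_typer = molgrid.FileMappedGninaTyper("single_type_map.txt")
lig_typer = molgrid.FileMappedGninaTyper("single_type_map.txt")
e = molgrid.ExampleProvider(rec_typer, lig_typer, data_root=datadir)
e.populate(fname)
# binary=True gives occupancy values instead of Gaussian densities
gmaker = molgrid.GridMaker(resolution=0.5, dimension=59.5, binary=True)

This would yield two channels in total: one for the receptor and one for the ligand.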

CHANG-Shaole commented 2 years ago

Thank you so much for the help in the past few weeks !!! @dkoes @RMeli I was offline last week. Recently, I have been studying your new paper. It's more powerful than the older one.

I have a question about the model. From what I understand, in the new stage, for training you used a joint model for both pose selection (classification) and affinity prediction (regression). The CNN input is the complex grid image (the 3D grid "image" with 28 atom types) and the CNN output is a set of characteristic values for pose and affinity.

But how do we predict a new binding pose and affinity using the model? I think the input should be the protein and ligand without their relative position, but the output should be a certain binding position for the protein and ligand, not a value. So I'm just confused about how you do that.

CHANG-Shaole commented 2 years ago

I also have another question about affinity prediction. In your paper, the pK was used as the evaluation metric for affinity prediction; I believe the pK value can represent the binding strength of the protein-ligand complex. But when you described the affinity prediction results, you mentioned the Pearson coefficient (Pearson R) many times.

I am a layman in this field, but could you try to explain to me what the Pearson R means in that context (and maybe the pK too)?

Sincerely

RMeli commented 2 years ago

But how do we predict a new binding pose and affinity using the model, because I think the input should be the protein and ligand without their relative position. But the output should be a certain binding position for the protein and ligand, not a value. So I'm just confused how did you do that?

The CNN scoring function takes a binding pose as input and outputs a score (in the newer models, both the pose score and the binding affinity are produced). This is only one ingredient of molecular docking (generating different binding poses and ranking them according to a scoring function). In order to perform molecular docking, you can use GNINA.

Within GNINA there are already pre-trained models, so it is good to go. You can read in the README how to perform molecular docking, but you might be interested in this excellent tutorial by David and the associated notebook, which explains how to perform docking with GNINA with the pre-trained models (but not how to train your own models).

However, you can train your own models with the train.py script in this repository and use your own model and weight in GNINA using the options --cnn_model and --cnn_weights:

  --cnn_model arg                  caffe cnn model file; if not specified a 
                                   default model will be used
  --cnn_weights arg                caffe cnn weights file (*.caffemodel); if 
                                   not specified default weights (trained on 
                                   the default model) will be used

The CNN scoring function can be used at different stages of docking. Figure 1 in this paper shows how the CNN scoring function can be used at different stages of the docking pipeline (the default is rescore, but you can change that with the --cnn_scoring option of GNINA).
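For example, a basic docking run with CNN rescoring (file names are placeholders):

gnina -r receptor.pdb -l ligand.sdf --autobox_ligand crystal_ligand.sdf --cnn_scoring rescore -o docked.sdf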

I am a layman in this field, but could you try to explain to me what's the meaning of the Pearson R in that state (maybe pK also)?

The pK is -log10(K) where K is the inhibition or dissociation constant (the logarithm is there because the values of K usually span many orders of magnitude). For some complexes, these are known experimentally (for example from the PDBbind database) and such experimental values are used to train the models (together with the pose label).

The Pearson's correlation coefficient is used to assess the (linear) correlation between the known experimental values and the values predicted by the CNN model.
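As a tiny numeric sketch of both quantities (all values made up):

import numpy as np
from scipy.stats import pearsonr

K = 1e-9            # a 1 nM dissociation constant
pK = -np.log10(K)   # = 9.0

experimental = np.array([9.0, 6.2, 7.5, 4.8])  # known pK values
predicted = np.array([8.4, 6.6, 7.1, 5.3])     # CNN predictions
r, _ = pearsonr(experimental, predicted)
print(pK, r)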

CHANG-Shaole commented 2 years ago

Thanks a lot for such a clear explanation! It indeed helps me a lot. @RMeli @dkoes I also read the paper recently. So for now, my understanding is that the trained CNN model (including the weights) plays the role of a scoring function, like Vina: the conformations of the small molecule are generated by the Monte Carlo search, and the CNN model is used to select the better conformations. I don't know if my understanding is correct; if not, please correct it. In the meantime, I am also studying the excellent presentation, so I wonder if you could share your slides. It would help me understand more quickly. If not, that's also ok. :)

RMeli commented 2 years ago

Yes, with the --cnn_scoring rescore option (which is the default one) GNINA uses the CNN to re-score the poses obtained by Monte Carlo search (in the same way that smina/AutoDock Vina do).

However, you can use the CNN scoring function at other stages of the docking pipeline, as shown in Figure 1 of the GNINA 1.0 paper. With --cnn_scoring refine the CNN scoring function is also used for refinement of the poses obtained by Monte Carlo search. With --cnn_scoring all, the CNN scoring function can also be used to guide the Monte Carlo search, but this is way too slow in practice.

I believe you can find David's slides here.

CHANG-Shaole commented 2 years ago

After reading the cross-docking paper, and following Prof. David's recommendation (several weeks ago), I feel that I should also train a model with the CrossDocked2020 dataset. For now, I have downloaded CrossDocked2020_types.tar.gz and CrossDocked2020.tgz. After unzipping these files, I believe I have both the .types and .gninatypes files.

There is a line of data in the .types file: 0 -0 7.14898 1A02_HUMAN_25_199_pep_0/1ao7_A_rec_0.gninatypes 1A02_HUMAN_25_199_pep_0/1ao7_A_rec_2gj6_3ib_lig_it1_tt_docked_0.gninatypes #30.8685

I believe it has both pose and affinity data, but I'm not sure what each value means. Could you please briefly describe it to me? Besides, I can't find the receptor file (.gninatypes) in the corresponding folder, such as 1ao7_A_rec_0.gninatypes, but I can find 1ao7_A_rec_2gj6_3ib_lig_it1_tt_docked_0.gninatypes. I'm just confused about why that is.

CHANG-Shaole commented 2 years ago

Recently I have also been studying Prof. David's presentation and the GNINA 1.0 paper. The presentation is truly wonderful, but because it's not my field, I still have several questions. One is how the Monte Carlo search is used to sample ligand conformations. In addition, I noticed that you use a classification task for pose selection, so the pose score output should be 0 or 1. But the GNINA software outputs a CNNscore (from 0 to 1) for the probability of a good pose (RMSD < 2Å). I think that looks like a regression output, not a classification one. I'm very confused about this; could you please explain it to me?

dkoes commented 2 years ago

Monte Carlo search is built into gnina. It is the sampling process that drives docking. The degrees of freedom of the ligand (translation, rotation, and torsion angles) are randomly mutated and evaluated. New poses are retained according to the Metropolis criteria.
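For intuition only, the Metropolis acceptance step looks roughly like this (a generic sketch, not gnina's actual code; the temperature parameter is schematic):

import math, random

def metropolis_accept(e_old, e_new, temperature=1.0):
    # downhill moves are always kept; uphill moves survive with Boltzmann probability
    if e_new <= e_old:
        return True
    return random.random() < math.exp(-(e_new - e_old) / temperature)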

The labels of a classification task are binary, but the predicted values need not be. The binary cross entropy loss works with binary labels and a numerical score for the prediction. This is very standard.
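A minimal illustration of that point (made-up numbers, using PyTorch as in the earlier snippets):

import torch

score = torch.tensor([0.9, 0.2, 0.6])  # continuous predictions in (0, 1)
label = torch.tensor([1.0, 0.0, 1.0])  # binary pose labels
loss = torch.nn.functional.binary_cross_entropy(score, label)
print(loss.item())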

CHANG-Shaole commented 2 years ago

The labels of a classification task are binary, but the predicted values need not be. The binary cross entropy loss works with binary labels and a numerical score for the prediction. This is very standard.

Okay, thanks for the reply; that's it. I believe I asked a stupid question, and it's my own field...

CHANG-Shaole commented 2 years ago

Sorry to bother you again! @RMeli @dkoes

Recently I tried to use the CrossDocked2020 dataset to do the model training. As instructed, I downloaded three files from the link: CrossDocked2020.tgz, CrossDocked2020_receptors.tgz, and CrossDocked2020_types.tar.gz. For training, I unzipped the first two files into the same folder. In addition, after unzipping CrossDocked2020_types.tar.gz, I used the it0 series as the training set. The model I used was the dense model in models/crossdocked_paper.

When I started the training process, I encountered an error: a lot of printed lines said Dropping receptor with no actives/decoys, and after that I got a ValueError: No valid stratified examples.

Could you give me some clues about this error?

Respectfully!

dkoes commented 2 years ago

You have stratification on but the identified strata do not have both actives and decoys. The dense model does not have has_rmsd set to true but the types file you are using has an rmsd column, so it is interpreting the rmsd field as the receptor. You need to set has_rmsd to true in the dense model.

CHANG-Shaole commented 2 years ago

Thanks for your useful advice. @dkoes @RMeli

I'm not very sure where to set has_rmsd to true. So, comparing with the default2017.model in crossdocked_paper, I added a line "has_rmsd: true" to dense.model, right after the "has_affinity: true" line. But after I did that, there is another error:

I1020 02:14:12.570433 388211 layer_factory.hpp:77] Creating layer data
I1020 02:14:12.570487 388211 net.cpp:85] Creating Layer data
I1020 02:14:12.570493 388211 net.cpp:385] data -> data
I1020 02:14:12.570508 388211 net.cpp:385] data -> label
I1020 02:14:12.570513 388211 net.cpp:385] data -> affinity
F1020 02:14:12.570523 388211 layer.hpp:372] Check failed: ExactNumTopBlobs() == top.size() (4 vs. 3) MolGridData Layer produces 4 top blob(s) as output.
*** Check failure stack trace: ***
Aborted (core dumped)

I think I added the has_rmsd in the wrong place. In addition, when I used the default2017.model in crossdocked_paper, it seems to train normally, except for a ValueError: Could not read /data/2020_1018/scripts-master/models-master/data/Docked2020_1018/gninatypes/ABCBA_HUMAN_151_738_0/4ayt_A_rec_0.gninatypes

So I checked the receptor folder, and it indeed doesn't have this file. Are some receptor files missing?

CHANG-Shaole commented 2 years ago

Besides, I'm also very interested in the Monte Carlo sampling method, but it is integrated into GNINA, so I couldn't learn many details. Could you give me some links or script examples of how you do the ligand/receptor sampling using MC?

The GNINA software is really powerful! Respectfully.

RMeli commented 2 years ago

I think I added the has_rmsd to the wrong place.

I think you added it in the correct place (the same place where has_affinity: true is located). However, it is not the only modification you need to make (depending on if/how you want to use the RMSD); I suggest searching for "rmsd" in one of the models that uses it.

The error

Check failed: ExactNumTopBlobs() == top.size()

is likely caused by the fact that you did not add an RMSD blob to the MolGridData layer: see this line
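For illustration, the declared top blobs have to match what the layer parameters produce; abridged, and with blob names following the usual convention:

layer {
  name: "data"
  type: "MolGridData"
  top: "data"
  top: "label"
  top: "affinity"
  top: "rmsd"      # the extra top blob required once has_rmsd: true is set
  molgrid_data_param {
    has_affinity: true
    has_rmsd: true
    ...
  }
}

As noted above, further changes may be needed depending on how the RMSD is actually used by the model.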

Besides, I'm also very interested in the Monte Carlo Chain Sampling methods.

You can find the implementation of Monte Carlo sampling in GNINA in gninasrc/lib/monte_carlo.h.

If you are just interested in the general idea, any book on molecular simulation usually has a section on Monte Carlo methods.

CHANG-Shaole commented 2 years ago

Thanks for your reply.

I think I have now learned the general idea of the Monte Carlo method, but I'm still very curious about how you use it to generate the different ligand conformations (the implementation details).

Starting today, I also tried to read the source code, including monte_carlo.h & monte_carlo.cpp. It's difficult to learn from the source code; that's why I asked about a .py script.

I want a script that takes a protein/ligand and a search space as input and outputs several complexes (the protein with different ligand conformations), i.e. one that simulates the Monte Carlo sampling process.

I mean, if you have some scripts or examples used for that task before, I could learn the details from them. If you don't, I will keep studying the source code, even though it's very time-consuming.

CHANG-Shaole commented 2 years ago

Because I think the sampling and the scoring are both important. From your scripts, I have learned how to train a CNN scoring function, but I have no idea how to do the Monte Carlo sampling process.

I think sampling and scoring should be inseparable. That's the reason I also want to learn the sampling details.

Indeed, thank you very much for the help these past weeks! Respectfully.

dkoes commented 2 years ago

There is no script for MC sampling.

CHANG-Shaole commented 2 years ago

Okay... Thank you for your reply anyway. I will try to figure it out myself.