about Preprocessing of data

anny0316 commented 1 year ago

Hello, I have a question. Running get_fragment_vocab.py already saves the fragment vocabulary, so why does get_training_data.py extract the fragments again, align them with the previously saved fragments, and compute a rotation matrix? Thank you very much.

longlongman commented 1 year ago

Running get_fragment_vocab.py gives us the fragment vocabulary. Because we want to use the vocabulary to rebuild a 3D molecule, we align each molecule with the saved fragments. Specifically, a molecule is first cut into fragments in the same way as when building the vocabulary. We then look up each molecule fragment in the vocabulary to get its vocabulary index. For each molecule fragment, we retrieve the saved fragment at that index and align it to the molecule fragment, which gives the corresponding translation vector and rotation matrix. Applying that transformation to the saved fragment rebuilds the corresponding molecule fragment.
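To make the alignment step concrete, here is a minimal sketch of recovering a rotation matrix and translation vector between a saved fragment and the same fragment as it appears in a molecule. It uses the Kabsch algorithm with NumPy, which is an assumption on my part: the reply does not say which alignment method get_training_data.py uses. The function name `kabsch_align` and the coordinates are hypothetical, and the sketch assumes both fragments list the same atoms in the same order.

```python
import numpy as np


def kabsch_align(saved, target):
    """Return (R, t) such that saved @ R.T + t approximates target.

    Both inputs are (n_atoms, 3) arrays with atoms in matching order.
    """
    saved = np.asarray(saved, dtype=float)
    target = np.asarray(target, dtype=float)

    # Center both point sets on their centroids.
    saved_centroid = saved.mean(axis=0)
    target_centroid = target.mean(axis=0)
    p = saved - saved_centroid
    q = target - target_centroid

    # SVD of the 3x3 covariance matrix gives the optimal rotation.
    u, _, vt = np.linalg.svd(p.T @ q)

    # Flip the last axis if needed so the result is a proper rotation
    # (determinant +1) rather than a reflection.
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T

    translation = target_centroid - rotation @ saved_centroid
    return rotation, translation


# Hypothetical saved fragment (as it might be stored in the vocabulary).
saved_fragment = np.array([
    [0.0, 0.0, 0.0],
    [1.5, 0.0, 0.0],
    [0.0, 1.2, 0.0],
    [0.3, 0.4, 1.1],
])

# The same fragment as found in a molecule: rotated 90 degrees about z
# and shifted. These values are made up for illustration.
true_rotation = np.array([
    [0.0, -1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
])
true_translation = np.array([2.0, -1.0, 0.5])
molecule_fragment = saved_fragment @ true_rotation.T + true_translation

R, t = kabsch_align(saved_fragment, molecule_fragment)

# Applying the transformation to the saved fragment rebuilds the
# molecule fragment.
rebuilt = saved_fragment @ R.T + t
```

Storing only the vocabulary index plus `R` and `t` is enough to reproduce the fragment's pose in the molecule, which is the point of the alignment described above.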

anny0316 commented 1 year ago

Thank you, I get it.

anny0316 commented 1 year ago

Hello, when I used CAVITY to detect pockets for a given protein, I found that it reports many pockets per protein. In this experiment, do all pockets detected for a protein need to be considered? Thank you very much.

longlongman commented 1 year ago

For each target protein, we only use the cavity with the best druggability score (the score is also provided by CAVITY).
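As a small illustration of that selection rule, the sketch below picks one cavity per protein from a mapping of cavity identifiers to scores. It assumes the scores have already been read from CAVITY's output and that a higher score means better druggability. The function name `pick_best_cavity`, the identifiers, and the numbers are all hypothetical.

```python
def pick_best_cavity(scores):
    """Return the cavity id with the highest score.

    `scores` maps a cavity identifier to its druggability score.
    Assumption: a higher score means a more druggable cavity.
    """
    if not scores:
        raise ValueError("no cavities were detected for this protein")
    return max(scores, key=scores.get)


# Made-up scores for three cavities detected on one protein.
example_scores = {"cavity_1": 0.5, "cavity_2": 2.0, "cavity_3": -1.0}
best = pick_best_cavity(example_scores)
```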

anny0316 commented 1 year ago

OK, many thanks to Siyu.

anny0316 commented 1 year ago

Hello Siyu,

I would like to know the difference between “thischains_vacant_xx.pdb” and “thischains_cavity_xx.pdb”, which CAVITY generates as the cavity PDB files. In our experiment, I think “thischains_cavity_xx.pdb” should be used, shouldn't it?
In sketching.py, "sample_n_o_f" for the protein is computed from the selected features "for xyz in feature_dict[(7.0,)] + feature_dict[(8.0,)] + feature_dict[(9.0,)]". Why not choose other features, such as feature_dict[(6.0,)]?

I'm looking forward to your reply. Thank you.

longlongman commented 1 year ago

Q: What is the difference between “thischains_vacant_xx.pdb” and “thischains_cavity_xx.pdb”? A: “thischains_vacant_xx.pdb” gives us the cavity in volume form (including the surface). “thischains_cavity_xx.pdb” gives us only the surface of the cavity.

Q: “thischains_cavity_xx.pdb” should be used, shouldn't it? A: No, we use “thischains_vacant_xx.pdb”, because we need to calculate the volume of sampled molecular shapes.
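To show why a volume-form file is convenient for volume calculations, here is a rough sketch that reads point coordinates from PDB-style records and estimates a volume by counting points. It assumes each ATOM or HETATM record in the file is one point on a regular grid and that the grid spacing is known. Neither assumption is confirmed in this thread, and the helper names and the spacing value are hypothetical. The coordinate columns (31-38, 39-46, 47-54) follow the standard PDB format.

```python
def read_pdb_points(pdb_text):
    """Collect (x, y, z) tuples from ATOM/HETATM records of PDB text."""
    points = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            # Standard PDB columns for x, y, z (1-based 31-38, 39-46, 47-54).
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
            points.append((x, y, z))
    return points


def grid_volume(points, spacing):
    """Estimate volume as (number of distinct grid points) * spacing**3.

    Assumption: every point occupies one cubic grid cell of edge `spacing`.
    """
    return len(set(points)) * spacing ** 3


def make_record(x, y, z):
    """Build a minimal PDB-style line with coordinates in the right columns."""
    return "HETATM".ljust(30) + "%8.3f%8.3f%8.3f" % (x, y, z)


# Eight made-up grid points forming a 2 x 2 x 2 block with 0.5 spacing.
example_text = "\n".join(
    make_record(0.5 * i, 0.5 * j, 0.5 * k)
    for i in range(2) for j in range(2) for k in range(2)
)
example_points = read_pdb_points(example_text)
example_volume = grid_volume(example_points, spacing=0.5)
```

A surface-only file would leave the interior points out, so the same count would underestimate the volume.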

Q: Why not choose another feature? A: As mentioned in Appendix 2.2 Chemical Information Driven Design, we also explored the potential of integrating chemical information about proteins into drug design. Briefly, based on hydrogen bond acceptor-donor rules, we put the fragment with more hydrogen atoms into the pocket region with more nitrogen, oxygen, and fluorine atoms (feature_dict[(7.0,)], feature_dict[(8.0,)], feature_dict[(9.0,)]).
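The selection can be illustrated with a toy dictionary. The sketch below assumes that `feature_dict` maps one-element tuples holding an atomic number (7 for nitrogen, 8 for oxygen, 9 for fluorine) to lists of xyz coordinates, which matches the expression quoted from sketching.py. The helper names, the radius, and the data are hypothetical and are not taken from the DESERT code.

```python
# Keys for nitrogen, oxygen, and fluorine, written as in sketching.py.
N_O_F_KEYS = [(7.0,), (8.0,), (9.0,)]


def collect_n_o_f(feature_dict):
    """Gather the coordinates of all N, O, and F atoms."""
    coords = []
    for key in N_O_F_KEYS:
        # Missing keys are treated as "no atoms of that element".
        coords.extend(feature_dict.get(key, []))
    return coords


def count_within(coords, center, radius):
    """Count coordinates no farther than `radius` from `center`."""
    cx, cy, cz = center
    limit = radius * radius
    return sum(
        1
        for (x, y, z) in coords
        if (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 <= limit
    )


# Toy protein features: one carbon, one nitrogen, one oxygen, one fluorine.
toy_feature_dict = {
    (6.0,): [(0.0, 0.0, 0.0)],
    (7.0,): [(1.0, 0.0, 0.0)],
    (8.0,): [(0.0, 2.0, 0.0)],
    (9.0,): [(5.0, 5.0, 5.0)],
}

n_o_f_coords = collect_n_o_f(toy_feature_dict)
# Number of N/O/F atoms near a hypothetical pocket point.
nearby = count_within(n_o_f_coords, center=(0.0, 0.0, 0.0), radius=2.5)
```

Carbon, feature_dict[(6.0,)], is left out because it does not take part in the acceptor-donor rule described above.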

You can also check out the issue https://github.com/longlongman/DESERT/issues/2, where I have already answered some common questions.

anny0316 commented 1 year ago

Hello Siyu, thank you very much.

longlongman / DESERT
