wangmeng-code / CmhAttCPI

Inquiry about bindingdb dataset and models #1

Open mmagithub opened 7 months ago

mmagithub commented 7 months ago

Hello, nice repo! I am wondering if you can share the pre-trained models trained on the full BindingDB so that we can use them directly for prediction. I am also wondering whether you did any curation/exclusion of the BindingDB records. The original number of records in BindingDB (2.8M) is significantly larger than the number of records you report in the manuscript (~62k).

https://www.bindingdb.org/rwd/bind/index.jsp

Looking forward to your reply,

Thanks, Marawan

mmagithub commented 7 months ago

It seems you may have used the classification subset of BindingDB curated by Gao K, et al. (2018), "Interpretable drug target prediction using deep neural representation", pp. 3371–3377, https://doi.org/10.24963/ijcai.2018/468. Please confirm whether this is the case, and whether you have tested the method on regression problems as well.

wangmeng-code commented 7 months ago

Dear Marawan:

Thank you for your interest in the manuscript "A bidirectional interpretable compound-protein interaction prediction framework based on cross attention". You can access the CmhAttCPI model pre-trained on the BindingDB dataset we curated from the BindingDB database at https://github.com/wangmeng-code/wangmeng, and you can use it directly for predictions on your dataset.

Regarding the dataset collection: indeed, the original database is larger than the dataset we curated, because we organized and cleaned the raw data. We specifically chose human-related compound-protein interactions (CPIs), and labeled records with Kd values greater than 10,000 nM as negative samples and those less than 10,000 nM as positive samples. Due to the large volume of data, we used our workstation to process it. However, our workstation is currently undergoing maintenance, and access to the relevant records is unavailable.
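
[Editor's note: as an illustration only, here is a minimal pandas sketch of the labeling rule described above. The column names "Kd (nM)" and "Target Source Organism" are assumptions about the BindingDB TSV export, not the authors' actual pipeline.]

```python
# Hypothetical sketch of the curation rule above; not the authors' code.
import pandas as pd

# BindingDB ships a large TSV export; column names here are assumptions.
df = pd.read_csv("BindingDB_All.tsv", sep="\t", low_memory=False)

# Keep human-related compound-protein interactions only.
df = df[df["Target Source Organism"] == "Homo sapiens"]

# Kd values may carry qualifiers such as ">" or "<"; coercion drops those rows.
df["Kd (nM)"] = pd.to_numeric(df["Kd (nM)"], errors="coerce")
df = df.dropna(subset=["Kd (nM)"])

# Label: Kd < 10,000 nM -> positive (1), Kd > 10,000 nM -> negative (0).
# (Records at exactly 10,000 nM are ambiguous under the stated rule.)
df = df[df["Kd (nM)"] != 10000]
df["label"] = (df["Kd (nM)"] < 10000).astype(int)
```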

Best Regards


wangmeng-code commented 7 months ago

Dear Marawan:

The BindingDB dataset in "A bidirectional interpretable compound-protein interaction prediction framework based on cross attention" was curated by us from the BindingDB database. We specifically chose human-related compound-protein interactions (CPIs), and labeled records with Kd values greater than 10,000 nM as negative samples and those less than 10,000 nM as positive samples. We did not use the dataset curated by Gao K, et al. (2018), https://doi.org/10.24963/ijcai.2018/468.

Regards


mmagithub commented 7 months ago

Thanks so much for the answer and for providing the model.

mmagithub commented 7 months ago

A few more questions if you do not mind:

1 - When training the model, did you perform any preprocessing on the starting SMILES strings, such as uncharging, standardization, or canonicalization, or did you use the SMILES fields from BindingDB as-is?

2 - For the target protein sequences, did you use the raw sequences from UniProt, or did you select specific amino acid sequences corresponding to particular domains of interest, such as the catalytic domain of a kinase?

3 - Is it possible to fine-tune ChemAttN on other datasets?

wangmeng-code commented 7 months ago

Dear Marawan:

Here are our answers to your questions:

  1. We did not preprocess the SMILES strings; we used the original strings from the BindingDB database for model training.
  2. Similarly, we used the original protein sequences from the UniProt database.
  3. You can fine-tune the model according to your specific task or the size of your dataset; a sketch of one possible approach follows below.
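
[Editor's note: as a rough illustration of point 3, here is a minimal fine-tuning sketch. The class name, constructor, checkpoint filename, and forward signature are assumptions, not the repo's exact API.]

```python
# Hypothetical fine-tuning loop; the names below are assumptions about the repo.
import torch
from model import CmhAttCPI  # assumed class name in model.py

model = CmhAttCPI()  # build with the same config used for pre-training
state = torch.load("pretrained_bindingdb.pt", map_location="cpu")
model.load_state_dict(state, strict=False)  # strict=False tolerates vocab-size mismatches

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

model.train()
for compounds, proteins, labels in train_loader:  # your own DataLoader
    optimizer.zero_grad()
    logits = model(compounds, proteins)  # assumed forward signature
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
```

A smaller learning rate than the one used for pre-training is usually a safe starting point when fine-tuning on a new dataset.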

Regards,


mmagithub commented 7 months ago

Thanks for your answers. I have tried the model you shared, and I needed to change this line in the model.py script:

```python
# from:
self.pro_embed = nn.Embedding(self.N_word, self.embed_dim)
# to:
self.pro_embed = nn.Embedding(num_embeddings=8313, embedding_dim=128)
```

Possibly the settings in the GitHub repo are different from what you used for training.

wangmeng-code commented 7 months ago

self.N_word needs to be modified because, during embedding, the model requires the total size of the "word dictionary" constructed from the training dataset together with your dataset. For more details, please refer to preprocessing_data.py.
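
[Editor's note: to illustrate why the sizes must match: nn.Embedding indexes a (num_embeddings x embed_dim) lookup table, so every token id produced during preprocessing must be smaller than num_embeddings. The dictionary-building detail below is an assumption about preprocessing_data.py, not its exact code.]

```python
# Hypothetical sketch of the word-dictionary/embedding-size relationship.
import torch
import torch.nn as nn

word_dict = {}  # n-gram -> integer id, built over the training data AND your data

def word_id(ngram):
    if ngram not in word_dict:
        word_dict[ngram] = len(word_dict)
    return word_dict[ngram]

# Encode all sequences first...
ids = torch.tensor([word_id("MKV"), word_id("KVL"), word_id("VLA")])

# ...then size the embedding table to the full dictionary:
N_word = len(word_dict)  # e.g. 8313 in the case above
pro_embed = nn.Embedding(num_embeddings=N_word, embedding_dim=128)
vectors = pro_embed(ids)  # shape: (3, 128)
```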
