How to accumulate Target protein's amino acid sequence (t) and drug's SMILES strings (d)

kexinhuang12345 / DeepPurpose

A Deep Learning Toolkit for DTI, Drug Property, PPI, DDI, Protein Function Prediction (Bioinformatics)

https://doi.org/10.1093/bioinformatics/btaa1005

BSD 3-Clause "New" or "Revised" License

How to accumulate Target protein's amino acid sequence (t) and drug's SMILES strings (d) #51

Closed mislam5285 closed 3 years ago

mislam5285 commented 3 years ago

I am a novice in DTI research. To work through "Tutorial_1_DTI_Prediction", I want to know how to get an array of drugs' SMILES strings (d) and an array of target proteins' amino acid sequences (t).

Suppose I have found the following using DrugBank data:

Drug ID    Target ID    Score
DB08604    P0AEK4       0.931528
DB07181    P0AEK4       0.931504
DB08642    P16184       0.931335
DB03233    P0A884       0.931334
DB07411    P0AEK4       0.931313
DB07209    P27338       0.931300
DB03072    P0AEK4       0.931230
DB02727    Q9Y296       0.931186
DB06840    Q9Y296       0.931151
DB07972    P0AEK4       0.931095
DB08700    P0AEK4       0.931029
DB07647    P0AEK4       0.931003
DB01861    P96945       0.930968
...

Questions:
1. How can I get the target protein's amino acid sequence (t) for a large number of Target IDs?
2. How can I get the drug's SMILES string (d) for a large number of Drug IDs?

hima111997 commented 3 years ago

Hi @mislam5285. For proteins, you can use this tool to get the sequences from the IDs: https://www.uniprot.org/uploadlists/ Then you can write a simple Python script to replace the IDs in the example you gave with the corresponding sequences.

For drugs, if you have downloaded the DrugBank database in SDF format, you can convert it to SMILES using OpenBabel (either the whole database, or only the drugs that correspond to your DB IDs, using a Python script). Then, using a script similar to the first one, you can replace the DB IDs with the SMILES strings.

I hope that helps
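The ID replacement step described above could be sketched roughly as follows. This is only an illustration, not code from DeepPurpose: the function names `load_lookup` and `build_arrays`, the column names, and the tab-separated layout of the lookup tables are all assumptions, and the example values are placeholders rather than real SMILES strings or sequences.

```python
import csv
import io


def load_lookup(handle, key_col, value_col):
    """Read a tab-separated table with a header row into a {key: value} dict.

    The column names are assumptions; adapt them to whatever your UniProt
    export or OpenBabel-derived SMILES table actually contains.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[key_col]: row[value_col] for row in reader}


def build_arrays(pairs, smiles_by_drug, seq_by_target):
    """Turn (drug_id, target_id, score) rows into parallel lists d, t, y.

    Rows whose drug or target ID has no entry in the lookup tables are
    collected in `skipped` instead of being silently dropped.
    """
    d, t, y, skipped = [], [], [], []
    for drug_id, target_id, score in pairs:
        if drug_id in smiles_by_drug and target_id in seq_by_target:
            d.append(smiles_by_drug[drug_id])
            t.append(seq_by_target[target_id])
            y.append(float(score))
        else:
            skipped.append((drug_id, target_id))
    return d, t, y, skipped


# Placeholder lookup tables (NOT real SMILES strings or sequences).
smiles_table = io.StringIO("drug_id\tsmiles\nDB08604\tSMILES_A\nDB07181\tSMILES_B\n")
sequence_table = io.StringIO("uniprot_id\tsequence\nP0AEK4\tSEQ_A\n")

smiles_by_drug = load_lookup(smiles_table, "drug_id", "smiles")
seq_by_target = load_lookup(sequence_table, "uniprot_id", "sequence")

pairs = [
    ("DB08604", "P0AEK4", "0.931528"),
    ("DB07181", "P0AEK4", "0.931504"),
    ("DB08642", "P16184", "0.931335"),  # target missing from the toy table
]
d, t, y, skipped = build_arrays(pairs, smiles_by_drug, seq_by_target)
```

With real files, the `io.StringIO` objects would be replaced by `open(path, newline="")` handles. Checking `skipped` before training is worthwhile, since some DrugBank entries have no small-molecule structure and some IDs may fail to map.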

mislam5285 commented 3 years ago

Thank you Sir.

kexinhuang12345 commented 3 years ago

Thanks @hima111997. In addition, you can find a Pythonic way to retrieve UniProt sequences here. However, it is relatively slow, so for a large batch of queries please use https://www.uniprot.org/uploadlists/
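The link to the Pythonic approach did not survive in this copy of the thread, so the following is not the referenced code. It is a minimal sketch of one way to fetch a single sequence per request, assuming UniProt's REST endpoint of the form `https://rest.uniprot.org/uniprotkb/<accession>.fasta`; the function names are hypothetical.

```python
import urllib.request

# Assumed URL pattern for UniProt's REST API; verify it against the current
# UniProt documentation before relying on it.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def parse_fasta(text):
    """Return the sequence of the first record in a FASTA-formatted string."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("input does not look like a FASTA record")
    sequence_lines = []
    for line in lines[1:]:
        if line.startswith(">"):  # start of a second record: stop here
            break
        sequence_lines.append(line)
    return "".join(sequence_lines)


def fetch_uniprot_sequence(accession, timeout=30):
    """Download one FASTA record (one HTTP request per accession)."""
    url = UNIPROT_FASTA_URL.format(accession=accession)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_fasta(response.read().decode("utf-8"))
```

Calling `fetch_uniprot_sequence("P0AEK4")` issues one network request per ID, which is why this style is slow for long ID lists and the batch tool is the better choice there.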

mislam5285 commented 3 years ago

Thank you Sir

xuzhang5788 commented 3 years ago

Is there a way to map protein sequences to protein names? I want to modify the KIBA dataset by adding some proteins to it, so I need to know whether the new proteins are already in the KIBA dataset. Comparing proteins by their raw sequences would be difficult.

mislam5285 commented 3 years ago

Dear Sir, I don't know how to map protein sequences to protein names for the KIBA dataset. I have a mapping for the DrugBank database that I computed for my work. I am attaching that file; it may or may not help you.

kexinhuang12345 commented 3 years ago

Hi, here is a link to the UniProt IDs and sequences of the KIBA proteins: https://github.com/hkmztrk/DeepDTA/blob/master/data/kiba/proteins.txt
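With an ID-to-sequence mapping like this, checking whether a new protein is already in the dataset can be done by exact sequence match. The sketch below assumes the linked file is a JSON object mapping UniProt IDs to sequences, which should be verified before use; the function names are hypothetical and the example entries are toy placeholders, not real KIBA data.

```python
import json


def load_proteins(path):
    """Load an {uniprot_id: sequence} mapping, assuming the file is JSON."""
    with open(path) as handle:
        return json.load(handle)


def index_by_sequence(proteins):
    """Invert {uniprot_id: sequence} into {sequence: [uniprot_id, ...]}.

    Sequences are normalised (whitespace stripped, upper-cased) so that
    formatting differences do not hide a match.
    """
    index = {}
    for uniprot_id, sequence in proteins.items():
        index.setdefault(sequence.strip().upper(), []).append(uniprot_id)
    return index


def find_existing(new_sequence, index):
    """Return the IDs already in the dataset with exactly this sequence."""
    return index.get(new_sequence.strip().upper(), [])


# Toy mapping with placeholder IDs and sequences (not real KIBA entries).
toy_proteins = {"ID_1": "MKTAAA", "ID_2": "MSSLLL", "ID_3": "MKTAAA"}
toy_index = index_by_sequence(toy_proteins)

already_present = find_existing("mktaaa\n", toy_index)
not_present = find_existing("MQQQQQ", toy_index)
```

This only detects identical sequences. Isoforms or variants of the same protein would need an ID-based comparison or a sequence alignment instead.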