Instructions on how to make the pre-made files in the directory "data"

lishuya17 / MONN

MONN: a Multi-Objective Neural Network for Predicting Pairwise Non-Covalent Interactions and Binding Affinities between Compounds and Proteins


Instructions on how to make the pre-made files in the directory "data" #2

Closed arwhirang closed 4 years ago

arwhirang commented 4 years ago

Hello

In preprocessing_and_clustering.py, I can see that the pre-made files in the "data" directory are used. However, some of these files seem to have been generated from different PDB data:

Traceback (most recent call last):
  File "preprocessing_and_clustering.py", line 322, in <module>
    idx_list = [ori_protein_list.index(pid) for pid in protein_list]
  File "preprocessing_and_clustering.py", line 322, in <listcomp>
    idx_list = [ori_protein_list.index(pid) for pid in protein_list]
ValueError: 'O85638' is not in list
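The error above is what list.index raises when an ID in protein_list is absent from ori_protein_list. A minimal diagnostic sketch for listing every missing ID at once, rather than stopping at the first, might look like the following. The variable names come from the traceback; the helper find_missing_ids and all sample IDs except 'O85638' are hypothetical placeholders, and how the two lists are actually loaded is not shown here.

```python
def find_missing_ids(ori_protein_list, protein_list):
    """Return the IDs in protein_list that do not occur in ori_protein_list."""
    known = set(ori_protein_list)  # set lookup avoids repeated linear scans
    return [pid for pid in protein_list if pid not in known]


# Placeholder data for illustration only; real lists come from the dataset files.
ori_protein_list = ["P12345", "Q9Y6K9"]
protein_list = ["P12345", "O85638"]

missing = find_missing_ids(ori_protein_list, protein_list)
print(missing)
```

An empty result would mean that every protein in the newly built dataset is covered by the pre-made protein list.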

Can you explain how to make the pre-made files? There are five files in the "data" directory:

mol_dict
out7_final_pairwise_interaction_dict
pdbbind_all_datafile.tsv
pdbbind_protein_list.npy
pdbbind_protein_sim_mat.npy

Thank you for the great work.

lishuya17 commented 4 years ago

Please refer to "/create_dataset/Dataset_construction_protocol.txt" for detailed dataset construction instructions. I hope this helps solve your problem.

arwhirang commented 4 years ago

I am currently looking into the "Dataset_construction_protocol.txt" file. I found that the following three files are generated by the dataset construction protocol:

mol_dict (this can be newly generated in the pre-processing)
out7_final_pairwise_interaction_dict
pdbbind_all_datafile.tsv

However, I still require the following two files:

pdbbind_protein_list.npy
pdbbind_protein_sim_mat.npy

Can you help me?

lishuya17 commented 4 years ago

The protein similarity matrix can be calculated from .fasta files containing the protein sequences. Suppose we have n sequences. First, calculate the sequence alignment scores for all n×n pairs using /create_dataset/smith-waterman-src/pyssw.py (python pyssw.py -p seqs.fasta seqs.fasta > output.txt). Then, normalize the alignment scores to obtain the n×n similarity matrix: sim[i,j] = sw(i,j)/sqrt(sw(i,i)*sw(j,j)), where i=1,...,n, j=1,...,n, and sw(i,j) is the optimal alignment score between sequences i and j. The protein list records the UniProt IDs of the n proteins, in the same order as the rows/columns of the similarity matrix.
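The normalization step can be sketched in NumPy as follows. This is only an illustration under assumptions: it starts from an n×n array of raw Smith-Waterman scores that has already been assembled (parsing the pyssw.py output is not shown, since its format is not described here), and the function name normalize_sw_scores, the toy scores, and the placeholder IDs are all hypothetical.

```python
import numpy as np


def normalize_sw_scores(sw):
    """Apply sim[i, j] = sw[i, j] / sqrt(sw[i, i] * sw[j, j]) to a square score matrix."""
    sw = np.asarray(sw, dtype=float)
    if sw.ndim != 2 or sw.shape[0] != sw.shape[1]:
        raise ValueError("expected a square n x n score matrix")
    self_scores = np.diag(sw)  # sw(i, i): each sequence aligned with itself
    if np.any(self_scores <= 0):
        raise ValueError("self-alignment scores must be positive")
    # The outer product gives sw(i, i) * sw(j, j) for every pair (i, j).
    return sw / np.sqrt(np.outer(self_scores, self_scores))


# Toy scores for three sequences (made-up numbers, not real alignment output).
raw_scores = np.array([
    [100.0, 30.0, 10.0],
    [30.0, 64.0, 16.0],
    [10.0, 16.0, 25.0],
])
protein_ids = ["ID_A", "ID_B", "ID_C"]  # placeholders, in the same order as the rows

sim = normalize_sw_scores(raw_scores)
print(sim)
```

With this normalization every diagonal entry equals 1, and a symmetric score matrix yields a symmetric similarity matrix. The resulting array and the ID list could then be written with np.save, although the exact layout of the original pdbbind_protein_sim_mat.npy and pdbbind_protein_list.npy files is not confirmed in this thread.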

arwhirang commented 4 years ago

Ok, thank you for the answer. I will try as you suggested.