hkmztrk / DeepDTA


What is the use of two similarity files in the data? #7

Closed zhouhao-learning closed 5 years ago

zhouhao-learning commented 5 years ago

Hello, I read your dataset introduction and ran your code. First, I want to confirm my understanding: your method takes the SMILES string and the protein sequence as input, so it does not use drug structure information or protein structure information, right?

In addition, I did not see kiba_drug_sim.txt and kiba_target_sim.txt used anywhere in the code. Does that mean these two files are redundant?

Maybe I don't fully understand your method, so any guidance would be appreciated. Thank you!

hkmztrk commented 5 years ago

Hello @zhouhao-learning, yes, our method only uses sequence input.

Those similarity files were used in two additional tests that are explained in the paper; please refer to Tables 3-4. You'll see that while a CNN was used to build the representation for one component (e.g. SMILES), we used the similarity matrix (e.g. kiba_target_sim) for the other. With this, we wanted to understand how much information the CNN brings in.
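To illustrate the idea of representing one component by its similarity profile, here is a minimal sketch. It assumes the similarity matrix is square, with row i holding the similarities of target i to every target; the function name `similarity_features` and the toy matrix are hypothetical and are not taken from the DeepDTA code or data files.

```python
# Toy 3x3 target-target similarity matrix (made-up values for illustration).
target_sim = [
    [1.0, 0.4, 0.1],
    [0.4, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]


def similarity_features(index, sim_matrix):
    """Return the similarity row of one item as its fixed-length feature vector."""
    return list(sim_matrix[index])


# Target 1 is described by its similarity to all targets, instead of by a
# representation learned from its raw sequence.
features = similarity_features(1, target_sim)
print(features)
```

In such a setup the feature vector length equals the number of targets in the dataset, so the representation is tied to that dataset rather than learned from the sequence.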

In the source code, these correspond to the build_single_prot and build_single_drug functions, but it seems I didn't include the code that is required to run them. I can update the code if it would be useful for you.

Best.

zhouhao-learning commented 5 years ago

@hkmztrk OK, thank you very much for your reply. I think it would be helpful if you could update this part of the code so that I can try it. In addition, does the kiba_target_sim.txt or kiba_drug_sim.txt file use structural information about proteins or drugs? If I have some SMILES and proteins, is there any way to convert them to a similarity matrix? Any advice would be appreciated. Thank you very much! Best wishes!

hkmztrk commented 5 years ago

@zhouhao-learning for target similarity, the Smith-Waterman algorithm is used; for drug similarity, PubChem structure similarity is used. Please refer to the DeepDTA article for more detail.
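For readers who want to build a target similarity matrix from their own protein sequences, the following is a minimal sketch of a Smith-Waterman local alignment score. It makes several simplifying assumptions that are not confirmed by this thread: a fixed match/mismatch score instead of a substitution matrix such as BLOSUM, a linear gap penalty, and normalization by the self-alignment scores so that values fall in [0, 1]. The function names are hypothetical, and sequences are assumed to be non-empty.

```python
import math


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def normalized_similarity(a, b):
    """Scale the raw score by the two self-alignment scores (assumed normalization)."""
    denom = math.sqrt(smith_waterman_score(a, a) * smith_waterman_score(b, b))
    return smith_waterman_score(a, b) / denom


def similarity_matrix(sequences):
    """Pairwise normalized similarities for a list of non-empty sequences."""
    return [[normalized_similarity(x, y) for y in sequences] for x in sequences]


proteins = ["MKTAYI", "MKT", "GGGG"]
for row in similarity_matrix(proteins):
    print(["%.3f" % value for value in row])
```

This pure-Python version runs in time proportional to the product of the two sequence lengths, so for full-length proteins an optimized alignment library would be the practical choice.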

I'll try to update the code in a few days.

Best!

zhouhao-learning commented 5 years ago

@hkmztrk OK, thank you for your reply. I have one last question:

CHARPROTSET = { "A": 1, "C": 2, "B": 3, "E": 4, "D": 5, "G": 6, 
                "F": 7, "I": 8, "H": 9, "K": 10, "M": 11, "L": 12, 
                "O": 13, "N": 14, "Q": 15, "P": 16, "S": 17, "R": 18, 
                "U": 19, "T": 20, "W": 21, 
                "V": 22, "Y": 23, "X": 24, 
                "Z": 25 }

CHARCANSMISET = { "#": 1, "%": 2, ")": 3, "(": 4, "+": 5, "-": 6, 
             ".": 7, "1": 8, "0": 9, "3": 10, "2": 11, "5": 12, 
             "4": 13, "7": 14, "6": 15, "9": 16, "8": 17, "=": 18, 
             "A": 19, "C": 20, "B": 21, "E": 22, "D": 23, "G": 24,
             "F": 25, "I": 26, "H": 27, "K": 28, "M": 29, "L": 30, 
             "O": 31, "N": 32, "P": 33, "S": 34, "R": 35, "U": 36, 
             "T": 37, "W": 38, "V": 39, "Y": 40, "[": 41, "Z": 42, 
             "]": 43, "_": 44, "a": 45, "c": 46, "b": 47, "e": 48, 
             "d": 49, "g": 50, "f": 51, "i": 52, "h": 53, "m": 54, 
             "l": 55, "o": 56, "n": 57, "s": 58, "r": 59, "u": 60,
             "t": 61, "y": 62}

CHARISOSMISET = {"#": 29, "%": 30, ")": 31, "(": 1, "+": 32, "-": 33, "/": 34, ".": 2, 
                "1": 35, "0": 3, "3": 36, "2": 4, "5": 37, "4": 5, "7": 38, "6": 6, 
                "9": 39, "8": 7, "=": 40, "A": 41, "@": 8, "C": 42, "B": 9, "E": 43, 
                "D": 10, "G": 44, "F": 11, "I": 45, "H": 12, "K": 46, "M": 47, "L": 13, 
                "O": 48, "N": 14, "P": 15, "S": 49, "R": 16, "U": 50, "T": 17, "W": 51, 
                "V": 18, "Y": 52, "[": 53, "Z": 19, "]": 54, "\\": 20, "a": 55, "c": 56, 
                "b": 21, "e": 57, "d": 22, "g": 58, "f": 23, "i": 59, "h": 24, "m": 60, 
                "l": 25, "o": 61, "n": 26, "s": 62, "r": 27, "u": 63, "t": 28, "y": 64}

What determines the index values assigned to these symbols? I see that they do not follow alphabetical order. Is there a rule behind them?

hkmztrk commented 5 years ago

Hello @zhouhao-learning, sorry I couldn't find the time to update the code; the last few weeks have been hectic.

As for your question, no, actually: a numerical ID is assigned every time a new character is detected in the corpus.
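The ID assignment described here can be sketched as follows. This is an illustrative reconstruction, not the script that produced the dictionaries quoted in this thread: the function names `build_char_index` and `encode_sequence` are hypothetical, and reserving 0 for padding is an assumption.

```python
def build_char_index(corpus):
    """Give each character the next free integer ID the first time it appears."""
    char_to_id = {}
    for sequence in corpus:
        for ch in sequence:
            if ch not in char_to_id:
                # IDs start at 1 so that 0 can serve as the padding value (assumption).
                char_to_id[ch] = len(char_to_id) + 1
    return char_to_id


def encode_sequence(sequence, max_len, char_to_id):
    """Map a sequence to a fixed-length list of IDs, truncating or zero-padding."""
    ids = [0] * max_len
    for i, ch in enumerate(sequence[:max_len]):
        ids[i] = char_to_id[ch]
    return ids


smiles_corpus = ["CCO", "CN"]
vocab = build_char_index(smiles_corpus)
print(vocab)
print(encode_sequence("CNO", 5, vocab))
```

Because the IDs only act as lookup keys, their particular values carry no chemical or biological meaning; what matters is that the same mapping is used consistently for training and prediction.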

zhouhao-learning commented 5 years ago

@hkmztrk What determines the index values assigned to these symbols? I see that they do not follow alphabetical order. Is there a rule behind them?

hkmztrk commented 5 years ago

@zhouhao-learning,

No, there is no particular correspondence behind these numerical ID assignments; they are random.

zhouhao-learning commented 5 years ago

@hkmztrk OK! Thank you.