Question regarding to DAVIS dataset

pykao commented 3 years ago

Hi Kexin,

The DAVIS dataset has 68 drugs, 379 proteins, and 30,056 interactions. That looks weird to me. If there is only one interaction between each drug and each protein, the maximum number of interactions would be 68 x 379 = 25,772. How can we have more than 25,772 interactions?

Best, Po-Yu

kexinhuang12345 commented 3 years ago

Hi Ken, interesting. I will dig into it and get back to you (probably over the weekend or early next week; I am catching up on a school deadline). In the meantime, check out the raw data source: http://staff.cs.utu.fi/~aatapa/data/DrugTarget/

pykao commented 3 years ago

Any update on this issue?

kexinhuang12345 commented 3 years ago

Hi, sorry for the late update. I just checked the original data: it has 68 x 442 = 30,056 entries. However, the original data only provides gene names, so we used the processed data from the DeepDTA paper: https://github.com/hkmztrk/DeepDTA/blob/master/data/davis/proteins.txt. It seems that, because of how that data was processed, different target names can map to the same amino acid sequence. That is why unique(target_seq) returns only 379 sequences.

I then checked the repeated sequences. They do come from the same gene but with different biomarkers (e.g. ABL1(F317I) and ABL1(F317I)p). Since the amino acid sequence is the same for each of them, they are all included in the dataset as separate entries. This means the assay values are still valid; there are just several repeated protein sequences.
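To make the counting concrete, here is a minimal sketch of how distinct target names that share one sequence produce more rows than drugs x unique sequences. The records, the placeholder sequences, and the helper `names_per_sequence` are made up for illustration; they are not the actual DAVIS data or DeepPurpose code.

```python
from collections import defaultdict

# Toy records: (drug_id, target_name, target_sequence, affinity).
# Sequences and affinities are placeholders, not real DAVIS values.
records = [
    ("D1", "ABL1(F317I)", "MKVLAA", 5.0),
    ("D1", "ABL1(F317I)p", "MKVLAA", 5.3),
    ("D1", "AAK1", "MSTQRR", 7.1),
    ("D2", "ABL1(F317I)", "MKVLAA", 6.2),
    ("D2", "ABL1(F317I)p", "MKVLAA", 6.0),
    ("D2", "AAK1", "MSTQRR", 8.4),
]


def names_per_sequence(rows):
    """Map each sequence to the set of target names that use it."""
    groups = defaultdict(set)
    for _drug, name, seq, _affinity in rows:
        groups[seq].add(name)
    return groups


groups = names_per_sequence(records)
n_drugs = len({r[0] for r in records})
n_names = len({r[1] for r in records})
n_seqs = len(groups)

# Sequences that appear under more than one target name.
shared = {seq: sorted(names) for seq, names in groups.items() if len(names) > 1}

print("rows:", len(records))
print("drugs x target names:", n_drugs * n_names)
print("drugs x unique sequences:", n_drugs * n_seqs)
print("shared sequences:", shared)
```

On the real data the same logic gives 68 x 442 = 30,056 rows, while 68 x 379 = 25,772 is what you get if you count unique sequences instead of target names.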

One idea would be to evaluate in a cold-protein setting for robustness, in addition to the random split, which is still valid.
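As a rough sketch of what a cold-protein split could look like: hold out whole sequences, so that repeated sequences never end up on both sides of the split. The function `cold_protein_split`, the row layout, and the toy data are assumptions for illustration only, not the DeepPurpose implementation.

```python
import random


def cold_protein_split(rows, test_frac=0.2, seed=0):
    """Split (drug, sequence, affinity) rows so that no sequence is in both sets."""
    # Split on unique sequences, not on rows, so duplicates stay together.
    seqs = sorted({seq for _drug, seq, _affinity in rows})
    rng = random.Random(seed)
    rng.shuffle(seqs)
    n_test = max(1, round(test_frac * len(seqs)))
    test_seqs = set(seqs[:n_test])
    train = [r for r in rows if r[1] not in test_seqs]
    test = [r for r in rows if r[1] in test_seqs]
    return train, test


# Toy data: 3 drugs x 5 placeholder sequences.
rows = [(f"D{d}", f"SEQ{p}", float(d + p)) for d in range(3) for p in range(5)]
train, test = cold_protein_split(rows, test_frac=0.4, seed=0)

train_seqs = {r[1] for r in train}
test_seqs = {r[1] for r in test}
print(len(train), len(test), train_seqs & test_seqs)
```

Because the split is made over unique sequences, two target names with the same sequence always land in the same partition, which a purely random row-level split does not guarantee.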

pykao commented 3 years ago

Thank you for the clarification.

kexinhuang12345 / DeepPurpose

Question regarding to DAVIS dataset #59