Dataset interpretation - Githubissues

hkmztrk commented 5 years ago

Hello,

Is there a file that maps the following data into the corresponding IDs?

drug,target,value,fold
1,1,3.799999907,1

For instance, what are the IDs for drug 1 and target 1?

Thank you

simonfqy commented 5 years ago

The drugs do not have IDs; all we need from them is their SMILES representation, which can be found in restructured.csv in each of the dataset folders, such as /davis_data/. The key for each target protein is the pair proteinName and protein_dataset (both also appear in the restructured.csv files). Each (proteinName, protein_dataset) tuple corresponds to a record in the prot_desc.csv file in each of the dataset folders, from which you can retrieve the protein sequence and PSC descriptor. The fold divisions are not stored in the csv files. You can refer to https://github.com/simonfqy/PADME/blob/0d38d30e4f3b14002f29841dd228f28519a2583c/dcCustom/splits/splitters.py#L72 to see how the split is done and stored.
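To illustrate the lookup described above, here is a minimal sketch that matches each row of a restructured.csv file to its record in prot_desc.csv through the (proteinName, protein_dataset) key. It is not taken from the PADME code. The key column names come from this thread, but the `smiles` and `sequence` column headers, the helper names, and the toy data are assumptions made for illustration, so check them against the real files first.

```python
import csv
import io


def load_protein_table(prot_desc_file):
    """Index protein records by their (proteinName, protein_dataset) key."""
    reader = csv.DictReader(prot_desc_file)
    return {(r["proteinName"], r["protein_dataset"]): r for r in reader}


def join_interactions(restructured_file, proteins):
    """Pair every interaction row with its protein record (None if absent)."""
    joined = []
    for row in csv.DictReader(restructured_file):
        key = (row["proteinName"], row["protein_dataset"])
        # "smiles" is an assumed column header, not confirmed by the thread.
        joined.append((row["smiles"], key, proteins.get(key)))
    return joined


# Toy stand-ins for prot_desc.csv and restructured.csv (hypothetical content).
PROT_DESC = (
    "proteinName,protein_dataset,sequence\n"
    "P_A,toy_set,MKT\n"
    "P_B,toy_set,MLV\n"
)
RESTRUCTURED = (
    "proteinName,protein_dataset,smiles\n"
    "P_B,toy_set,CCO\n"
    "P_C,toy_set,CCN\n"
)

proteins = load_protein_table(io.StringIO(PROT_DESC))
joined = join_interactions(io.StringIO(RESTRUCTURED), proteins)
for smiles, key, record in joined:
    print(smiles, key, record["sequence"] if record else "no match")
```

With the real files, the `io.StringIO` objects would be replaced by file handles opened with `open(path, newline="")`. Rows whose key is missing from prot_desc.csv are kept with `None` so that mismatches stay visible instead of being dropped silently.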

hkmztrk commented 5 years ago

Thank you. I actually want to test my model on your train/test split, so I need at least the protein IDs to extract extra information. Can we say that the entries in the prot_desc.csv file correspond to the target numbers in the previous example? For example, does O15530 correspond to target 1?

simonfqy commented 5 years ago

Yes, you can encode it that way. Note that the key for each protein is a (proteinName, protein_dataset) tuple. Currently my splitting code in /splits/splitters.py is very long and spaghetti-like. I will refactor it over the weekend to make it much more manageable and less error-prone. You're welcome to start right away without waiting for my update, though.
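One possible way to do this encoding is sketched below: each distinct (proteinName, protein_dataset) tuple gets an integer in order of first appearance. The function name, the 1-based start, and the toy rows are assumptions for illustration. Whether the resulting numbers line up with the target column of an existing fold file depends on the row order of the real prot_desc.csv, so that should be verified rather than taken from this sketch.

```python
import csv
import io


def build_protein_ids(prot_desc_file, start=1):
    """Assign consecutive integer IDs to (proteinName, protein_dataset) keys.

    IDs follow the order in which keys first appear; repeated keys keep
    the ID they were given the first time.
    """
    ids = {}
    for row in csv.DictReader(prot_desc_file):
        key = (row["proteinName"], row["protein_dataset"])
        ids.setdefault(key, start + len(ids))
    return ids


# Hypothetical rows; the same protein name in two datasets gets two IDs.
PROT_DESC_TOY = (
    "proteinName,protein_dataset\n"
    "P_A,toy_set\n"
    "P_B,toy_set\n"
    "P_A,toy_set\n"
    "P_A,other_set\n"
)

protein_ids = build_protein_ids(io.StringIO(PROT_DESC_TOY))
id_to_key = {pid: key for key, pid in protein_ids.items()}
print(protein_ids)
```

Keying on the full tuple rather than on proteinName alone keeps proteins with the same name in different datasets apart, which is the point of the note above.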

simonfqy commented 4 years ago

Feel free to reopen it if you have further questions.

simonfqy / PADME

Dataset interpretation #12