hkmztrk closed 4 years ago
The drugs do not have IDs; all we need from them is their SMILES representation, which can be found in the restructured.csv file in each of the dataset folders, such as /davis_data/. The target proteins' keys are proteinName and protein_dataset (as can also be seen in the restructured.csv files), and each (proteinName, protein_dataset) tuple corresponds to a record in the prot_desc.csv file in each of the dataset folders, from which you will be able to retrieve the protein sequence and PSC descriptor.
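As a rough illustration of the lookup described above, here is a minimal sketch that joins interaction rows to protein records on the (proteinName, protein_dataset) key. It uses only the standard library and small in-memory stand-ins for the two files. The column names smiles, proteinName and protein_dataset come from the description above; the value and sequence column names, and all data values other than the protein name O15530, are placeholders assumed for the example and may not match the real files.

```python
import csv
import io

# Toy stand-ins for restructured.csv and prot_desc.csv.
# ASSUMPTION: the "value" and "sequence" column names and every data value
# below are placeholders; check the real headers before reusing this.
RESTRUCTURED = """smiles,proteinName,protein_dataset,value
CCO,O15530,davis,5.0
CCN,protB,davis,7.2
"""
PROT_DESC = """proteinName,protein_dataset,sequence
O15530,davis,AAAA
protB,davis,CCCC
"""


def load_protein_table(handle):
    """Index protein records by their (proteinName, protein_dataset) key."""
    table = {}
    for row in csv.DictReader(handle):
        key = (row["proteinName"], row["protein_dataset"])
        table[key] = row
    return table


def attach_sequences(interactions_handle, proteins):
    """Return (smiles, protein_key, sequence) for every interaction row."""
    result = []
    for row in csv.DictReader(interactions_handle):
        key = (row["proteinName"], row["protein_dataset"])
        result.append((row["smiles"], key, proteins[key]["sequence"]))
    return result


proteins = load_protein_table(io.StringIO(PROT_DESC))
pairs = attach_sequences(io.StringIO(RESTRUCTURED), proteins)
```

With real files, the io.StringIO objects would be replaced by handles opened with open(path, newline="") on the CSV files in a dataset folder.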
The fold divisions are not present in the CSV files. You can refer to https://github.com/simonfqy/PADME/blob/0d38d30e4f3b14002f29841dd228f28519a2583c/dcCustom/splits/splitters.py#L72 to see how the split is done and stored.
Thank you. I actually want to test my model on your test/train split, so I need at least the protein IDs to extract extra information. Can we say that the prot_desc.csv file corresponds to the target numbers in the previous example? For example, does O15530 correspond to target 1?
You can encode it in this way. Note that the key for each protein is a (proteinName, protein_dataset) tuple. Currently my code for splitting in /splits/splitters.py is very long and spaghetti-like. I will refactor it over the weekend to make it much more manageable and less error-prone. You're welcome to start right away without waiting for my update, though.
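If integer drug and target numbers are wanted, one possible encoding is to number each SMILES string and each (proteinName, protein_dataset) tuple in order of first appearance. The sketch below is only an assumption about how such an encoding could look; it is not taken from the repository, the helper name assign_ids is hypothetical, and the rows are toy placeholders.

```python
def assign_ids(keys):
    """Map each distinct key to a 1-based integer, in order of first appearance."""
    ids = {}
    for key in keys:
        if key not in ids:
            ids[key] = len(ids) + 1
    return ids


# Toy rows of (smiles, (proteinName, protein_dataset), value).
# ASSUMPTION: placeholder data, not real dataset records.
rows = [
    ("CCO", ("O15530", "davis"), 3.8),
    ("CCO", ("protB", "davis"), 5.1),
    ("CCN", ("O15530", "davis"), 6.4),
]

drug_ids = assign_ids(smiles for smiles, _, _ in rows)
target_ids = assign_ids(protein for _, protein, _ in rows)

# Each row becomes (drug number, target number, value).
encoded = [(drug_ids[s], target_ids[p], v) for s, p, v in rows]
```

The two dictionaries can be kept alongside the encoded rows, so that a number such as target 1 can always be traced back to its (proteinName, protein_dataset) key.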
Feel free to reopen it if you have further questions.
Hello,
Is there a file that maps the following data into the corresponding IDs?
drug,target,value,fold
1,1,3.799999907,1
For instance the IDs for drug 1 and target 1?
Thank you