mims-harvard / TDC

Therapeutics Commons (TDC-2): Multimodal Foundation for Therapeutic Science
https://tdcommons.ai
MIT License

Add pointers to data processing script & ChEMBL dataset update #135

Open AlanHassen opened 2 years ago

AlanHassen commented 2 years ago

The overall question: Is it possible to describe the preprocessing and data origin for different datasets?

Explanation: I am currently looking into using ChEMBL via TDC. For reproducibility, however, it is essential to know which version of the dataset is provided here and how it was preprocessed (cleaned, etc.). This matters especially because ChEMBL has recurring releases. In the source code (TDC/tdc/utils/load.py), the file is downloaded from "https://dataverse.harvard.edu/api/access/datafile/" without any explanation of its origin.

A solution: Add a "data origin/preprocessing" section to the documentation.

kexinhuang12345 commented 2 years ago

Hi Alan, thank you for raising this important point. We have a repo that tracks the preprocessing scripts for the majority of the datasets: https://github.com/kexinhuang12345/data_process. However, it is not cleaned up yet. Good data provenance is important, and we will work towards it by linking to these processing scripts on the website.

As for ChEMBL, unfortunately, the processing script seems to be missing. To address that, we plan to ship the most up-to-date ChEMBL version in the coming release and document the ChEMBL version on the website. If you have already used the current data, for now you could refer to it as the TDC version to make things clear. Hope this helps!
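Until the version is documented, one way to make a run traceable is to store the installed package version and a checksum of the downloaded file next to your results. The following is only a minimal sketch: `provenance_record` is a hypothetical helper and not part of TDC, the distribution name `PyTDC` is an assumption, and the file path is whatever location the data was saved to on your machine.

```python
import hashlib
from importlib import metadata
from pathlib import Path


def provenance_record(dataset_name, data_file, package="PyTDC"):
    """Describe a downloaded dataset file well enough to recognise it later.

    Hypothetical helper, not part of TDC. `package` is the assumed
    distribution name of the library that downloaded the file.
    """
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None  # nothing installed under this distribution name

    # Hash the file in 1 MiB chunks so large datasets need not fit in memory.
    digest = hashlib.sha256()
    with open(data_file, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)

    return {
        "dataset": dataset_name,
        "package": package,
        "package_version": version,
        "file": Path(data_file).name,
        "sha256": digest.hexdigest(),
    }
```

Saving the returned dictionary (for example as JSON) alongside your experiment outputs lets you check later whether a re-downloaded file is byte-identical to the one you used, even if the hosted file changes.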

kexinhuang12345 commented 2 years ago

Hi, ChEMBL V29 is now available as of release 0.3.5. You can load it via:

from tdc.generation import MolGen

# Load the ChEMBL V29 molecule set (available from release 0.3.5)
data = MolGen(name='ChEMBL_V29')