mims-harvard / TDC

Therapeutics Commons (TDC-2): Multimodal Foundation for Therapeutic Science
https://tdcommons.ai
MIT License
1.01k stars, 174 forks

Potential data leakage of ContextPred and AttrMasking on the ADMET group benchmark #166

Closed: lihan97 closed this issue 2 years ago

lihan97 commented 2 years ago

The ContextPred and AttrMasking methods on the ADMET leaderboard were pre-trained on the ChEMBL dataset (https://doi.org/10.1039/C8SC00148K) in a supervised manner. That dataset contains 1310 biochemical assays, several of which correspond to ADMET group benchmarks: CHEMBL1741321 corresponds to CYP2D6_Veith, CHEMBL1741324 to CYP3A4_Veith, CHEMBL1741325 to CYP2C9_Veith, CHEMBL1909136 to CYP2D6_Substrate_CarbonMangels, CHEMBL1909135 to CYP2C9_Substrate_CarbonMangels, and CHEMBL1909138 to CYP3A4_Substrate_CarbonMangels (see Table 2 in https://doi.org/10.1039/C8SC00148K). Because the pre-training labels come from the same assays as these benchmarks, the models may have seen test-set labels during pre-training. Please check for potential data leakage.
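A check like the one requested above could be sketched as follows. This is a minimal illustration, not code from TDC or DGL-LifeSci. The mapping is copied from this comment and has not been independently verified. The function names `find_leaked_benchmarks` and `molecule_overlap` are hypothetical, and the default `canonicalize` only strips whitespace. A real check would pass a proper SMILES canonicalizer, for example one built on RDKit.

```python
# Assay-to-benchmark mapping as listed in the comment above (not independently verified).
ASSAY_TO_BENCHMARK = {
    "CHEMBL1741321": "CYP2D6_Veith",
    "CHEMBL1741324": "CYP3A4_Veith",
    "CHEMBL1741325": "CYP2C9_Veith",
    "CHEMBL1909136": "CYP2D6_Substrate_CarbonMangels",
    "CHEMBL1909135": "CYP2C9_Substrate_CarbonMangels",
    "CHEMBL1909138": "CYP3A4_Substrate_CarbonMangels",
}


def find_leaked_benchmarks(pretraining_assay_ids):
    """Return {benchmark_name: assay_id} for every benchmark whose source
    assay also appears among the supervised pre-training tasks."""
    seen = set(pretraining_assay_ids)
    return {
        benchmark: assay_id
        for assay_id, benchmark in ASSAY_TO_BENCHMARK.items()
        if assay_id in seen
    }


def molecule_overlap(pretraining_smiles, test_smiles, canonicalize=str.strip):
    """Return the fraction of test molecules that also occur in the
    pre-training set, comparing SMILES after canonicalization.

    The default canonicalizer only strips whitespace (an assumption made to
    keep this sketch dependency-free); it does not unify equivalent SMILES.
    """
    pretraining = {canonicalize(s) for s in pretraining_smiles}
    test = [canonicalize(s) for s in test_smiles]
    if not test:
        return 0.0
    shared = sum(1 for s in test if s in pretraining)
    return shared / len(test)


if __name__ == "__main__":
    # "CHEMBL0000000" is a made-up placeholder ID used only for illustration.
    print(find_leaked_benchmarks(["CHEMBL1741321", "CHEMBL0000000"]))
    print(molecule_overlap(["CCO", "c1ccccc1"], ["CCO ", "CCN"]))
```

The assay-level check flags a benchmark as soon as its source assay is among the pre-training tasks. The molecule-level check estimates how much of a given test split is actually affected.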

kexinhuang12345 commented 2 years ago

Thanks for pointing this out. We will open an issue on the DGL-LifeSci GitHub repository to confirm whether these assays are indeed included in the pretraining procedure. If they are, we will add a note to these two baselines in the CYP-based benchmarks. Stay tuned.

kexinhuang12345 commented 2 years ago

It looks like there is indeed potential data leakage. Thanks for pointing it out! We are removing these two methods from the affected benchmarks.