ood drug belinostat leaks to pretraining dataset

theislab / chemCPA

Code for "Predicting Cellular Responses to Novel Drug Perturbations at a Single-Cell Resolution", NeurIPS 2022.

https://arxiv.org/abs/2204.13545

MIT License


ood drug belinostat leaks to pretraining dataset #149

Closed bhomass closed 1 year ago

bhomass commented 1 year ago

I don't know if you are aware of this: of the 32 OOD drugs you designated in sciplex_ood_splits.ipynb, meant for fine-tuning, at least one (belinostat) already exists in the pretraining data, which is split from lincs_full_smiles_sciplex_genes.h5ad.

On closer examination, there are 5 OOD drugs in the lincs_sciplex pre-training dataset.

So, if the fine-tuning model loads the pre-trained model, information about these drugs leaks through and the evaluation is not truly OOD.
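A check of this kind can be sketched in a few lines. The snippet below is only an illustration, not code from the repository: the function names and the toy drug lists are hypothetical, and matching is done on normalized names, whereas a stricter check would probably compare canonical SMILES strings.

```python
def normalize(name: str) -> str:
    """Lower-case and strip whitespace so trivial spelling variants match."""
    return name.strip().lower()


def find_leaked_drugs(ood_drugs, pretrain_drugs):
    """Return the sorted, normalized names that occur in both collections."""
    pretrain = {normalize(d) for d in pretrain_drugs}
    return sorted({normalize(d) for d in ood_drugs} & pretrain)


# Toy lists for illustration only; these are NOT the actual splits.
ood = ["Belinostat", "drug_a", "drug_b"]
pretrain = ["belinostat ", "drug_c", "drug_d"]

leaked = find_leaked_drugs(ood, pretrain)
print(leaked)  # ['belinostat']
```

In practice the two lists would come from the drug annotations of the fine-tuning split and of the LINCS pretraining data; which columns hold those annotations is not stated here, so that part is left out.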

MxMstrmn commented 1 year ago

Hi @bhomass,

Thanks for pointing this out! I will provide new checkpoints in an updated version of this repo where the LINCS data matches the single-cell setting better. While this is not ideal, I checked the number of data points corresponding to these drugs and they make up less than 0.3%. Given the strong shift between bulk and single-cell data, I am confident that the results still translate to the "true" OOD case.
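The share of data points mentioned above could be computed along the following lines. This is a minimal sketch under assumptions: `leaked_fraction` is a hypothetical helper, the per-observation drug labels are assumed to be available as a plain list of names, and the counts are made-up toy numbers rather than the real dataset.

```python
from collections import Counter


def leaked_fraction(observation_drugs, leaked_drugs):
    """Fraction of observations whose drug label is in the leaked set."""
    leaked = {d.strip().lower() for d in leaked_drugs}
    counts = Counter(d.strip().lower() for d in observation_drugs)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # avoid division by zero on empty input
    return sum(n for d, n in counts.items() if d in leaked) / total


# Toy data: 2 of 1000 observations belong to a leaked drug.
obs = ["belinostat"] * 2 + ["drug_a"] * 998
frac = leaked_fraction(obs, ["Belinostat"])
print(f"{frac:.2%}")  # 0.20%
```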