theislab / chemCPA

Code for "Predicting Cellular Responses to Novel Drug Perturbations at a Single-Cell Resolution", NeurIPS 2022.
https://arxiv.org/abs/2204.13545
MIT License

EXP: OOD single cell drug prediction `finetuning_OOD_prediction` #67

Closed: siboehm closed this issue 2 years ago

siboehm commented 2 years ago

Summary

Test how much pretraining on LINCS helps improve OOD drug prediction on Trapnell.

Why is this interesting?

This would allow accurate predictions of single-cell responses to unseen drugs, without spending more money on generating new datasets.

Implementation (precise)

  1. Pick 1-3 drugs that exist in both LINCS and Trapnell. They should be drugs that have a large effect on the transcriptome like Quisinostat (epigenetic), Flavopiridol (cell cycle regulation), and BMS-754807 (tyrosine kinase signaling).
  2. Pretrain 2 models:
    1. One model that is trained on all the LINCS data.
    2. One model that is trained on the LINCS data, with the 30 drugs to be tested left out.
  3. Finetune the two pretrained models on Trapnell (3 splits, each with 10 drugs left out).
  4. Train a model on Trapnell (3 splits, each with 10 drugs left out) without pretraining on LINCS.
  5. Calculate the R2_score for the drugs that were left out.
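Step 5 boils down to computing the coefficient of determination per held-out drug. A minimal sketch of that evaluation (the function and variable names here are hypothetical illustrations, not chemCPA's actual evaluation code; `true_profiles` is assumed to hold the measured mean gene-expression vector per held-out drug):

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean_true) ** 2 for y in y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot


def evaluate_ood(model_predict, held_out_drugs, true_profiles):
    """Score a model only on the drugs that were left out of training.

    model_predict: callable drug -> predicted mean expression vector
    true_profiles: dict drug -> measured mean expression vector
    """
    return {drug: r2_score(true_profiles[drug], model_predict(drug))
            for drug in held_out_drugs}
```

Comparing the three `evaluate_ood` score dicts (pretrained-with-drugs, pretrained-without-drugs, not pretrained) then gives the experiment's readout.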

Ideal outcome

The pretrained models perform better than the non-pretrained model, and the pretrained model that has seen the held-out drugs in LINCS performs better than the pretrained model that hasn't.

siboehm commented 2 years ago

Update on the 3 splits I already had, each with one drug left out: Quisinostat (epigenetic), Flavopiridol (cell cycle regulation), and BMS-754807 (tyrosine kinase signaling).

None of these three drugs exists in LINCS. However, there are drugs in LINCS that are very similar, even though they don't match exactly. I added a notebook in #73 to analyze this more efficiently.

Example for Quisinostat:

[image: side-by-side molecular structures of the Trapnell drug and its closest LINCS match]

Left is the Trapnell drug, right is the closest LINCS match. The Tanimoto similarity is 1.0, but it's not the same molecule.
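Tanimoto similarity on bit fingerprints is |A ∩ B| / |A ∪ B|, which explains how two distinct molecules can score 1.0: identical fingerprints do not imply identical structures. A minimal sketch over fingerprints represented as sets of on-bits (in practice one would compute Morgan fingerprints with RDKit; this pure-Python version just shows the metric itself):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 1.0  # convention: two empty fingerprints are identical
    return len(fp_a & fp_b) / len(fp_a | fp_b)
```

Two different molecules whose substructures hash to the same fingerprint bits would return 1.0 here, exactly as in the Quisinostat example above.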

siboehm commented 2 years ago

Ideally we'd leave out Trapnell drugs that are very distant from the other Trapnell drugs. That should result in the pretrained score being much better than the non-pretrained score.
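One way to pick such a hold-out set (a hypothetical sketch, assuming a precomputed pairwise Tanimoto similarity matrix): rank drugs by their mean distance to all other Trapnell drugs and hold out the most distant ones.

```python
def most_distant_drugs(similarity, k=3):
    """Return the k drugs with the highest mean Tanimoto distance (1 - similarity)
    to all other drugs. `similarity` maps unordered pairs, stored as
    (drug_a, drug_b) tuples, to a float in [0, 1]."""
    drugs = sorted({a for a, _ in similarity} | {b for _, b in similarity})

    def mean_distance(d):
        others = [x for x in drugs if x != d]
        return sum(1.0 - similarity.get((d, o), similarity.get((o, d), 0.0))
                   for o in others) / len(others)

    return sorted(drugs, key=mean_distance, reverse=True)[:k]
```

The drugs returned here would form the OOD split, where pretraining on LINCS should matter most.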

MxMstrmn commented 2 years ago

@siboehm I was thinking about creating a notebook that introduces the corresponding split into the Trapnell dataset. What do you think? I can then also follow up on the comment I made in #73.

Also, I like the above description! For step 2, this seems quite involved. One option would be to leave out all drugs that we use for OOD and then have only two LINCS models in total, rather than two per OOD drug.

siboehm commented 2 years ago

Yes, I agree. Overall there is no need for very many splits, since we can integrate multiple experiments into a single split. For example, instead of three splits with one drug left out each, a single split that leaves out all three drugs would make little difference and save time.
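Merging the single-drug splits into one combined split could be sketched like this (a hypothetical illustration; the drug names are the three from this issue, while the split labels and function are placeholders, not chemCPA's actual split code, which stores splits in the AnnData `.obs`):

```python
# All three OOD drugs go into one split instead of one split per drug.
OOD_DRUGS = {"Quisinostat", "Flavopiridol", "BMS-754807"}


def assign_splits(cell_drugs, ood_drugs=OOD_DRUGS):
    """Label each cell 'ood' if it was treated with a held-out drug, else 'train'."""
    return ["ood" if drug in ood_drugs else "train" for drug in cell_drugs]
```

Training once on the `'train'` cells and evaluating on the `'ood'` cells then covers all three drugs in a single run.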

siboehm commented 2 years ago

We have everything we need to run this experiment once #81 is merged; Leon will write the YAML.

MxMstrmn commented 2 years ago

Closing, not relevant anymore.