theislab / chemCPA

Code for "Predicting Cellular Responses to Novel Drug Perturbations at a Single-Cell Resolution", NeurIPS 2022.
https://arxiv.org/abs/2204.13545
MIT License

chemCPA vs chemCPA pretrained #147

Closed bhomass closed 9 months ago

bhomass commented 1 year ago

Sorry to belabor some of the details. As with my question about the baseline, I want to make sure I understand exactly which experiments are being compared.

In the Figure 2 results, I take "chemCPA" to mean training on the sciplex data from scratch, i.e. step 3 of the settings outlined in https://github.com/theislab/chemCPA/tree/main/experiments:

  1. Pretrain on LINCS (~900 genes), finetune on Trapnell (same ~900 genes)
  2. Pretrain on LINCS (~900 genes), finetune on Trapnell (2000 genes)
  3. Train from Scratch on Trapnell (900 genes)
  4. Train from Scratch on Trapnell (2000 genes)

whereas "chemCPA pretrained" is step 1.

Is that correct?
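To make the reading proposed in this question concrete, here is a minimal Python sketch of the four settings and the labels I am assuming for them. The `Setting` class, the `label` helper, and the mapping of steps to the "chemCPA" / "chemCPA pretrained" names are all hypothetical and reflect only my interpretation, not anything confirmed by the authors; 900 is the approximate gene count from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setting:
    """One experimental setting from the list above (hypothetical container)."""
    step: int
    pretrained_on_lincs: bool
    n_genes: int  # 900 is approximate, as in the list above


SETTINGS = [
    Setting(step=1, pretrained_on_lincs=True, n_genes=900),
    Setting(step=2, pretrained_on_lincs=True, n_genes=2000),
    Setting(step=3, pretrained_on_lincs=False, n_genes=900),
    Setting(step=4, pretrained_on_lincs=False, n_genes=2000),
]


def label(setting):
    """Assumed naming: pretrained runs are 'chemCPA pretrained', the rest 'chemCPA'."""
    return "chemCPA pretrained" if setting.pretrained_on_lincs else "chemCPA"


def matched_pairs(settings):
    """Pair each pretrained setting with the from-scratch setting on the same gene set."""
    pairs = []
    for pre in settings:
        if not pre.pretrained_on_lincs:
            continue
        for scratch in settings:
            if not scratch.pretrained_on_lincs and scratch.n_genes == pre.n_genes:
                pairs.append((pre.step, scratch.step))
    return pairs


for s in SETTINGS:
    print(s.step, label(s), s.n_genes)
print(matched_pairs(SETTINGS))
```

Under this reading, the like-for-like comparison on the shared gene set is step 1 against step 3.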

I am also wondering: if you stay with the same gene set of 977 genes and only add a little over 100 drugs, does the fine-tuning really make much difference? In other words, the OOD predictions from the pretrained model alone may already be good enough if you simply apply it to the sciplex drugs that were also in the LINCS dataset.

MxMstrmn commented 1 year ago

Hey @bhomass,

the crucial difference between the datasets is the gene expression readout. The bulk (LINCS) and single-cell (sciplex) data come from different experimental assays and have distinct characteristics. So the idea is to train on the bulk data first (where many drugs were tested) and then transfer this "knowledge" to the single-cell setting.
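As a rough illustration of what "transferring" from bulk to single-cell can look like in practice, here is a generic sketch of shape-matched parameter reuse. This is a common transfer-learning pattern, not a description of the actual chemCPA finetuning code; the layer names and sizes below are made up for the example.

```python
def split_transferable(pretrained_shapes, target_shapes):
    """Split target parameters into those that can be copied from the
    pretrained model (same name and shape) and those that must be
    reinitialized (missing or shape mismatch)."""
    reuse, reinit = [], []
    for name, shape in target_shapes.items():
        if pretrained_shapes.get(name) == shape:
            reuse.append(name)
        else:
            reinit.append(name)
    return reuse, reinit


# Hypothetical parameter shapes for a model pretrained on bulk data (977 genes).
bulk_model = {
    "gene_encoder.weight": (256, 977),
    "drug_embedding.weight": (128, 2048),
    "gene_decoder.weight": (977, 256),
}

# Hypothetical single-cell model on a different 2000-gene set.
single_cell_2000 = {
    "gene_encoder.weight": (256, 2000),
    "drug_embedding.weight": (128, 2048),
    "gene_decoder.weight": (2000, 256),
}

reuse, reinit = split_transferable(bulk_model, single_cell_2000)
print("copied from pretrained:", reuse)
print("reinitialized:", reinit)
```

In this toy setup, only the drug-side parameters carry over when the gene set changes, while everything carries over when the gene set is identical; either way the gene-expression layers still need finetuning because the assay differs.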

bhomass commented 9 months ago

It sounds like, if you ran the fine-tuning experiment with the common 977 genes but without loading the pretrained model, the chemCPA (non-pretrained) experiment would only be using the 188 sciplex drugs. The resulting representation would be extremely underwhelming. Is that indeed what the "chemCPA" numbers in Tables 1, 2, and 3 mean?