Closed by bhomass 9 months ago
Hey @bhomass,
the crucial difference between the datasets is the gene expression readout. The bulk (LINCS) and single-cell (sciplex) data come from different experimental assays and have distinct characteristics. So the idea is to train on the bulk data first (where many drugs were tested) and then transfer this "knowledge" to the single-cell setting.
It sounds like, if you ran the fine-tuning experiment with the common 977 genes and without loading the pretrained model, the chemCPA (non-pretrained) experiment would only be using the 188 sciplex drugs. The resulting representation would be extremely underwhelming. Is that indeed what the "chemCPA" numbers in Tables 1, 2, and 3 mean?
Sorry to belabor some of the details. As with the question about the baseline, I want to make sure I understand what each of the compared tests actually is.
In the Figure 2 results, chemCPA means training on the sciplex data from scratch, as outlined in step 3 of https://github.com/theislab/chemCPA/tree/main/experiments,
whereas chemCPA pretrained is step 1.
Is that correct?
I am wondering: if you stay with the same set of 977 genes and only add a little over 100 drugs, does the fine-tuning really make much of a difference? In other words, the OOD predictions may already be good enough from the pretrained model alone, if you just apply it to the sciplex drugs that were also in the LINCS dataset.
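To make the question concrete, here is a minimal sketch of how one could separate the sciplex drugs that the pretrained model has already seen in LINCS from those it has not. It does not use the chemCPA code: the function name and the toy drug names are hypothetical, and matching on normalized names is an assumption (matching on canonical SMILES may be more reliable in practice).

```python
def split_by_pretraining_overlap(sciplex_drugs, lincs_drugs):
    """Split sciplex drugs into those also present in LINCS and those that are not.

    Hypothetical helper: drugs are matched on lower-cased, stripped names.
    """
    lincs = {d.strip().lower() for d in lincs_drugs}
    seen, unseen = [], []
    for drug in sciplex_drugs:
        if drug.strip().lower() in lincs:
            seen.append(drug)      # already covered by LINCS pretraining
        else:
            unseen.append(drug)    # only reachable via fine-tuning or OOD prediction
    return seen, unseen


# Toy placeholder names, not real dataset contents.
sciplex_drugs = ["drug_a", "Drug_B", "drug_c", "drug_d"]
lincs_drugs = ["drug_b", "drug_c", "drug_x"]

seen, unseen = split_by_pretraining_overlap(sciplex_drugs, lincs_drugs)
print("seen in LINCS:", seen)
print("not in LINCS:", unseen)
```

Evaluating the pretrained-only model on the `seen` subset, and comparing it against the fine-tuned model on the same subset, would be one way to check whether fine-tuning adds anything for those drugs.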