Closed siboehm closed 2 years ago
Leon does the clustering, Simon writes the predictor
A bit unclear to me: how many different clusters are sensible given that we have 17k different drugs?
The number of clusters is probably less important than the number of samples per cluster. For reference, these are the sample counts for the drugs with the lowest occurrence in Trapnell:
```
>>> adata.obs["condition"].value_counts()[-10:]
Rigosertib      984
Luminespib      980
Tozasertib      975
Mocetinostat    949
Alvespimycin    930
AT9283          910
Patupilone      757
Flavopiridol    693
Epothilone      583
YM155           394
Name: condition, dtype: int64
```
So it might be nice to have >500 samples per cluster in LINCS. I was somewhat surprised to see that LINCS isn't even that big: 840,677 samples in LINCS vs 290,888 samples in Trapnell.
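A quick way to sanity-check cluster sizes would be something like the sketch below. All names and shapes here are made-up stand-ins (random embeddings for 1,000 drugs instead of the real >17k LINCS drugs); the point is just: cluster the drugs, then sum per-drug sample counts within each cluster and check the minimum.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the real data: 1,000 drugs with
# 64-dim embeddings, plus a per-drug sample count.
drug_embeddings = rng.normal(size=(1000, 64))
samples_per_drug = pd.Series(rng.integers(5, 200, size=1000))

# Cluster the drugs, then aggregate sample counts per cluster.
clusters = KMeans(n_clusters=50, n_init=10, random_state=0).fit_predict(drug_embeddings)
samples_per_cluster = samples_per_drug.groupby(clusters).sum()

# Ideally every cluster ends up with comfortably >500 samples.
print(samples_per_cluster.min())
```

With the real LINCS data the number of clusters would be tuned until the smallest cluster clears the ~500-sample threshold.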
@siboehm I do not remember, did we merge my cluster assignments already?
No. You did generate the clusters afaik, but we never merged any code. So far all I've done is port it from a binary cross entropy (BCE) loss to a standard cross entropy loss, since BCE requires creating a one-hot encoded (OHE) target vector, which again would take way too long.
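For reference, the difference in a minimal sketch (shapes are made up): `CrossEntropyLoss` consumes integer class indices directly, while `BCEWithLogitsLoss` would force us to materialize a dense one-hot target of width `n_drugs`, which is wasteful for >17k classes.

```python
import torch
import torch.nn as nn

n_drugs, batch = 17_000, 32
logits = torch.randn(batch, n_drugs)            # adversary's raw scores per drug
drug_idx = torch.randint(0, n_drugs, (batch,))  # integer class labels

# Standard cross entropy: targets are just indices, no OHE needed.
ce = nn.CrossEntropyLoss()(logits, drug_idx)

# BCE would require a full (batch, n_drugs) one-hot matrix first.
one_hot = torch.zeros(batch, n_drugs).scatter_(1, drug_idx[:, None], 1.0)
bce = nn.BCEWithLogitsLoss()(logits, one_hot)
print(ce.item(), bce.item())
```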
Currently the CPA Adv predictor outputs a probability distribution over all possible drugs, given the basal state. This works well for datasets like Trapnell (188 drugs) but for LINCS (>17K drugs) it is not feasible because there aren't enough samples of each drug.
Therefore this needs to be adapted. Different approaches are possible during finetuning:
The final CPA model is used to predict counterfactuals. For this to work, all information about the drug has to have been removed from the latent basal state. This may not fully be the case when we predict just the cluster assignment. Example: if there are potent and less potent drugs in each cluster, then a notion of potency may remain in the latent basal state even though the cluster can no longer be predicted. If we use strategy 1) during finetuning, we'll definitely ensure "latent basal state drug ambivalence" for the final model.

An alternative approach would be to have the Adv Predictor predict the drug embedding directly (using a smaller, <1000-dim drug embedding) and use the cosine distance to the "true" drug embedding as the adversarial loss. This is similar to how BERT models predict words. For now the cluster strategy seems more promising.
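The embedding-regression alternative could look roughly like this. This is a sketch under assumptions: the adversary head, the dimensions, and the variable names are all hypothetical, and the real drug embeddings would come from whatever featurizer we settle on.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, emb_dim, batch = 128, 256, 32

# Hypothetical adversary head: regress from the latent basal state
# to the drug embedding instead of classifying over >17k drugs.
adv_head = nn.Linear(latent_dim, emb_dim)

latent_basal = torch.randn(batch, latent_dim)
true_drug_emb = torch.randn(batch, emb_dim)  # stand-in for real drug embeddings

pred_emb = adv_head(latent_basal)

# Adversarial loss: 1 - cosine similarity. The adversary minimizes this;
# the encoder is trained against it, so the drug embedding becomes
# unpredictable from the basal state.
adv_loss = (1 - F.cosine_similarity(pred_emb, true_drug_emb, dim=1)).mean()
print(adv_loss.item())
```

One upside of this formulation is that the output dimensionality stays fixed (<1000) no matter how many drugs LINCS contains, sidestepping the per-drug sample-count problem entirely.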