Remove One-Hot-Encoding

theislab / chemCPA

Code for "Predicting Cellular Responses to Novel Drug Perturbations at a Single-Cell Resolution", NeurIPS 2022.

https://arxiv.org/abs/2204.13545

MIT License


Remove One-Hot-Encoding #20

Closed siboehm closed 2 years ago

siboehm commented 2 years ago

After loading a dataset, the current code generates a one-hot encoding (OHE) for the drugs. This takes far too long on LINCS, because there are 17K drugs and the OHE generation runs in a single CPU thread.

This should be sped up, either by removing the OHE and working with indices instead, or by somehow speeding up the OHE generation.
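A minimal sketch of the index-based option, assuming the per-cell drug labels are available as an array of strings. The names `drug_per_cell` and `embeddings` and the toy values are hypothetical and not taken from the chemCPA code base. It shows that looking up rows by integer index gives the same result as multiplying a one-hot matrix with the embedding table, without materialising the (cells x drugs) matrix.

```python
import numpy as np

# Hypothetical per-cell drug labels; in practice these would come from the dataset.
drug_per_cell = np.array(["drugA", "drugC", "drugB", "drugA", "drugC"])

# Map each cell to an integer index into the sorted array of unique drugs.
drug_names, drug_idx = np.unique(drug_per_cell, return_inverse=True)

# Toy embedding table with one row per drug (values are arbitrary).
n_drugs, emb_dim = len(drug_names), 4
embeddings = np.arange(n_drugs * emb_dim, dtype=np.float64).reshape(n_drugs, emb_dim)

# Index-based lookup: no (n_cells x n_drugs) matrix is ever built.
by_index = embeddings[drug_idx]

# Equivalent one-hot route, shown only for comparison.
one_hot = np.eye(n_drugs)[drug_idx]
by_one_hot = one_hot @ embeddings

assert np.array_equal(by_index, by_one_hot)
```

With 17K drugs, the one-hot matrix has 17K columns per cell, while the index array stores a single integer per cell.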

siboehm commented 2 years ago

Additional issue: On LINCS, the adversarial classifier is solving a classification task over 17K classes. Skimming through the distribution of perturbations, it seems that there are only 1500 drugs that appear more than 100 times.

Therefore I don't think it makes sense to pretrain on LINCS using the same adversarial classifier as on Trapnell. Alternatives:

- Group the drugs somehow into larger classes
- Have the classifier predict an embedding. There we can use any of the embeddings we'll implement as part of #15
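The frequency check mentioned in this comment (how many drugs appear more than 100 times) could be reproduced roughly as follows. This is only a sketch: the function name `count_frequent_drugs` and the toy labels are hypothetical, and on the real data the input would be the per-cell perturbation column.

```python
from collections import Counter

def count_frequent_drugs(perturbations, min_count=100):
    """Return (number of distinct drugs, number appearing more than min_count times)."""
    counts = Counter(perturbations)
    frequent = sum(1 for c in counts.values() if c > min_count)
    return len(counts), frequent

# Toy stand-in for the per-cell drug labels (not real LINCS data).
labels = ["drugA"] * 150 + ["drugB"] * 101 + ["drugC"] * 100 + ["drugD"] * 3
n_total, n_frequent = count_frequent_drugs(labels)
print(n_total, n_frequent)
```

Here `drugC` appears exactly 100 times and is therefore not counted as frequent, since the threshold is strict.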

MxMstrmn commented 2 years ago

Totally agree on that point. My intuition is that clustering on the latent space of the Grover model would be a good idea. This can be done like this:

import pandas as pd
from anndata import AnnData

# DATA_DIR (a pathlib.Path) and embedding_dim_grover are assumed to be defined elsewhere
# load embedding
df = pd.read_parquet(DATA_DIR / 'embeddings' / 'grover_embedding')
X = df[[f'embedding_{i}' for i in range(embedding_dim_grover)]].values
adata = AnnData(X)
adata.obs_names = df.index  # this should be the SMILES strings
# and then the normal workflow via `pca` -> `neighbors` -> `umap` -> `leiden`