Closed siboehm closed 2 years ago
Additional issue: On LINCS the adversarial classifier is solving a classification task over 17K classes. Skimming through the distribution of perturbations, it seems that only 1500 drugs appear more than 100 times.
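A minimal sketch of how such a count could be checked, using synthetic labels rather than the LINCS data. The 50 drugs, their long-tailed frequencies, and the variable names are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the perturbation column: 50 drugs with a
# long-tailed frequency distribution over 10,000 observations.
n_drugs, n_obs = 50, 10_000
probs = 1.0 / np.arange(1, n_drugs + 1)
probs /= probs.sum()
perturbations = rng.choice(n_drugs, size=n_obs, p=probs)

# Count how often each drug occurs.
drugs, counts = np.unique(perturbations, return_counts=True)

# Keep only drugs seen more than `min_count` times (the threshold used above).
min_count = 100
mask = counts > min_count
frequent = drugs[mask]

# Fraction of all observations that belong to the frequent drugs.
coverage = counts[mask].sum() / n_obs
```

Here `len(frequent)` would be the number of classes left for the classifier if rare drugs were dropped or merged, and `coverage` shows how much of the data those classes still account for.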
Therefore I don't think it makes sense to pretrain on LINCS using the same adversarial classifier as on Trapnell. Alternatives:
Totally agree on that point. My intuition is that clustering the latent space of the Grover model would be a good idea. This can be done like this:
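import pandas as pd
from anndata import AnnData

# load embedding (DATA_DIR and embedding_dim_grover are assumed to be defined elsewhere)
df = pd.read_parquet(DATA_DIR / 'embeddings' / 'grover_embedding')
# collect the embedding columns into a (n_molecules, embedding_dim_grover) array
X = df[[f'embedding_{i}' for i in range(embedding_dim_grover)]].values
adata = AnnData(X)
adata.obs_names = df.index  # this should be the SMILES strings
# And then the normal workflow via `pca` -> `neighbors` -> `umap` -> `leiden`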
After loading a dataset, the current code generates a one-hot encoding (OHE) for the drugs. This takes far too long on LINCS, since there are 17K drugs and the OHE generation runs in a single CPU thread.
This should be sped up, either by dropping the OHE and working with indices instead, or by somehow speeding up the OHE generation.
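As a rough illustration of the index-based option, the sketch below uses NumPy on made-up drug names. It is not the project's actual loader, and `drug_names`, `embeddings`, and the array sizes are hypothetical. It maps every observation to an integer index in one vectorized call, then checks that indexing an embedding table by those indices gives the same result as multiplying by the dense one-hot matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the per-cell drug labels of a dataset.
drug_names = np.array(
    ["drugA", "drugB", "drugC", "drugB", "drugA", "drugC", "drugC"]
)

# Vectorized mapping from label to integer index; no Python loop over cells.
unique_drugs, drug_idx = np.unique(drug_names, return_inverse=True)
n_drugs = len(unique_drugs)

# Hypothetical embedding table with one row per drug.
embedding_dim = 4
embeddings = rng.normal(size=(n_drugs, embedding_dim))

# Index-based lookup: no (n_cells x n_drugs) matrix is ever materialized.
emb_from_idx = embeddings[drug_idx]

# Dense one-hot encoding, built here only to check equivalence.
one_hot = np.eye(n_drugs)[drug_idx]
emb_from_ohe = one_hot @ embeddings

same = np.allclose(emb_from_idx, emb_from_ohe)
```

With 17K drugs the one-hot matrix has 17K columns per observation, while the index array stores a single integer per observation, so the lookup avoids both the slow construction and the memory cost. If a one-hot matrix is still needed somewhere, the `np.eye(n_drugs)[drug_idx]` line shows one vectorized way to build it from the indices.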