theislab / chemCPA

Code for "Predicting Cellular Responses to Novel Drug Perturbations at a Single-Cell Resolution", NeurIPS 2022.
https://arxiv.org/abs/2204.13545
MIT License

Add chemical representation #24

Closed siboehm closed 2 years ago

siboehm commented 2 years ago

This has become much simpler since we first discussed it in #15 a few weeks ago. The PR loads the dataframe with the precalculated embeddings, extracts the embeddings for all relevant SMILES in the correct order, and converts the result to a torch.nn.Embedding.
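For readers following along, the load / reorder / convert steps described above could look roughly like the sketch below. It is not the code from this PR: the function name `embedding_from_dataframe`, the toy dataframe, and the assumption that the dataframe is indexed by SMILES string are all hypothetical and chosen only for illustration.

```python
import pandas as pd
import torch


def embedding_from_dataframe(df: pd.DataFrame, smiles: list[str]) -> torch.nn.Embedding:
    """Build a frozen embedding whose row i holds the vector for smiles[i].

    Assumes (hypothetically) that `df` is indexed by SMILES string and that
    every column is one embedding dimension.
    """
    missing = [s for s in smiles if s not in df.index]
    if missing:
        raise KeyError(f"No precalculated embedding for: {missing}")
    # Selecting with a list of labels returns the rows in the requested order,
    # which is what keeps the embedding aligned with the drug indices.
    weights = torch.tensor(df.loc[smiles].to_numpy(), dtype=torch.float32)
    return torch.nn.Embedding.from_pretrained(weights, freeze=True)


# Toy stand-in for the dataframe of precalculated embeddings.
toy_df = pd.DataFrame(
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    index=["CCO", "CS(C)=O", "c1ccccc1"],
)
drug_order = ["CS(C)=O", "CCO"]  # order in which the model indexes its drugs
emb = embedding_from_dataframe(toy_df, drug_order)
```

Here index 0 of `emb` maps to the vector stored for `CS(C)=O`, regardless of where that row sits in the dataframe. Freezing the weights is an assumption of this sketch; whether the embeddings should stay trainable is a separate design choice.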

Notes:

This PR is based on #23, which should be merged first. I'll go through this PR again tomorrow and add a test or two to make sure the sorting works as intended.

Closes #15

siboehm commented 2 years ago

Still TODO: Properly encode the control embedding. @MxMstrmn we need the SMILES string for the drug that was used as control in both Trapnell and LINCS. Ideally we'd just add this to the dataset (right now the SMILES column is empty for drug==control).

Other than that this is ready to go now.

siboehm commented 2 years ago

According to Mo, the control used in Trapnell and LINCS is DMSO: CS(C)=O. In trapnell_cpa.h5ad the "control" condition has an empty SMILES string, which I fixed via:

adata.obs["SMILES"] = adata.obs["SMILES"].cat.rename_categories({"": "CS(C)=O"})

In LINCS, DMSO already has the correct SMILES set.
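A small self-contained sketch of the same rename, using a plain pandas dataframe as a stand-in for adata.obs (the toy rows and the "condition" column name are assumptions made for illustration; only the rename_categories call mirrors the fix above):

```python
import pandas as pd

DMSO_SMILES = "CS(C)=O"

# Toy stand-in for adata.obs; the real object comes from trapnell_cpa.h5ad.
obs = pd.DataFrame(
    {
        "condition": ["control", "drugA", "control"],
        "SMILES": pd.Categorical(["", "CCO", ""]),
    }
)

# Only rename when the empty category is actually present, so the snippet
# is safe to re-run on data that has already been fixed.
if "" in obs["SMILES"].cat.categories:
    obs["SMILES"] = obs["SMILES"].cat.rename_categories({"": DMSO_SMILES})

# Sanity check: every control cell should now carry the DMSO SMILES.
control_smiles = obs.loc[obs["condition"] == "control", "SMILES"].unique()
```

Renaming the category instead of assigning per row keeps the column categorical and touches every control cell at once. It assumes the DMSO SMILES is not already a separate category in the same column, since category labels must be unique.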