Add chemical representation

siboehm commented 2 years ago

This has become much simpler since we first talked about #15 a few weeks ago. It now mainly loads the dataframe with the precalculated embeddings, extracts the embeddings for all relevant SMILES in the correct order, and converts them to a torch.nn.Embedding.
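For illustration, the load, reorder, and convert steps can be sketched roughly as follows. This is a minimal sketch, not the code in this PR: it assumes the dataframe is indexed by SMILES string with one embedding dimension per column, and the function name `embedding_from_dataframe` and the toy values are hypothetical.

```python
from typing import Sequence

import pandas as pd
import torch


def embedding_from_dataframe(
    df: pd.DataFrame, smiles_in_order: Sequence[str]
) -> torch.nn.Embedding:
    """Build a frozen embedding whose row i belongs to smiles_in_order[i].

    Assumption: `df` is indexed by SMILES and every column is one
    embedding dimension (hypothetical layout, not taken from the PR).
    """
    # .loc with a list of labels returns the rows in the order of that list
    # and raises a KeyError if a requested SMILES is missing from the index.
    ordered = df.loc[list(smiles_in_order)]
    weights = torch.tensor(ordered.to_numpy(), dtype=torch.float32)
    # freeze=True keeps the precalculated vectors fixed during training.
    return torch.nn.Embedding.from_pretrained(weights, freeze=True)


# Toy stand-in for the precalculated embeddings (made-up numbers).
df = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    index=["CCO", "CS(C)=O", "c1ccccc1"],
)

# The dataset's drug order differs from the dataframe's row order.
emb = embedding_from_dataframe(df, ["CS(C)=O", "CCO"])
first_drug_vector = emb(torch.tensor([0]))  # row for "CS(C)=O"
```

Selecting by label rather than by position is what keeps the embedding rows aligned with the dataset's drug indices, which is the property a sorting test would need to check.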

Notes:

I changed Grover to output just a single dataframe. This means we don't have to specify the dataset when we load the embedding. The dataframe needs to contain all relevant SMILES (LINCS + Trapnell for now).
So far it doesn't support combinations of perturbations, mainly because I didn't have a dataset with combinations of drugs and SMILES. Incorporating this involves adding ~2 lines.

This PR is based on #23, which should be merged first. I'll go through this PR again tomorrow and add a test or two to make sure the sorting works as intended.

Closes #15

siboehm commented 2 years ago

Still TODO: Properly encode the control embedding. @MxMstrmn we need the SMILES string for the drug that was used as control in both Trapnell and LINCS. Ideally we'd just add this to the dataset (right now the SMILES column is empty for drug==control).

Other than that this is ready to go now.

siboehm commented 2 years ago

According to Mo, the control used in Trapnell & LINCS is DMSO: CS(C)=O. In trapnell_cpa.h5ad the "control" condition has an empty SMILES, which I fixed via:

adata.obs["SMILES"] = adata.obs["SMILES"].cat.rename_categories({"": "CS(C)=O"})

In LINCS, DMSO already has the correct SMILES set.
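The effect of the category rename can be seen on a toy table. This is only a sketch: the `obs` dataframe below is a made-up stand-in for `adata.obs`, and it assumes the missing SMILES is stored as an empty-string category rather than as NaN.

```python
import pandas as pd

# Hypothetical stand-in for adata.obs with a categorical SMILES column.
obs = pd.DataFrame(
    {
        "condition": ["control", "drug_a", "control"],
        "SMILES": pd.Categorical(["", "CCO", ""]),
    }
)

DMSO_SMILES = "CS(C)=O"

# Rename the empty-string category; categories not in the mapping
# ("CCO" here) are passed through unchanged.
obs["SMILES"] = obs["SMILES"].cat.rename_categories({"": DMSO_SMILES})

smiles_after = obs["SMILES"].tolist()
```

Because only the category label changes, every row that previously had the empty SMILES is updated at once and the column stays categorical.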

theislab / chemCPA

Add chemical representation #24