Closed siboehm closed 2 years ago
Still TODO: Properly encode the control embedding. @MxMstrmn we need the SMILES string for the drug that was used as control in both Trapnell and LINCS. Ideally we'd just add this to the dataset (right now the SMILES column is empty for drug==control).
Other than that this is ready to go now.
According to Mo, the control used in Trapnell & LINCS is DMSO: CS(C)=O. In trapnell_cpa.h5ad the "control" condition has an empty SMILES; I adjusted it via:
adata.obs["SMILES"] = adata.obs["SMILES"].cat.rename_categories({"": "CS(C)=O"})
In LINCS, DMSO already has the correct SMILES set.
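For reference, here is a minimal sketch of what that rename does, run on a toy dataframe instead of the real adata.obs. The column values and the `obs` variable are made up for illustration; only the `rename_categories` call mirrors the line above.

```python
import pandas as pd

# Toy stand-in for adata.obs (hypothetical values, not the real dataset).
obs = pd.DataFrame(
    {
        "condition": ["control", "drugA", "control"],
        "SMILES": pd.Categorical(["", "CCO", ""]),
    }
)

# Rename the empty-string category to the DMSO SMILES.
# Categories not listed in the mapping are left unchanged.
obs["SMILES"] = obs["SMILES"].cat.rename_categories({"": "CS(C)=O"})

print(obs)
```

Since only the category label changes, the underlying codes stay the same and every former empty entry now reads as DMSO.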
This has become much simpler since we first talked about #15 a few weeks ago. Mainly it just loads the dataframe with the precalculated embeddings, extracts the embeddings for all relevant SMILES in the correct order and converts that to a torch.nn.Embedding.
Notes:
This PR is based on #23, which should be merged first. I'll go through this PR again tomorrow and add a test or two to make sure the sorting works as intended.
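To make the loading step and the ordering concern concrete, here is a rough sketch of the idea, not the code in this PR. It assumes the precalculated embeddings live in a dataframe indexed by SMILES with one row per molecule; `embedding_from_dataframe`, `embedding_df` and `smiles_order` are hypothetical names.

```python
import pandas as pd
import torch


def embedding_from_dataframe(embedding_df, smiles_order):
    """Build a frozen torch.nn.Embedding whose row i is the vector of smiles_order[i].

    embedding_df: dataframe indexed by SMILES, one row per molecule (assumed layout).
    smiles_order: SMILES in the order the model indexes its drugs (assumed input).
    """
    missing = [s for s in smiles_order if s not in embedding_df.index]
    if missing:
        raise KeyError(f"No precalculated embedding for: {missing}")
    # .loc with a list of labels returns the rows in the order of that list.
    ordered = embedding_df.loc[list(smiles_order)]
    weights = torch.tensor(ordered.to_numpy(dtype="float32"))
    return torch.nn.Embedding.from_pretrained(weights, freeze=True)


# Toy example with 2-dimensional vectors (made-up numbers).
embedding_df = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    index=["CCO", "CS(C)=O", "C"],
)
smiles_order = ["CS(C)=O", "C"]
embedding = embedding_from_dataframe(embedding_df, smiles_order)

# The kind of sorting check mentioned in the notes: index i must return
# the vector stored for smiles_order[i].
for i, smiles in enumerate(smiles_order):
    expected = torch.tensor(embedding_df.loc[smiles].to_numpy(dtype="float32"))
    assert torch.equal(embedding(torch.tensor(i)), expected)
```

A test along these lines would catch the case where the dataframe order and the model's drug index silently disagree.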
Closes #15