theislab / chemCPA

Code for "Predicting Cellular Responses to Novel Drug Perturbations at a Single-Cell Resolution", NeurIPS 2022.
https://arxiv.org/abs/2204.13545
MIT License
88 stars 23 forks

Include all observations from the sciplex data #65

Closed MxMstrmn closed 2 years ago

MxMstrmn commented 2 years ago

Closes #61

The first notebook performs the gene matching with the LINCS data and skips the subsetting that was applied before. The second notebook is an updated version of the one that adds SMILES strings to the .obs dataframe. As a result, all .h5ad files in the PROJECT_DIR/'datasets' folder are updated and ready to be used in our model sweeps.
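
As a rough illustration of what gene matching between two datasets can look like (a sketch under assumptions, not the notebook's actual code), the snippet below intersects the `gene_id` columns of two toy `.var`-style tables. The table contents, the gene identifiers, and the helper name `shared_gene_ids` are made up for the example.

```python
import pandas as pd


def shared_gene_ids(var_a: pd.DataFrame, var_b: pd.DataFrame) -> list:
    """Return the gene IDs present in both tables, in the order of var_a.

    Hypothetical helper: assumes both tables carry a `gene_id` column.
    """
    ids_b = set(var_b["gene_id"])
    return [gene for gene in var_a["gene_id"] if gene in ids_b]


# Toy stand-ins for the .var dataframes of two AnnData objects.
# The identifiers are placeholders, not real gene IDs.
var_lincs = pd.DataFrame({"gene_id": ["gene_1", "gene_2", "gene_3"]})
var_sciplex = pd.DataFrame({"gene_id": ["gene_3", "gene_1", "gene_4"]})

common = shared_gene_ids(var_lincs, var_sciplex)
print(common)
```

Keeping the order of the first table makes the result deterministic, which matters if the list is later used to reorder the columns of both datasets.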

siboehm commented 2 years ago

Looks good, a true data printing machine. I may schedule a few runs just to check.

  1. Do we have the gene_id stored in both LINCS & Trapnell? We'll need this for the surgery.
  2. Let's hope that these are still the exact same SMILES now, else we'll have to redo all our embeddings.

MxMstrmn commented 2 years ago

  1. Yes, we do. Just access .var.gene_id.
  2. They are the same SMILES, since I am also loading them from adata_cpi. The only difference is that I perform the matching via a dictionary (drug_name, smiles), which imo is the preferable method.
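
A minimal sketch of the dictionary-based matching mentioned in point 2, assuming the drug names live in an `.obs` column called `drug_name` and the result goes into a column called `smiles` (both column names are assumptions). The dataframe is a toy stand-in for `adata.obs`, and the SMILES are well-known molecules used as placeholders, not the sciplex compounds.

```python
import pandas as pd

# Toy (drug_name -> SMILES) dictionary with placeholder molecules.
drug_to_smiles = {
    "ethanol": "CCO",
    "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
}

# Stand-in for adata.obs: one row per cell.
obs = pd.DataFrame(
    {"drug_name": ["ethanol", "aspirin", "ethanol"]},
    index=["cell_0", "cell_1", "cell_2"],
)

# Series.map looks every drug name up in the dictionary;
# names missing from the dictionary end up as NaN.
obs["smiles"] = obs["drug_name"].map(drug_to_smiles)
print(obs)
```

Mapping by name rather than by row position keeps the assignment correct even if the rows of `.obs` are reordered or subset.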

siboehm commented 2 years ago

There was a problem with the new dataset, resulting in JQ1 getting assigned a NaN SMILES (since it was renamed after the dict mapping had been applied). I fixed it, but don't have the permission to save the dataset. @MxMstrmn, can you look through it and just run the notebook again? That'll save the updated version to storage.
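
A check along these lines could catch this kind of mismatch before the dataset is saved. This is only a sketch: the column names `drug_name` and `smiles`, the helper `drugs_without_smiles`, and the toy drug names are assumptions for illustration and do not reproduce the actual fix.

```python
import pandas as pd


def drugs_without_smiles(obs, name_col="drug_name", smiles_col="smiles"):
    """Return the sorted drug names whose SMILES entry is missing (NaN)."""
    missing = obs.loc[obs[smiles_col].isna(), name_col]
    return sorted(set(missing))


# Placeholder dictionary, not the sciplex compounds.
drug_to_smiles = {"ethanol": "CCO", "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O"}

# One drug is stored under a name the dictionary does not know.
obs = pd.DataFrame({"drug_name": ["ethanol", "acetylsalicylic acid"]})
obs["smiles"] = obs["drug_name"].map(drug_to_smiles)
missing_before = drugs_without_smiles(obs)

# Harmonize the names first, then redo the mapping.
obs["drug_name"] = obs["drug_name"].replace({"acetylsalicylic acid": "aspirin"})
obs["smiles"] = obs["drug_name"].map(drug_to_smiles)
missing_after = drugs_without_smiles(obs)

print(missing_before, missing_after)
```

Running such a check as the last cell before writing the .h5ad file would make a name mismatch visible immediately instead of surfacing later as a NaN in the embeddings.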