theislab / chemCPA

Code for "Predicting Cellular Responses to Novel Drug Perturbations at a Single-Cell Resolution", NeurIPS 2022.
https://arxiv.org/abs/2204.13545
MIT License

Clarifications about LINCS data / evaluation #159

Open rvinas opened 8 months ago

rvinas commented 8 months ago

Hi, I would be grateful if you could clarify the following points regarding LINCS data:

  1. The README file mentions that you used data from Phase I and Phase II. What level (1-5)?
  2. Apart from cell line and compound information (perturbation ID and dose), did you consider any other variables when modeling this data?
  3. How did you compute the differentially expressed genes? Did you compute separate ranks for each cell line?
  4. In terms of evaluation, the baseline in the paper employs the expression decoded from the basal state (i.e. excluding perturbation information). Does this baseline preserve the ground-truth cell line information when decoding the basal state or is this information also removed?
  5. Did you consider computing the R2 scores between the raw control data (i.e. no autoencoder involved) and the ground-truth post-perturbation profiles predicted by chemCPA?

Any clarification on these points would be greatly appreciated.

rvinas commented 7 months ago

@MxMstrmn I would appreciate your answers to the questions above

MxMstrmn commented 6 months ago

Hi @rvinas,

  1. We use Level 2 data (the GEX equivalent).
  2. No, for the LINCS data, we only considered compounds, dosage, and cell line information
  3. We approximated the differentially expressed genes with this part of the notebook: 1_lincs.py, L93-L110
  4. The baseline is the composition of basal state + cell line information. Effectively, we simply check how similar the control distribution is to the perturbed state.
  5. No, I did not run this check myself; I relied on the original analysis of the Sci-Plex data.
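
For readers who want to try the raw-control check from point 5, here is a minimal sketch of how it could be set up. It is not taken from the chemCPA code base: the function names `r2_score` and `control_baseline_r2`, the synthetic data, and the choice to compare per-gene means are all assumptions made for illustration. It computes the R2 between the mean expression of control cells and the mean expression of perturbed cells, optionally restricted to a subset of genes such as the differentially expressed ones.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def control_baseline_r2(control, perturbed, gene_idx=None):
    """R2 between per-gene means of control and perturbed cells.

    Both inputs are (cells x genes) arrays for one cell line.
    `gene_idx` optionally restricts the comparison to a gene subset,
    e.g. differentially expressed genes (hypothetical helper, not
    part of chemCPA).
    """
    ctrl_mean = np.asarray(control, dtype=float).mean(axis=0)
    pert_mean = np.asarray(perturbed, dtype=float).mean(axis=0)
    if gene_idx is not None:
        ctrl_mean = ctrl_mean[gene_idx]
        pert_mean = pert_mean[gene_idx]
    # The perturbed mean is the target; the control mean is the "prediction".
    return r2_score(pert_mean, ctrl_mean)


# Synthetic example: 50 genes, the first 5 are shifted by the perturbation.
rng = np.random.default_rng(0)
n_genes = 50
base = rng.normal(size=n_genes)
control = base + 0.1 * rng.normal(size=(200, n_genes))
shift = np.zeros(n_genes)
shift[:5] = 2.0
perturbed = base + shift + 0.1 * rng.normal(size=(200, n_genes))

r2_all = control_baseline_r2(control, perturbed)
r2_deg = control_baseline_r2(control, perturbed, gene_idx=np.arange(5))
print(f"R2 on all genes: {r2_all:.3f}")
print(f"R2 on shifted genes only: {r2_deg:.3f}")
```

In this toy setup the score on the shifted genes is lower than the score on all genes, because unaffected genes are trivially well predicted by the control mean. A model's score is only informative relative to a baseline of this kind.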