Clarifications about LINCS data / evaluation

rvinas commented 8 months ago

Hi, I would be grateful if you could clarify the following points regarding LINCS data:

1. The README file mentions that you used data from Phase I and Phase II. What level (1-5)?
2. Apart from cell line and compound information (perturbation ID and dose), did you consider any other variables when modeling this data?
3. How did you compute the differentially expressed genes? Did you compute separate ranks for each cell line?
4. In terms of evaluation, the baseline in the paper employs the expression decoded from the basal state (i.e. excluding perturbation information). Does this baseline preserve the ground-truth cell line information when decoding the basal state or is this information also removed?
5. Did you consider computing the R2 scores between the raw control data (i.e. no autoencoder involved) and the ground-truth post-perturbation profiles predicted by chemCPA?

Any clarification on these points would be greatly appreciated.

rvinas commented 7 months ago

@MxMstrmn I would appreciate your answers to the questions above.

MxMstrmn commented 6 months ago

Hi @rvinas,

1. We use level two (the GEX equivalent).
2. No, for the LINCS data, we only considered compound, dosage, and cell line information.
3. We approximated the differentially expressed genes in this part of the notebook: 1_lincs.py, L93-L110.
4. The baseline is the composition of basal state + cell line information. Effectively, we simply check how similar the control distribution is to the perturbed state.
5. No, I did not make this check myself but relied on the original analysis in the Sci-Plex data.
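
For readers who want to try the check raised in the last question, here is a minimal sketch of an R2 score between raw control expression and the observed perturbed expression, with no autoencoder involved. It is not taken from the chemCPA code base. The function names (`r2_score`, `control_baseline_r2`), the optional `gene_idx` argument for restricting to differentially expressed genes, and the choice to compare per-gene means within a single cell line are all assumptions made for illustration, and the data is synthetic.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def control_baseline_r2(control, perturbed, gene_idx=None):
    """R2 between per-gene mean control and mean perturbed expression.

    control, perturbed: (samples x genes) arrays from the same cell line
    (an assumption of this sketch). gene_idx optionally restricts the
    comparison to a subset of genes, e.g. differentially expressed ones.
    """
    ctrl_mean = np.asarray(control, dtype=float).mean(axis=0)
    pert_mean = np.asarray(perturbed, dtype=float).mean(axis=0)
    if gene_idx is not None:
        ctrl_mean = ctrl_mean[gene_idx]
        pert_mean = pert_mean[gene_idx]
    # The control mean acts as the "prediction" for the perturbed mean.
    return r2_score(pert_mean, ctrl_mean)


# Synthetic example: 50 genes, the first 5 respond to the perturbation.
rng = np.random.default_rng(0)
n_genes = 50
base = rng.normal(size=n_genes)
control = base + 0.1 * rng.normal(size=(20, n_genes))
effect = np.zeros(n_genes)
effect[:5] = 2.0
perturbed = base + effect + 0.1 * rng.normal(size=(20, n_genes))

r2_all = control_baseline_r2(control, perturbed)
r2_de = control_baseline_r2(control, perturbed, gene_idx=np.arange(5))
print(f"R2 over all genes: {r2_all:.3f}")
print(f"R2 over responding genes only: {r2_de:.3f}")
```

Reporting the score both over all genes and over the responding subset shows how much of a model's apparent performance is already explained by the unperturbed control profile.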

theislab / chemCPA