snap-stanford / GEARS

GEARS is a geometric deep learning model that predicts outcomes of novel multi-gene perturbations

split questions #16

Closed wconnell closed 1 year ago

wconnell commented 1 year ago

Hi there, I'm wondering what exactly the "simulation" split means? There are a few different splits that I can't quite infer without a definition.

Also, how are ctrl cells paired with perturbed cells for each example?

Additionally, are the downloaded datasets preprocessed (normalized TPM and log transformed, looks like it)?

Overall, I'm trying to get this information annotated in the pert_data.adata.obs dataframe rather than use the GEARS data representations.

Thanks!

yhr91 commented 1 year ago

Thanks for your questions.

Data splits

The different splits are described in the preprint, but we will make a note to add more details to the repo.

https://github.com/snap-stanford/GEARS/blob/c7ca19cbcd6f4da3030d0ebc90b2c2cd0b47a8d8/gears/pertdata.py#L192-L194

combo_seen0 creates a test set consisting only of 2-gene perturbations in which neither gene has been seen perturbed individually during training. combo_seen1 creates a test set in which one of the two genes in every test perturbation has been seen perturbed during training, and combo_seen2 creates one in which both genes have. single creates a test set of single-gene perturbations that have not been seen during training. simulation is a special type of split that ensures all four of these categories are included in the same test set. Of course, this only applies to datasets that contain 2-gene perturbations. More details are in lines 473-486 of the preprint and in this figure from the Supplementary Information, which helps illustrate the procedure.

no_split puts everything in the test set and no_test puts everything in the training set.
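To make the categories above concrete, here is a minimal sketch that labels a perturbation by how many of its genes were seen perturbed in training. This is not the GEARS implementation: the function name, the `single_seen`/`single_unseen`/`control` labels, and the `"GENE1+GENE2"` / `"GENE+ctrl"` naming of perturbations are assumptions made for illustration.

```python
def classify_perturbation(perturbation, train_genes):
    """Label a perturbation by how many of its genes were perturbed in training.

    Assumes (hypothetically) that perturbations are written "A+B" for two genes
    and "A+ctrl" or "ctrl+A" for one gene.
    """
    genes = [g for g in perturbation.split("+") if g != "ctrl"]
    if not genes:
        return "control"
    n_seen = sum(g in train_genes for g in genes)
    if len(genes) == 1:
        # The "single" test set holds single-gene perturbations unseen in training.
        return "single_unseen" if n_seen == 0 else "single_seen"
    # Two-gene perturbation: 0, 1 or 2 of its genes were seen during training.
    return f"combo_seen{n_seen}"


# Toy example: genes A and B were perturbed during training.
train_genes = {"A", "B"}
labels = {p: classify_perturbation(p, train_genes)
          for p in ["A+B", "A+C", "C+D", "C+ctrl", "ctrl"]}
```

A simulation-style test set would then contain perturbations from each of the `combo_seen0`, `combo_seen1`, `combo_seen2` and `single_unseen` groups.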

[Figure: screenshot from the Supplementary Information illustrating the split procedure]

Pre-processing

In the current version of GEARS, control cells are paired randomly with perturbed cells. Yes, the datasets are normalized and log transformed, and subsetted to the top 5000 highly variable genes (plus any genes that have been perturbed).

https://github.com/snap-stanford/GEARS/blob/c7ca19cbcd6f4da3030d0ebc90b2c2cd0b47a8d8/demo/data_tutorial.ipynb#L230-L232
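For anyone who wants to record a control pairing alongside their own annotations, the random pairing mentioned above can be sketched roughly as follows. This is an illustration only, not the GEARS code: the function name, the cell identifiers, and the choice to sample controls with replacement are all assumptions.

```python
import random


def pair_with_controls(perturbed_ids, control_ids, seed=0):
    """Assign each perturbed cell a randomly chosen control cell.

    Controls are drawn with replacement (an assumption), so one control
    cell may be paired with several perturbed cells.
    """
    rng = random.Random(seed)  # fixed seed makes the pairing reproducible
    return {cell: rng.choice(control_ids) for cell in perturbed_ids}


# Hypothetical cell barcodes for illustration.
perturbed = ["pert_cell_1", "pert_cell_2", "pert_cell_3"]
controls = ["ctrl_cell_1", "ctrl_cell_2"]
pairing = pair_with_controls(perturbed, controls, seed=42)
```

The resulting dictionary maps each perturbed cell to its control and could be stored as an extra column next to the cell annotations.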