split questions - Githubissues

Thanks for your questions.

Data splits

The different splits are described in the preprint, but we will make a note to add more details to the repo.

https://github.com/snap-stanford/GEARS/blob/c7ca19cbcd6f4da3030d0ebc90b2c2cd0b47a8d8/gears/pertdata.py#L192-L194

combo_seen0 creates a test set consisting only of 2-gene perturbations where neither of the two genes has been seen perturbed individually during training. combo_seen1 creates a test set of 2-gene perturbations where one of the two genes has been seen perturbed during training, and combo_seen2 creates one where both genes have been seen perturbed during training. single creates a test set of single-gene perturbations that have not been seen during training. simulation is a special type of split that ensures all four of these categories are included in the same test set. Of course, this only applies to datasets that contain 2-gene perturbations. More details are in lines 473-486 of the preprint, and a figure in the Supplementary Information helps illustrate the procedure.
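To make the categories concrete, here is a minimal sketch of how a test perturbation could be labelled given the training set. It is not the code in pertdata.py: the function names, the tuple representation of perturbations, and the choice to count a gene as "seen" only when it appears as a single-gene perturbation in training are all assumptions made for illustration.

```python
def seen_single_genes(train_perturbations):
    """Genes that appear as single-gene perturbations in training (assumed definition of 'seen')."""
    return {p[0] for p in train_perturbations if len(p) == 1}


def categorize(test_perturbation, train_perturbations):
    """Return the split category a test perturbation would fall into.

    Perturbations are tuples of gene names (hypothetical representation).
    Returns None for a single-gene perturbation already seen in training,
    since it would not belong in a 'single' test set.
    """
    seen = seen_single_genes(train_perturbations)
    if len(test_perturbation) == 1:
        return None if test_perturbation[0] in seen else "single"
    if len(test_perturbation) == 2:
        # Count how many of the two genes were seen perturbed in training.
        n_seen = sum(gene in seen for gene in test_perturbation)
        return f"combo_seen{n_seen}"
    raise ValueError("only 1-gene and 2-gene perturbations are handled")


train = [("A",), ("B",), ("C",)]
labels = {p: categorize(p, train) for p in [("A", "B"), ("A", "D"), ("D", "E"), ("D",)]}
```

With this toy training set, ("A", "B") is labelled combo_seen2, ("A", "D") combo_seen1, ("D", "E") combo_seen0, and ("D",) single.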

no_split puts everything in the test set and no_test puts everything in the training set.

Pre-processing

In the current version of GEARS, control cells are paired randomly with perturbed cells. Yes, the datasets are normalized and log-transformed, then subset to the top 5000 highly variable genes (plus the genes that have been perturbed).
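The steps above can be sketched roughly as follows, using only the standard library. This is an illustration under assumptions, not the GEARS pipeline: the function names, the target_sum value, ranking genes by a precomputed variability score, and sampling control cells with replacement are all hypothetical choices.

```python
import math
import random


def normalize_log1p(counts, target_sum=1e4):
    """Scale one cell's counts to a common total, then apply log(1 + x).

    target_sum is an assumed value; the reply above does not state one.
    """
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.log1p(c * target_sum / total) for c in counts]


def select_genes(variability, perturbed_genes, n_top=5000):
    """Keep the n_top most variable genes plus any perturbed genes present in the data.

    variability maps gene name -> variability score (how it is computed is left open).
    """
    ranked = sorted(variability, key=variability.get, reverse=True)
    keep = set(ranked[:n_top])
    keep.update(g for g in perturbed_genes if g in variability)
    return keep


def pair_controls(perturbed_cells, control_cells, seed=0):
    """Pair each perturbed cell with a randomly drawn control cell (with replacement, assumed)."""
    rng = random.Random(seed)
    return [(cell, rng.choice(control_cells)) for cell in perturbed_cells]


kept = select_genes({"g1": 3.0, "g2": 2.0, "g3": 1.0}, ["g3"], n_top=2)
pairs = pair_controls(["p1", "p2", "p3"], ["c1", "c2"])
```

In the toy call, g3 is kept even though it falls outside the top 2, because it is a perturbed gene.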

https://github.com/snap-stanford/GEARS/blob/c7ca19cbcd6f4da3030d0ebc90b2c2cd0b47a8d8/demo/data_tutorial.ipynb#L230-L232

snap-stanford / GEARS

split questions #16