Closed wconnell closed 1 year ago
Thanks for your questions.
Data splits

The different splits are described in the preprint, but we will make a note to add more details to the repo.

combo_seen0: creates a test set consisting only of 2-gene perturbations where neither of the 2 genes in any test perturbation has been seen perturbed individually during training.
combo_seen1: creates a test set of 2-gene perturbations where 1 of the 2 genes has been seen perturbed during training.
combo_seen2: creates a test set where both genes involved in every test perturbation have been seen perturbed during training.
single: creates a test set of single-gene perturbations that have not been seen during training.
simulation: a special type of split that ensures all four of the categories above are included in the same test set. Of course, this only applies to datasets that contain 2-gene perturbations. More details are in lines 473-486 of the preprint and in this figure from the Supplementary Information, which helps illustrate the procedure.
no_split: puts everything in the test set.
no_test: puts everything in the training set.
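As a rough illustration of the categories above, here is a minimal sketch that labels a test perturbation by how many of its genes were seen perturbed during training. It is not the GEARS implementation: the function name `classify_perturbation` is hypothetical, and the condition naming (`A+B` for a 2-gene perturbation, `A+ctrl` for a single-gene one) is an assumption about how conditions are written.

```python
def classify_perturbation(condition, train_genes):
    """Label a test condition with one of the split categories.

    condition   : string such as "A+B" or "A+ctrl" (assumed naming convention)
    train_genes : set of genes seen perturbed during training
    """
    # Drop the "ctrl" placeholder so only the perturbed genes remain.
    genes = [g for g in condition.split("+") if g != "ctrl"]
    if not genes:
        return "ctrl"  # unperturbed control, not a split category
    n_seen = sum(g in train_genes for g in genes)
    if len(genes) == 1:
        # "single" means the gene was never perturbed in training;
        # "single_seen" is a hypothetical label for the remaining case.
        return "single" if n_seen == 0 else "single_seen"
    # 2-gene perturbation: count how many of its genes were seen.
    return f"combo_seen{n_seen}"
```

Under these assumptions, a `simulation`-style test set would contain conditions from all four of `single`, `combo_seen0`, `combo_seen1`, and `combo_seen2`.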
Pre-processing

In the current version of GEARS, control cells are paired randomly with perturbed cells. Yes, the datasets are normalized and log-transformed, and subset to the top 5000 highly variable genes (plus any genes that have been perturbed).
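To make the random pairing concrete, here is a minimal sketch that assigns each perturbed cell a randomly chosen control cell. It only illustrates the idea and is not the GEARS code: the function name `pair_controls` is hypothetical, and sampling with replacement is an assumption (the description above only says the pairing is random).

```python
import numpy as np

def pair_controls(is_control, seed=0):
    """Map each perturbed cell index to a randomly drawn control cell index."""
    rng = np.random.default_rng(seed)
    is_control = np.asarray(is_control, dtype=bool)
    ctrl_idx = np.flatnonzero(is_control)    # positions of control cells
    pert_idx = np.flatnonzero(~is_control)   # positions of perturbed cells
    # Assumption: controls are drawn with replacement, so one control
    # cell may be paired with several perturbed cells.
    paired = rng.choice(ctrl_idx, size=pert_idx.size, replace=True)
    return dict(zip(pert_idx.tolist(), paired.tolist()))

# Cells 0 and 2 are controls; cells 1, 3 and 4 are perturbed.
pairs = pair_controls([True, False, True, False, False])
```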
Hi there, I'm wondering what exactly the "simulation" split means? There are a few different splits that I can't quite infer without a definition.
Also, how are ctrl cells paired with perturbed cells for each example?
Additionally, are the downloaded datasets preprocessed (normalized TPM and log transformed, looks like it)?
Overall, I'm trying to get this information annotated in the pert_data.adata.obs dataframe rather than use the GEARS data representations. Thanks!
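One possible way to get split labels into an `obs`-style dataframe is sketched below with plain pandas. The helper `annotate_split`, the `condition` column name, and the shape of the split-to-conditions mapping are all assumptions for illustration; whether GEARS exposes a mapping of this shape after preparing a split should be checked against the package itself.

```python
import pandas as pd

def annotate_split(obs, split_to_conditions, condition_col="condition"):
    """Return a copy of obs with a 'split' column derived from each cell's condition."""
    # Invert {split: [conditions]} into {condition: split}.
    cond_to_split = {
        cond: split
        for split, conds in split_to_conditions.items()
        for cond in conds
    }
    out = obs.copy()
    # Conditions missing from the mapping are left as NaN.
    out["split"] = out[condition_col].map(cond_to_split)
    return out

# Toy stand-in for an AnnData obs table (hypothetical cell and gene names).
obs = pd.DataFrame(
    {"condition": ["ctrl", "A+ctrl", "A+B", "C+ctrl"]},
    index=["cell1", "cell2", "cell3", "cell4"],
)
splits = {"train": ["ctrl", "A+ctrl"], "test": ["A+B"]}
annotated = annotate_split(obs, splits)
```

The same function could be applied to a real `obs` dataframe, provided the condition column and the mapping use matching condition names.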