snap-stanford / GEARS

GEARS is a geometric deep learning model that predicts outcomes of novel multi-gene perturbations
MIT License

Perturbations are not in the GO graph #54

Closed niklasbinder closed 6 months ago

niklasbinder commented 6 months ago

Hello, I encountered the following issue. I created my own adata object with 4 perturbed transcription factors, but my perturbations are not in the GO graph. How do I fix this? Here is the code:

    from gears import PertData

    pert_data = PertData('./data') # specific saved folder
    pert_data.new_data_process(dataset_name = 'men', adata = adata) # specific dataset name and adata object
    pert_data.load(data_path = './data/men') # load the processed data, the path is saved folder + dataset_name
    pert_data.prepare_split(split = 'simulation', seed = 1) # get data split with seed
    pert_data.get_dataloader(batch_size = 32, test_batch_size = 128) # prepare data loader

    Found local copy...
    Found local copy...
    Creating pyg object for each cell in the data...
    Creating dataset file...
    100%|██████████| 5/5 [00:25<00:00, 5.13s/it]
    Done!
    Saving new dataset pyg object at ./data/men/data_pyg/cell_graphs.pkl
    Done!
    Found local copy...
    These perturbations are not in the GO graph and their perturbation can thus not be predicted
    []
    Local copy of pyg dataset is detected. Loading...
    Done!
    Local copy of split is detected. Loading...
    Simulation split test composition:
    combo_seen0:0
    combo_seen1:0
    combo_seen2:0
    unseen_single:0
    Done!

here1


    KeyError                                  Traceback (most recent call last)

    in <cell line: 7>()
          5 pert_data.load(data_path = './data/men') # load the processed data, the path is saved folder + dataset_name
          6 pert_data.prepare_split(split = 'simulation', seed = 1) # get data split with seed
    ----> 7 pert_data.get_dataloader(batch_size = 32, test_batch_size = 128) # prepare data loader

    /usr/local/lib/python3.10/dist-packages/gears/pertdata.py in get_dataloader(self, batch_size, test_batch_size)
        453         for i in splits:
        454             cell_graphs[i] = []
    --> 455             for p in self.set2conditions[i]:
        456                 cell_graphs[i].extend(self.dataset_processed[p])
        457

    KeyError: 'val'

XiaoMi93 commented 6 months ago

When I provided only the unperturbed data, I ran into the same problem as you did. I guess this is because, when you provide a new dataset, the model needs to be trained again, using perturbations encoded in the GO graph as the validation set. I wonder whether I can provide only the unperturbed data and use the trained model to predict the perturbations directly.

yhr91 commented 6 months ago

Hi, thanks for your question @niklasbinder. I don't think the problem here is that the perturbations are not in the GO graph. If that were the case, those perturbations would have been listed after the message "These perturbations are not in the GO graph and their perturbation can thus not be predicted", and the list printed in your output is empty ([]).

I think the issue is that you have very few perturbations, and this is affecting the ability to create a meaningful data split. Can you share more about what your adata looks like and what the purpose of this training is (i.e. do you want to predict new, unseen perturbations)? In any case, training on just 4 perturbations may not work well.
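
One way to check for this situation before building the data loaders is to count the distinct perturbation conditions and to confirm that every split actually received some of them. The following is only a minimal, library-independent sketch. It assumes GEARS-style condition labels (`ctrl` for control cells, `GENE+ctrl` or `GENE1+GENE2` for perturbed cells) and assumes the split names are `train`, `val` and `test`, as the `KeyError: 'val'` in the traceback suggests. The helper names `summarize_conditions` and `missing_splits` and the toy labels are hypothetical and not part of GEARS.

```python
from collections import Counter


def summarize_conditions(conditions):
    """Count cells per condition and collect the perturbed genes.

    Assumes GEARS-style labels: 'ctrl', 'GENE+ctrl', or 'GENE1+GENE2'.
    """
    cells_per_condition = Counter(conditions)
    singles, combos, genes = set(), set(), set()
    for cond in cells_per_condition:
        perturbed = [g for g in cond.split("+") if g != "ctrl"]
        if not perturbed:
            continue  # control cells carry no perturbation
        genes.update(perturbed)
        (singles if len(perturbed) == 1 else combos).add(cond)
    return {
        "cells_per_condition": dict(cells_per_condition),
        "single_conditions": sorted(singles),
        "combo_conditions": sorted(combos),
        "perturbed_genes": sorted(genes),
    }


def missing_splits(set2conditions, required=("train", "val", "test")):
    """Return the required split names that are absent or empty."""
    return [name for name in required if not set2conditions.get(name)]


# Hypothetical toy labels: two single perturbations plus control cells.
labels = ["ctrl"] * 3 + ["TF1+ctrl"] * 2 + ["TF2+ctrl"] * 2
summary = summarize_conditions(labels)
print(summary["perturbed_genes"])  # ['TF1', 'TF2']

# A split dictionary without a validation set, as in the traceback above.
print(missing_splits({"train": ["TF1+ctrl"], "test": ["TF2+ctrl"]}))  # ['val']
```

With real data, the labels would presumably come from the condition column of `adata.obs` and the split dictionary from `pert_data.set2conditions` (the attribute that appears in the traceback). A non-empty result from `missing_splits` would point to too few perturbations for the requested split rather than to a GO graph problem.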

niklasbinder commented 6 months ago

Thank you @yhr91 for your response! I intend to predict new, unseen perturbations of transcription factors that I identified with other tools. I tried different perturbations and it seems to work now, and it works even better if I also increase the number of perturbations. Is there an efficient way to show the effect of the perturbations on specific clusters, or to plot how the effects differ between clusters?

yhr91 commented 6 months ago

Sorry, I don't have great answers to your final question. This is an active area of research for us as well. Conventional approaches such as differential expression that go beyond qualitative analyses over UMAP are often the most effective. Good luck!
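
As a very simple starting point for the kind of comparison mentioned above, one could compute, for a single gene, the shift in mean expression between perturbed and control cells within each cluster. The following is only an illustrative sketch in plain Python. The helper name `per_cluster_effect` and the toy values are hypothetical, and a mean difference is not a replacement for a proper differential expression test.

```python
from collections import defaultdict
from statistics import mean


def per_cluster_effect(expression, clusters, conditions, control="ctrl"):
    """Mean expression shift (perturbed minus control) per cluster and condition.

    expression: one value per cell for a single gene
    clusters:   cluster label per cell
    conditions: condition label per cell
    """
    # Group expression values by (cluster, condition).
    grouped = defaultdict(list)
    for value, cluster, cond in zip(expression, clusters, conditions):
        grouped[(cluster, cond)].append(value)

    effects = {}
    for (cluster, cond), values in grouped.items():
        if cond == control:
            continue
        baseline = grouped.get((cluster, control))
        if not baseline:
            continue  # no control cells in this cluster, nothing to compare
        effects[(cluster, cond)] = mean(values) - mean(baseline)
    return effects


# Hypothetical toy data: one gene, two clusters, one perturbation.
expression = [1.0, 1.0, 3.0, 3.0, 2.0, 2.0, 2.5, 2.5]
clusters = ["A", "A", "A", "A", "B", "B", "B", "B"]
conditions = ["ctrl", "ctrl", "TF1+ctrl", "TF1+ctrl"] * 2

effects = per_cluster_effect(expression, clusters, conditions)
for key in sorted(effects):
    print(key, effects[key])
```

In this toy example the perturbation shifts the gene more strongly in cluster A than in cluster B, which is the kind of per-cluster difference that could then be plotted or tested formally.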