snap-stanford / GEARS

GEARS is a geometric deep learning model that predicts outcomes of novel multi-gene perturbations
MIT License

Trouble using the outputs obtained from the GEARS model #49

Closed fratajcz closed 6 months ago

fratajcz commented 6 months ago

Hi!

First of all, thanks for producing reproducible code that allows researchers to use your method. However, for such a high-profile publication I would have expected more detailed documentation. I am trying to simulate perturbations of individual genes and am stuck at the last step, evaluating the outputs. So far I couldn't find any mention of it in the tutorials.

After training and loading the model, typing

gears_model.predict([["PARK7"]])

gives me the following output:

{'PARK7': array([9.3185983e-04, 1.2710856e-02, 4.3317977e-02, ..., 3.7023327e+00,
       4.8783151e-03, 2.0390714e-06], dtype=float32)}

This array has shape (5045,). I understand that this vector should contain the expression changes of 5045 genes. How can I find out which value corresponds to which gene? This information would be very helpful.

Secondly, if I try to plot the perturbation with the following line:

gears_model.plot_perturbation("PARK7", "PARK7.png")

I get the following error:

Traceback (most recent call last):
  File "/home/ubuntu/gears/testrun.py", line 34, in <module>
    gears_model.plot_perturbation("PARK7", "PARK7.png")
  File "/mnt/storage/anaconda3/envs/gears/lib/python3.10/site-packages/gears/gears.py", line 441, in plot_perturbation
    adata.uns['top_non_dropout_de_20'][cond2name[query]]]
KeyError: 'PARK7'

PARK7 must be in the gene set, otherwise the previous prediction wouldn't have worked. I also tried PARK7+ctrl and ctrl+PARK7 but got the same error. Any help on how I could get this running?
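
The traceback shows that the failure is a dictionary lookup, either cond2name[query] or the lookup into adata.uns['top_non_dropout_de_20']. A minimal diagnostic sketch follows, assuming cond2name is a plain dict keyed by condition strings (its real contents are not shown in this thread). It lists the keys that mention a given gene, so the exact spelling to pass as the query can be checked. find_condition_keys and the stand-in dictionary are hypothetical and not part of GEARS.

```python
def find_condition_keys(mapping, gene):
    """Return keys whose '+'-separated parts include `gene` (hypothetical helper)."""
    return [key for key in mapping if gene in str(key).split("+")]


# Stand-in for cond2name; keys and values are made up for illustration.
example_cond2name = {
    "GENE_A+ctrl": "name_1",
    "ctrl+GENE_B": "name_2",
    "GENE_A+GENE_B": "name_3",
}

hits_a = find_condition_keys(example_cond2name, "GENE_A")
hits_park7 = find_condition_keys(example_cond2name, "PARK7")

print("GENE_A:", hits_a)
print("PARK7:", hits_park7)
```

An empty result for the gene of interest would be consistent with the KeyError appearing for every spelling tried, since no key in the mapping mentions that gene.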

fratajcz commented 6 months ago

I have now found the answer to the first part of my question: the gene names are accessible under pert_data.gene_names, but only after calling pert_data.get_dataloader(...). The possible query genes can be found in gears_model.pert_list.
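
One way to pair each predicted value with its gene is sketched below. It assumes that the order of pert_data.gene_names matches the order of the prediction vector, which this thread implies but does not confirm. label_prediction is a hypothetical helper, and the short lists stand in for the real 5045-element outputs.

```python
def label_prediction(values, gene_names):
    """Pair each predicted value with its gene name (hypothetical helper)."""
    values = list(values)
    gene_names = list(gene_names)
    if len(values) != len(gene_names):
        raise ValueError(
            f"length mismatch: {len(values)} values vs {len(gene_names)} gene names"
        )
    return dict(zip(gene_names, values))


# Stand-in data. In practice these would come from:
#   values = gears_model.predict([["PARK7"]])["PARK7"]
#   gene_names = pert_data.gene_names  # after pert_data.get_dataloader(...)
example_values = [0.00093, 0.0127, 0.0433]
example_genes = ["GENE_A", "GENE_B", "GENE_C"]  # made-up names

labeled = label_prediction(example_values, example_genes)

# Gene with the largest predicted value in the stand-in data.
top_gene = max(labeled, key=labeled.get)
print(top_gene, labeled[top_gene])
```

The length check is there because a silent mismatch between the two sequences would assign values to the wrong genes.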

However, I now have another question, regarding the output of gears_model.GI_predict(). It can look like the output below:


{'ts': TheilSenRegressor(fit_intercept=False, max_iter=1000,
                   max_subpopulation=100000.0, random_state=1000),
 'c1': 0.31509945367144776,
 'c2': 1.5384396236436881,
 'mag': 1.5703770697832344,
 'dcor': 0.903160207477114,
 'dcor_singles': 0.8644400546004193,
 'dcor_first': 0.8598795792766578,
 'dcor_second': 0.8826629962759972,
 'corr_fit': 0.8837036648607223,
 'dominance': 0.6886328067658705,
 'eq_contr': 0.9741878643429441}

I was expecting values for the individual GI classes shown in Figure 5 in the paper. How does this dictionary tell me whether the genetic interaction is suppression, synergy, epistasis, etc.? Thank you for clarifying.

pckinnunen commented 6 months ago

I'm sure the authors can clarify, but Table 1 and Supplementary Notes 15 and 16 explain this.

fratajcz commented 6 months ago

Thanks @pckinnunen, this actually answers my question. It would, however, be great for the adoption of this method if the documentation contained a pointer to the table and the supplementary notes.

zhoummmin commented 5 months ago

Very glad to see your comments! Have you managed to resolve the second issue, the KeyError: 'PARK7' raised by gears_model.plot_perturbation? I am running into the same problem. Thank you very much!

ManuelMoradiellos commented 3 weeks ago

Hi! I also wanted to give GEARS a shot. The code was useful and reproducible, but I agree with @fratajcz that the documentation is unclear about how to interpret the results.

The supplementary material does list the metrics and thresholds the authors use to classify a given interaction into one of the five subtypes (Table 2 in Supp. Note 16). However, I still haven't seen a recommendation for what to do when an interaction falls under two or more subtypes, which has happened to me.

Has anyone defined a criterion worth adopting? Thanks in advance.
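
Until a criterion is settled, one option is to report every subtype whose rule matches instead of forcing a single label. The sketch below illustrates only that bookkeeping. The metric values are copied from the GI_predict output quoted earlier in this thread, but matching_subtypes is a hypothetical helper and the cut-offs in placeholder_rules are invented for illustration. They are not the thresholds from the paper's supplementary table and would need to be replaced with the published ones.

```python
def matching_subtypes(metrics, rules):
    """Return every subtype whose rule accepts the metrics, in rule order."""
    return [name for name, rule in rules.items() if rule(metrics)]


# Values copied from the GI_predict output quoted earlier in this thread.
gi_metrics = {
    "mag": 1.5703770697832344,
    "dcor": 0.903160207477114,
    "dominance": 0.6886328067658705,
    "eq_contr": 0.9741878643429441,
}

# PLACEHOLDER rules: these cut-offs are invented for illustration and are
# NOT the thresholds from the paper's supplementary table.
placeholder_rules = {
    "synergy": lambda m: m["mag"] > 1.0,
    "suppression": lambda m: m["mag"] < 1.0,
    "epistasis": lambda m: m["dominance"] > 0.5,
}

matches = matching_subtypes(gi_metrics, placeholder_rules)

if len(matches) > 1:
    print("ambiguous:", matches)
elif matches:
    print("single match:", matches[0])
else:
    print("no subtype matched")
```

Keeping the full list of matches makes ambiguous cases visible, so a tie-breaking rule can be applied later without rerunning the model.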