snap-stanford / GEARS

GEARS is a geometric deep learning model that predicts outcomes of novel multi-gene perturbations
MIT License

Question: GO-network(s) with hierarchy separation? #10

Closed: JulianRein closed 1 year ago

JulianRein commented 1 year ago

Hi, I noticed that the current GO network is constructed from the overlap (Jaccard index) between genes over ALL GO terms, regardless of GO hierarchy (biological process, cellular component, molecular function). As a result, isoforms, for example, get a very high score, since they naturally overlap in all three categories at once. Connections by pathway, however, seem to be drowned out, at least in my example: even direct partners in the (well-researched and basic) glycolysis¹ pathway, like HK (e.g. HK1) and PGI, do not make the cutoff value (0.1). However, you write that you tried other networks in place of GO, like "protein-protein interaction network [21], a gene coessentiality network [41] or the gene co-expression network". So I would be interested in your take on this:

Do you think a separation of GO-networks by hierarchy (and maybe some filtering for term size) and the resulting highlighting of direct interaction partners would help the gene-gene transferability of predictions/perturbations?

Thank you!

PS: If you are interested in ~40% faster code (or a lot faster with joblib.Parallel) for the GO network generation, let me know. Can send a merge request.

¹ https://en.wikipedia.org/wiki/Glycolysis
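To make the effect described in this comment concrete, here is a minimal sketch on invented data. The gene names, term IDs, and the helpers `jaccard` and `jaccard_edges` are hypothetical and not taken from the GEARS code base; only the Jaccard index and the 0.1 cutoff come from the discussion. Two "isoforms" that share annotations in all three hierarchies score high, while two pathway partners that share a single biological-process term fall just below the cutoff.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b| of two term sets (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_edges(gene2go, cutoff=0.1):
    """Return (gene_a, gene_b, score) for every gene pair scoring above the cutoff."""
    edges = []
    for g1, g2 in combinations(sorted(gene2go), 2):
        score = jaccard(gene2go[g1], gene2go[g2])
        if score > cutoff:
            edges.append((g1, g2, score))
    return edges


# Invented annotations; the BP/CC/MF prefixes only mark the hierarchy.
toy_gene2go = {
    "ISO_A": {"BP:1", "BP:2", "CC:1", "CC:2", "MF:1", "MF:2"},
    "ISO_B": {"BP:1", "BP:2", "CC:1", "CC:2", "MF:1", "MF:3"},
    "ENZ_1": {"BP:pathway", "BP:10", "CC:10", "CC:11", "MF:10", "MF:11"},
    "ENZ_2": {"BP:pathway", "BP:20", "CC:20", "CC:21", "MF:20", "MF:21"},
}

edges = jaccard_edges(toy_gene2go)
```

In this toy setup the isoform pair scores 5/7 and becomes an edge, while the pathway pair scores 1/11 and is dropped.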

yhr91 commented 1 year ago

Thanks for your comment. We actually tested this, and varying the GO hierarchy did indeed impact the predicted perturbation outcomes. In our case the difference wasn't too significant, but this may vary depending on the specific use case and the gene sets being perturbed.
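For readers who want to try hierarchy separation themselves, one possible approach is sketched below. This is not the GEARS implementation or the experiment referred to above; `term2namespace`, `split_by_namespace`, `edges_above`, and the toy data are hypothetical. In practice the hierarchy of each term would come from the ontology itself, for example the `namespace` field of each term in the GO OBO file.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two term sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def split_by_namespace(gene2go, term2namespace):
    """Split one gene -> terms mapping into one mapping per GO hierarchy."""
    per_ns = defaultdict(dict)
    for gene, terms in gene2go.items():
        for term in terms:
            per_ns[term2namespace[term]].setdefault(gene, set()).add(term)
    return dict(per_ns)


def edges_above(gene2go, cutoff=0.1):
    """Jaccard edges {(gene_a, gene_b): score} that exceed the cutoff."""
    edges = {}
    for g1, g2 in combinations(sorted(gene2go), 2):
        score = jaccard(gene2go[g1], gene2go[g2])
        if score > cutoff:
            edges[(g1, g2)] = score
    return edges


# Invented annotations: two pathway partners sharing one BP term.
toy_gene2go = {
    "ENZ_1": {"BP:pathway", "BP:10", "CC:10", "CC:11", "MF:10", "MF:11"},
    "ENZ_2": {"BP:pathway", "BP:20", "CC:20", "CC:21", "MF:20", "MF:21"},
}
# Assumption for the toy data only: the ID prefix encodes the hierarchy.
term2namespace = {t: t.split(":")[0] for terms in toy_gene2go.values() for t in terms}

combined_edges = edges_above(toy_gene2go)          # all hierarchies pooled
per_ns = split_by_namespace(toy_gene2go, term2namespace)
bp_edges = edges_above(per_ns["BP"])               # biological process only
```

With the hierarchies pooled, the pair stays below the cutoff; restricted to the biological-process terms, it scores 1/3 and becomes an edge. Whether the per-hierarchy networks are then merged, or only one of them is used, is a modelling choice this sketch leaves open.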

giladmishne commented 1 year ago

@JulianRein you can also consider weighting terms by frequency. I found that something like the following works a bit better in get_go:

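import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# gene2go maps each gene to an iterable of GO term IDs.
genes = list(gene2go.keys())
df = pd.DataFrame({'terms': [' '.join(v) for v in gene2go.values()]}, index=genes)

# Treat each whitespace-separated term ID as one token. The default token
# pattern would split an ID such as 'GO:0008150' at the colon and lowercase it.
tfidf = TfidfVectorizer(token_pattern=r'\S+', lowercase=False)

# Get GO term weights.
x = tfidf.fit_transform(df.terms)

# Pairwise weighted similarity (rows are L2-normalized, so this is cosine similarity).
df_sim = pd.DataFrame((x @ x.T).toarray(), index=genes, columns=genes).round(5).stack()

# Convert to the same format as `edge_list`; threshold may need to be different from 0.1.
tf_idf_edge_list = df_sim[df_sim > 0.1].rename_axis(('source', 'target')).reset_index(name='importance')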
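
To illustrate why frequency weighting helps, the following dependency-free sketch mimics the idea on invented data. Each term gets a smoothed inverse document frequency weight, ln((1 + n) / (1 + df)) + 1, which is the formula scikit-learn documents for its default `smooth_idf=True`, and genes are compared by cosine similarity. The gene names, term IDs, and function names are hypothetical, and each term is assumed to occur at most once per gene.

```python
import math
from collections import Counter


def idf_weights(gene2go):
    """Smoothed IDF per term: ln((1 + n) / (1 + df)) + 1."""
    n = len(gene2go)
    df = Counter(term for terms in gene2go.values() for term in set(terms))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def weighted_cosine(terms_a, terms_b, idf):
    """Cosine similarity of two term sets under the given IDF weights."""
    a, b = set(terms_a), set(terms_b)
    dot = sum(idf[t] ** 2 for t in a & b)
    norm_a = math.sqrt(sum(idf[t] ** 2 for t in a))
    norm_b = math.sqrt(sum(idf[t] ** 2 for t in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented annotations: A and B share a rarer term, C and D only a ubiquitous one.
toy_gene2go = {
    "A": {"GO:common", "GO:rare", "GO:a"},
    "B": {"GO:common", "GO:rare", "GO:b"},
    "C": {"GO:common", "GO:c1", "GO:c2"},
    "D": {"GO:common", "GO:d1", "GO:d2"},
}

idf = idf_weights(toy_gene2go)
sim_ab = weighted_cosine(toy_gene2go["A"], toy_gene2go["B"], idf)
sim_cd = weighted_cosine(toy_gene2go["C"], toy_gene2go["D"], idf)
```

In this toy setup the pair sharing the rarer term (A, B) scores well above the pair sharing only the ubiquitous term (C, D), and the ratio between the two scores is larger than under plain Jaccard (0.5 versus 0.2).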

JulianRein commented 1 year ago

@giladmishne Thanks, great idea. I was already removing large terms since they covered too much, but this is much better than filtering by a cutoff.
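
For comparison, the size-based filtering mentioned here could look like the sketch below. `drop_large_terms` and `max_genes` are hypothetical names, the threshold is arbitrary, and the data is invented. A hard cutoff removes a large term entirely, whereas frequency weighting only shrinks its contribution.

```python
from collections import Counter


def drop_large_terms(gene2go, max_genes):
    """Remove every term that annotates more than max_genes genes."""
    term_sizes = Counter(term for terms in gene2go.values() for term in set(terms))
    return {
        gene: {term for term in terms if term_sizes[term] <= max_genes}
        for gene, terms in gene2go.items()
    }


# Invented annotations: "GO:common" annotates all four genes.
toy_gene2go = {
    "A": {"GO:common", "GO:rare", "GO:a"},
    "B": {"GO:common", "GO:rare", "GO:b"},
    "C": {"GO:common", "GO:c1", "GO:c2"},
    "D": {"GO:common", "GO:d1", "GO:d2"},
}

filtered = drop_large_terms(toy_gene2go, max_genes=3)
```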