Closed JulianRein closed 1 year ago
Thanks for your comment. We actually tested this out and indeed varying the GO-hierarchy did impact the predicted perturbation outcomes. In our case, the difference wasn't too significant but this may vary depending upon specific use cases and gene sets being perturbed.
@JulianRein you can also consider weighting terms by frequency -- I found that something like the following works a bit better in get_go:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

genes = list(gene2go)
# One "document" per gene: its GO terms joined by spaces.
df = pd.DataFrame({'terms': [' '.join(v) for v in gene2go.values()]}, index=genes)
# Keep each GO ID as one token; the default token pattern would split IDs such as 'GO:0008150' at the colon.
tfidf = TfidfVectorizer(token_pattern=r'\S+')
# Get GO term weights.
x = tfidf.fit_transform(df.terms)
# Pairwise weighted similarity (cosine, since rows are L2-normalized by default).
df_sim = pd.DataFrame((x @ x.T).toarray(), index=genes, columns=genes).round(5).stack()
# Convert to the same format as `edge_list`; threshold may need to be different from 0.1.
tf_idf_edge_list = df_sim[df_sim > 0.1].rename_axis(('source', 'target')).reset_index(name='importance')
@giladmishne Thanks, great idea. I was already removing large terms since they covered too much, but this is way better than cutoff filtering.
Hi, I noticed that the current GO-network is constructed from the overlap (Jaccard index) between genes over ALL GO-terms, regardless of GO-hierarchy (biological process, cellular component, molecular function). This leads to, e.g., isoforms having a very high score, since they naturally overlap in all 3 categories together. However, connections by pathway seem to be drowned out, at least in my example: even direct partners in the (well-researched and basic) glycolysis¹ pathway, like HK (e.g. HK1) and PGI, do not make the cutoff value (0.1). You also write that you tried other networks in place of GO, like "protein-protein interaction network [21], a gene coessentiality network [41] or the gene co-expression network". So I would be interested in your take on this:
Do you think a separation of GO-networks by hierarchy (and maybe some filtering for term size) and the resulting highlighting of direct interaction partners would help the gene-gene transferability of predictions/perturbations?
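To make the proposal concrete, here is a minimal sketch of building one Jaccard edge list per hierarchy, with an optional term-size filter. It is not the repository's implementation: the gene names, term IDs, the nested `gene2go_by_ns` layout, and the helper names (`drop_large_terms`, `edge_list`, `split_by_namespace`) are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def drop_large_terms(gene2terms, max_genes):
    """Remove terms annotated to more than `max_genes` genes."""
    counts = Counter(t for terms in gene2terms.values() for t in terms)
    return {g: {t for t in terms if counts[t] <= max_genes}
            for g, terms in gene2terms.items()}


def edge_list(gene2terms, cutoff=0.1):
    """(source, target, importance) for gene pairs with Jaccard > cutoff."""
    edges = []
    for g1, g2 in combinations(sorted(gene2terms), 2):
        score = jaccard(gene2terms[g1], gene2terms[g2])
        if score > cutoff:
            edges.append((g1, g2, round(score, 5)))
    return edges


def split_by_namespace(gene2go_by_ns):
    """Turn gene -> {namespace -> terms} into namespace -> {gene -> terms}."""
    out = {}
    for gene, by_ns in gene2go_by_ns.items():
        for ns, terms in by_ns.items():
            out.setdefault(ns, {})[gene] = set(terms)
    return out


def ids(prefix, start, stop):
    """Placeholder term IDs such as 'cc0', 'cc1', ..."""
    return {f"{prefix}{i}" for i in range(start, stop)}


# Made-up annotations (not real GO data). geneA and geneB share a pathway
# term but nothing else; geneA_iso mimics an isoform of geneA; geneC only
# shares one very broad term with the others.
gene2go_by_ns = {
    "geneA":     {"BP": {"bp_path", "bp_broad"}, "CC": ids("cc", 0, 5), "MF": ids("mf", 0, 5)},
    "geneA_iso": {"BP": {"bp_path", "bp_broad"}, "CC": ids("cc", 0, 5), "MF": ids("mf", 0, 5)},
    "geneB":     {"BP": {"bp_path", "bp_broad"}, "CC": ids("cc", 5, 10), "MF": ids("mf", 5, 10)},
    "geneC":     {"BP": {"bp_broad"}, "CC": ids("cc", 10, 15), "MF": ids("mf", 10, 15)},
}

# Pooled over all three hierarchies: only the isoform-like pair passes 0.1.
pooled = {g: set().union(*by_ns.values()) for g, by_ns in gene2go_by_ns.items()}
pooled_edges = edge_list(pooled)

# Biological process only, after dropping terms annotated to > 3 genes.
by_ns = split_by_namespace(gene2go_by_ns)
bp_edges = edge_list(drop_large_terms(by_ns["BP"], max_genes=3))
```

In this toy data the pooled network links only geneA and geneA_iso, while the filtered biological-process network also links geneA and geneB through their shared pathway term and leaves geneC unconnected. Whether real annotations behave the same way depends on the gene set and the chosen size limit.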
Thank you!
PS: If you are interested in ~40% faster code (or a lot faster with joblib.Parallel) for the GO network generation, let me know. I can send a merge request.
¹ https://en.wikipedia.org/wiki/Glycolysis