krishnanlab / obnb

A Python toolkit for setting up benchmarking datasets using biomedical networks
https://obnb.readthedocs.io
MIT License

multiple connected components in mondo ontology #495

Open kmanpearl opened 2 months ago

kmanpearl commented 2 months ago

I expected the processed MONDO ontology to have a single connected component, but it does not. I am not sure whether I am misunderstanding the processing steps, missing a function call or argument in my code (shown below), or hitting a bug.

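from obnb.data import MondoDiseaseOntology

# Load the processed MONDO ontology from the local obnb data directory
root = '../data/obnb/FullyRedundant'
dat = MondoDiseaseOntology(root=root)
g = dat.data

# Count connected components, ignoring edge direction
undirected = g.to_undirected_sparse_graph()
len(undirected.connected_components())
# 3136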

Of these 3136 components, only two contain any edges: one with ~23k nodes and one with 41 nodes. The remaining ~3k components are isolated nodes with no edges in the ontology.
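A breakdown like this can be reproduced from any node list and edge list. The following is a minimal sketch in plain Python, not obnb's API: `component_sizes` and the toy data are hypothetical names made up for illustration, and the real MONDO nodes and edges would need to be exported from the graph object first.

```python
from collections import Counter, defaultdict, deque


def component_sizes(nodes, edges):
    """Return connected component sizes (largest first), treating edges as undirected."""
    # Build an undirected adjacency map from the edge list.
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    seen = set()
    sizes = []
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search over everything reachable from `start`.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        sizes.append(size)
    return sorted(sizes, reverse=True)


# Toy data (hypothetical): two components with edges, plus two isolated nodes.
toy_nodes = ["A", "B", "C", "D", "E", "F", "G"]
toy_edges = [("A", "B"), ("B", "C"), ("D", "E")]

sizes = component_sizes(toy_nodes, toy_edges)
# Maps component size to the number of components of that size.
size_histogram = Counter(sizes)
```

In the histogram, the count stored under size 1 is the number of isolated nodes, which is how the ~3k edge-free terms show up.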

This caused a further problem: the term MONDO:0006560 has gene annotations in obnb but no ontology edges, so when node embeddings are built from an edge list, the term is not treated as part of the ontology. I had to manually remove this term from the gene set collection before I could use my net2onto method with MONDO.
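The mismatch arises because an edge list can only mention nodes that have at least one edge. Below is a minimal sketch of the manual workaround, assuming the annotations are available as a plain dict from term ID to gene set; `split_annotated_terms` and the toy IDs are hypothetical and do not come from obnb.

```python
def nodes_in_edge_list(edges):
    """Collect every node that appears as an endpoint of at least one edge."""
    return {node for edge in edges for node in edge}


def split_annotated_terms(annotations, edges):
    """Separate annotated terms into those covered by the edge list and those not."""
    covered = nodes_in_edge_list(edges)
    kept = {term: genes for term, genes in annotations.items() if term in covered}
    dropped = sorted(set(annotations) - covered)
    return kept, dropped


# Toy data (hypothetical IDs): TERM:9 is annotated but has no ontology edges.
toy_edges = [("TERM:1", "TERM:2"), ("TERM:2", "TERM:3")]
toy_annotations = {
    "TERM:2": {"GENE_A", "GENE_B"},
    "TERM:9": {"GENE_C"},
}

kept, dropped = split_annotated_terms(toy_annotations, toy_edges)
```

Here `dropped` lists the annotated terms that would have no embedding, which is the situation MONDO:0006560 is in.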

If this filtering is simply not implemented, could we add a feature that restricts an ontology to its largest connected component? If it is a bug, could it be fixed? And if I am missing something in my code, please let me know the proper way to process the ontology.
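As a rough illustration of the requested filter, here is a minimal union-find sketch that keeps only the largest connected component of an edge list. It is an assumption about how such a feature could behave, not obnb's implementation; `largest_component_edges` is a hypothetical helper.

```python
from collections import Counter


def largest_component_edges(edges):
    """Return (nodes, edges) of the largest connected component, ignoring direction."""
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then follow parents with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the components of the two endpoints of every edge.
    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    if not parent:
        return set(), []

    # Count nodes per root and keep the most populated component.
    counts = Counter(find(node) for node in list(parent))
    top_root, _ = counts.most_common(1)[0]
    keep_nodes = {node for node in list(parent) if find(node) == top_root}
    keep_edges = [(u, v) for u, v in edges if u in keep_nodes]
    return keep_nodes, keep_edges


# Toy data (hypothetical): a 3-node component and a 2-node component.
toy_edges = [("A", "B"), ("B", "C"), ("C", "A"), ("D", "E")]
keep_nodes, keep_edges = largest_component_edges(toy_edges)
```

Isolated nodes never appear in an edge list, so they are excluded automatically. Any gene set collection would then need to be restricted to `keep_nodes` as well.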